path: root/net/core
Age    Commit message    Author    Files    Lines
2014-03-31net-sysfs: expose number of carrier on/off changesdavid decotigny2-1/+16
This allows monitoring of carrier on/off transitions and detection of link flapping issues:
 - new /sys/class/net/X/carrier_changes
 - new rtnetlink IFLA_CARRIER_CHANGES (getlink)
Tested:
 - grep . /sys/class/net/*/carrier_changes
   + ip link set dev X down/up
   + plug/unplug cable
 - updated iproute2: prints IFLA_CARRIER_CHANGES
 - iproute2 20121211-2 (debian): unchanged behavior
Signed-off-by: David Decotigny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
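A userspace monitor can read the new attribute directly from sysfs. A minimal sketch, assuming an interface named eth0 (any interface name works):

    /* print how many times the carrier state of eth0 has changed */
    #include <stdio.h>

    int main(void)
    {
            unsigned long changes;
            FILE *f = fopen("/sys/class/net/eth0/carrier_changes", "r");

            if (!f) {
                    perror("carrier_changes");
                    return 1;
            }
            if (fscanf(f, "%lu", &changes) != 1) {
                    fclose(f);
                    return 1;
            }
            fclose(f);
            printf("carrier changed %lu times\n", changes);
            return 0;
    }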
2014-03-31net: Allow modules to use is_skb_forwardableVlad Yasevich1-2/+2
Signed-off-by: Vlad Yasevich <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-31net: filter: rework/optimize internal BPF interpreter's instruction setAlexei Starovoitov1-300/+1157
This patch replaces/reworks the kernel-internal BPF interpreter with an optimized BPF instruction set format that is modelled closer to mimic native instruction sets and is designed to be JITed with one to one mapping. Thus, the new interpreter is noticeably faster than the current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

   BPF jump instructions are forced to go either 'true' or 'false' branch which causes branch-miss penalty. The new BPF jump instructions have only one branch and fall-through otherwise, which fits the CPU branch predictor logic better. `perf stat` shows drastic difference for branch-misses between the old and new code.

2. Jump-threaded implementation of interpreter vs switch statement:

   Instead of single table-jump at the top of 'switch' statement, gcc will now generate multiple table-jump instructions, which helps CPU branch predictor logic.

Note that the verification of filters is still being done through sk_chk_filter() in classical BPF format, so filters from user- or kernel space are verified in the same way as we do now, and same restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would even be fine as is, but nevertheless allows for a successive upgrade of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the probing for JIT compilation, so in case JIT compilers are able to create a native opcode image, we're going to use that, and in all other cases we're doing a follow-up migration of the BPF program's instruction set, so that it can be transparently run in the new interpreter.

In short, the *internal* format extends BPF in the following way (more details can be taken from the appended documentation):

 - Number of registers increase from 2 to 10
 - Register width increases from 32-bit to 64-bit
 - Conditional jt/jf targets replaced with jt/fall-through
 - Adds signed > and >= insns
 - 16 4-byte stack slots for register spill-fill replaced with up to 512 bytes of multi-use stack space
 - Introduction of bpf_call insn and register passing convention for zero overhead calls from/to other kernel functions
 - Adds arithmetic right shift and endianness conversion insns
 - Adds atomic_add insn
 - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm was measured on x86_64, i386 and arm32 (other libpcap programs have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
  tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
  tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call, smaller is better:

--x86_64--
              fprog #1    fprog #1    fprog #2    fprog #2
              cache-hit   cache-miss  cache-hit   cache-miss
old BPF            90         101         192         202
new BPF            31          71          47          97
old BPF jit        12          34          17          44
new BPF jit       TBD

--i386--
              fprog #1    fprog #1    fprog #2    fprog #2
              cache-hit   cache-miss  cache-hit   cache-miss
old BPF           107         136         227         252
new BPF            40         119          69         172

--arm32--
              fprog #1    fprog #1    fprog #2    fprog #2
              cache-hit   cache-miss  cache-hit   cache-miss
old BPF           202         300         475         540
new BPF           180         270         330         470
old BPF jit        26         182          37         202
new BPF jit       TBD

Thus, without changing any userland BPF filters, applications on top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf classifier, netfilter's xt_bpf, team driver's load-balancing mode, and many more will have better interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need to convert seccomp BPF in the same step to make use of the new internal structure since it makes use of lower-level API details without being further decoupled through higher-level calls like sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences a time-to-verdict speedup. 05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...
  rc = seccomp_load(ctx);
  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

'short filter' has 2 rules
'large filter' has 200 rules

'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no difference on arm32

--x86_64-- short filter

old BPF: 2.7 sec
 39.12% bench libc-2.15.so       [.] syscall
  8.10% bench [kernel.kallsyms]  [k] sk_run_filter
  6.31% bench [kernel.kallsyms]  [k] system_call
  5.59% bench [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37% bench [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70% bench [kernel.kallsyms]  [k] __secure_computing
  3.67% bench [kernel.kallsyms]  [k] lock_is_held
  3.03% bench [kernel.kallsyms]  [k] seccomp_bpf_load

new BPF: 2.58 sec
 42.05% bench libc-2.15.so       [.] syscall
  6.91% bench [kernel.kallsyms]  [k] system_call
  6.25% bench [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07% bench [kernel.kallsyms]  [k] __secure_computing
  5.08% bench [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter

old BPF: 4.0 sec
 39.92% bench [kernel.kallsyms]  [k] vector_swi
 16.60% bench [kernel.kallsyms]  [k] sk_run_filter
 14.66% bench libc-2.17.so       [.] syscall
  5.42% bench [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10% bench [kernel.kallsyms]  [k] __secure_computing

new BPF: 3.7 sec
 35.93% bench [kernel.kallsyms]  [k] vector_swi
 21.89% bench libc-2.17.so       [.] syscall
 13.45% bench [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25% bench [kernel.kallsyms]  [k] __secure_computing
  3.96% bench [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter

old BPF: 8.6 seconds
 73.38% bench [kernel.kallsyms]  [k] sk_run_filter
 10.70% bench libc-2.15.so       [.] syscall
  5.09% bench [kernel.kallsyms]  [k] seccomp_bpf_load
  1.97% bench [kernel.kallsyms]  [k] system_call

new BPF: 5.7 seconds
 66.20% bench [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 16.75% bench libc-2.15.so       [.] syscall
  3.31% bench [kernel.kallsyms]  [k] system_call
  2.88% bench [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter

old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter

old BPF: 13.5 sec
 73.88% bench [kernel.kallsyms]  [k] sk_run_filter
 10.29% bench [kernel.kallsyms]  [k] vector_swi
  6.46% bench libc-2.17.so       [.] syscall
  2.94% bench [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19% bench [kernel.kallsyms]  [k] __secure_computing
  0.87% bench [kernel.kallsyms]  [k] sys_getuid

new BPF: 13.5 sec
 76.08% bench [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98% bench [kernel.kallsyms]  [k] vector_swi
  5.87% bench libc-2.17.so       [.] syscall
  1.77% bench [kernel.kallsyms]  [k] __secure_computing
  0.93% bench [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new internal BPF performance is better than the old one. Performance gains will be even higher when BPF JIT is committed for the new structure, which is planned in future work (as successive JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Hagen Paul Pfeifer <[email protected]> Cc: Kees Cook <[email protected]> Cc: Paul Moore <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Acked-by: Kees Cook <[email protected]> Signed-off-by: David S. Miller <[email protected]>
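Existing classic BPF users need no changes for the new internal format: programs are still loaded and verified in classic form and only converted internally. As a hedged illustration of what such an unchanged userspace program looks like, here is a trivial accept-everything socket filter attached via SO_ATTACH_FILTER (the filter contents are placeholders, not one of the benchmarked programs):

    /* attach a minimal "accept up to 0xffff bytes" classic BPF filter */
    #include <linux/filter.h>
    #include <sys/socket.h>
    #include <stdio.h>

    int main(void)
    {
            struct sock_filter insns[] = {
                    BPF_STMT(BPF_RET | BPF_K, 0xffff),
            };
            struct sock_fprog prog = {
                    .len    = sizeof(insns) / sizeof(insns[0]),
                    .filter = insns,
            };
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                                     &prog, sizeof(prog)) < 0) {
                    perror("SO_ATTACH_FILTER");
                    return 1;
            }
            printf("classic BPF filter attached\n");
            return 0;
    }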
2014-03-31net: ptp: do not reimplement PTP/BPF classifierDaniel Borkmann1-1/+7
There are currently pch_gbe, cpts, and ixp4xx_eth drivers that open-code and reimplement a BPF classifier for the PTP protocol. Since all of them effectively do the very same thing and load the very same PTP/BPF filter, we can just consolidate that code by introducing ptp_classify_raw() in the time-stamping core framework which can be used in drivers. As drivers get initialized after bootstrapping the core networking subsystem, they can make use of ptp_insns wrapped through ptp_classify_raw(), which allows us to simplify and remove PTP classifier setup code in drivers. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Jiri Benc <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-31net: ptp: use sk_unattached_filter_create() for BPFDaniel Borkmann1-7/+14
This patch replaces an open-coded sk_run_filter() implementation with proper use of the BPF API, that is, sk_unattached_filter_create(). This migration is needed, as we will be internally transforming the filter to a different representation, and it therefore needs to be decoupled. It is okay to do so as skb_timestamping_init() is called during initialization of the network stack in core initcall via sock_init(). This would effectively also allow for PTP filters to be jit compiled if bpf_jit_enable is set. For better readability, some newlines are also introduced; also, ptp_classify.h is only used in kernel space. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Jiri Benc <[email protected]> Signed-off-by: David S. Miller <[email protected]>
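A rough kernel-side sketch of the pattern this change moves to, assuming the sk_unattached_filter_create()/sk_unattached_filter_destroy() and SK_RUN_FILTER() interfaces referenced in this series (the function names and the insns argument are placeholders):

    /* build an unattached BPF filter once at init, then run it on skbs */
    #include <linux/filter.h>
    #include <linux/skbuff.h>

    static struct sk_filter *ptp_filter_sketch;

    static int ptp_filter_sketch_init(struct sock_filter *insns, unsigned int len)
    {
            struct sock_fprog prog = {
                    .len    = len,
                    .filter = insns,
            };

            /* may be jit compiled if bpf_jit_enable is set */
            return sk_unattached_filter_create(&ptp_filter_sketch, &prog);
    }

    static unsigned int ptp_classify_sketch(const struct sk_buff *skb)
    {
            /* classify a received skb with the prepared filter */
            return SK_RUN_FILTER(ptp_filter_sketch, skb);
    }

    static void ptp_filter_sketch_exit(void)
    {
            sk_unattached_filter_destroy(ptp_filter_sketch);
    }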
2014-03-31net: filter: move filter accounting to filter coreDaniel Borkmann1-2/+25
This patch basically does two things: i) it removes the extern keyword from the include/linux/filter.h file to be more consistent with the rest of Joe's changes, and ii) it moves filter accounting into the filter core framework. Filter accounting, mainly done through sk_filter_{un,}charge(), takes care of the case when sockets are being cloned through sk_clone_lock(), so that removal of the filter on one socket won't result in eviction while it is still referenced by the other. These functions actually belong to net/core/filter.c and not include/net/sock.h as we want to keep all that in a central place. It's also not in fast-path so uninlining them is fine and even allows us to get rid of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward declaration. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-31net: filter: keep original BPF program aroundDaniel Borkmann2-29/+80
In order to open up the possibility to internally transform a BPF program into an alternative and possibly non-trivial reversible representation, we need to keep the original BPF program around, so that it can be passed back to user space w/o the need of a complex decoder. The reason for that use case resides in commit a8fc92778080 ("sk-filter: Add ability to get socket filter program (v2)"), that is, the ability to retrieve the currently attached BPF filter from a given socket used mainly by the checkpoint-restore project, for example. Therefore, we add two helpers sk_{store,release}_orig_filter for taking care of that. In the sk_unattached_filter_create() case, there's no such possibility/requirement to retrieve a loaded BPF program. Therefore, we can spare us the work in that case. This approach will simplify and slightly speed up both the sk_get_filter() and sock_diag_put_filterinfo() handlers as we won't need to successively decode filters anymore through sk_decode_filter(). As we still need sk_decode_filter() later on, we're keeping it around. Joint work with Alexei Starovoitov. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-31net: filter: add jited flag to indicate jit compiled filtersDaniel Borkmann1-0/+1
This patch adds a jited flag into sk_filter struct in order to indicate whether a filter is currently jited or not. The size of sk_filter is not being expanded as the 32 bit 'len' member allows upper bits to be reused since a filter can currently only grow as large as BPF_MAXINSNS. Therefore, there is also enough room for other flags needed in the future to reuse the 'len' field if necessary. The jited flag also allows for having alternative interpreter functions running, as currently we can only detect jit compiled filters by testing that fp->bpf_func does not equal the address of sk_run_filter(). Joint work with Alexei Starovoitov. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Pablo Neira Ayuso <[email protected]> Signed-off-by: David S. Miller <[email protected]>
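A hedged before/after sketch of the detection change described above (field and function names follow the commit text, not the literal patch):

    #include <linux/filter.h>

    /* before: jit compiled filters can only be inferred indirectly */
    static bool filter_is_jited_old(const struct sk_filter *fp)
    {
            return fp->bpf_func != sk_run_filter;
    }

    /* after: the dedicated flag answers the question directly and keeps
     * working even with alternative interpreter functions */
    static bool filter_is_jited_new(const struct sk_filter *fp)
    {
            return fp->jited;
    }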
2014-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-12/+31
Conflicts: drivers/net/ethernet/marvell/mvneta.c The mvneta.c conflict is a case of overlapping changes, a conversion to devm_ioremap_resource() vs. a conversion to netdev_alloc_pcpu_stats. Signed-off-by: David S. Miller <[email protected]>
2014-03-29netpoll: Respect NETIF_F_LLTXEric W. Biederman1-5/+5
Stop taking the transmit lock when a network device has specified NETIF_F_LLTX. If no locks are needed to transmit a packet, this is the ideal scenario for netpoll as all packets can be transmitted immediately. Even if some locks are needed in ndo_start_xmit, skipping any unnecessary serialization is desirable for netpoll as it makes it more likely a debugging packet may be transmitted immediately instead of being deferred until later. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-29netpoll: Remove strong unnecessary assumptions about skbsEric W. Biederman1-2/+2
Remove the assumption that the skbs that make it to netpoll_send_skb_on_dev are allocated with find_skb, such that skb->users == 1 and nothing is attached that would prevent the skbs from being freed from hard irq context. Remove this assumption by replacing __kfree_skb on error paths with dev_kfree_skb_irq (in hard irq context) and kfree_skb (in process context). Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
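A hedged sketch of the context-sensitive free pattern the commit describes (the exact condition used in the patch may differ):

    #include <linux/hardirq.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* free an skb from either hard irq or process context */
    static void netpoll_free_skb_sketch(struct sk_buff *skb)
    {
            if (in_irq() || irqs_disabled())
                    /* hard irq context: defer the actual free */
                    dev_kfree_skb_irq(skb);
            else
                    /* process context: free immediately */
                    kfree_skb(skb);
    }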
2014-03-29netpoll: Rename netpoll_rx_enable/disable to netpoll_poll_disable/enableEric W. Biederman2-8/+8
The netpoll_rx_enable and netpoll_rx_disable functions have always controlled polling the network drivers transmit and receive queues. Rename them to netpoll_poll_enable and netpoll_poll_disable to make their functionality clear. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-29netpoll: Move rx enable/disable into __dev_close_manyEric W. Biederman1-9/+4
Today netpoll_rx_enable and netpoll_rx_disable are called from dev_close and __dev_close, and not from dev_close_many. Move the calls into __dev_close_many so that we have a single call site to maintain, and so that dev_close_many gains this protection as well, which importantly makes batched network device deletes safe. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-29netpoll: Only call ndo_start_xmit from a single placeEric W. Biederman1-25/+36
Factor out the code that needs to surround ndo_start_xmit from netpoll_send_skb_on_dev into netpoll_start_xmit. It is an unfortunate fact that as the netpoll code has been maintained, the primary call site of ndo_start_xmit learned how to handle vlans and timestamps, but the second call of ndo_start_xmit in queue_process did not. With the introduction of netpoll_start_xmit this associated logic now happens at both call sites of ndo_start_xmit and should make it easy for that to continue into the future. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-29netpoll: Remove gfp parameter from __netpoll_setupEric W. Biederman1-4/+4
The gfp parameter was added in:

  commit 47be03a28cc6c80e3aa2b3e8ed6d960ff0c5c0af
  Author: Amerigo Wang <[email protected]>
  Date:   Fri Aug 10 01:24:37 2012 +0000

    netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()

    slave_enable_netpoll() and __netpoll_setup() may be called with read_lock() held, so should use GFP_ATOMIC to allocate memory. Eric suggested to pass gfp flags to __netpoll_setup().

    Cc: Eric Dumazet <[email protected]>
    Cc: "David S. Miller" <[email protected]>
    Reported-by: Dan Carpenter <[email protected]>
    Signed-off-by: Eric Dumazet <[email protected]>
    Signed-off-by: Cong Wang <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

The reason for the gfp parameter was removed in:

  commit c4cdef9b7183159c23c7302aaf270d64c549f557
  Author: dingtianhong <[email protected]>
  Date:   Tue Jul 23 15:25:27 2013 +0800

    bonding: don't call slave_xxx_netpoll under spinlocks

    The slave_xxx_netpoll will call synchronize_rcu_bh(), so the function may schedule and sleep, it should't be called under spinlocks. bond_netpoll_setup() and bond_netpoll_cleanup() are always protected by rtnl lock, it is no need to take the read lock, as the slave list couldn't be changed outside rtnl lock.

    Signed-off-by: Ding Tianhong <[email protected]>
    Cc: Jay Vosburgh <[email protected]>
    Cc: Andy Gospodarek <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

Nothing else that calls __netpoll_setup or ndo_netpoll_setup requires a gfp parameter, so remove the gfp parameter from both of these functions making the code clearer. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-28net: Account for all vlan headers in skb_mac_gso_segmentVlad Yasevich2-5/+11
skb_network_protocol() already accounts for multiple vlan headers that may be present in the skb. However, skb_mac_gso_segment() doesn't know anything about it and assumes that skb->mac_len is set correctly to skip all mac headers. That may not always be the case. If we are simply forwarding the packet (via bridge or macvtap), all vlan headers may not be accounted for. A simple solution is to allow skb_network_protocol to return the vlan depth it has calculated. This way skb_mac_gso_segment will correctly skip all mac headers. Signed-off-by: Vlad Yasevich <[email protected]> Signed-off-by: David S. Miller <[email protected]>
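A hedged sketch of the idea: have the helper report the vlan depth it walked, so the GSO path pulls past all mac and vlan headers instead of trusting skb->mac_len (signatures approximated from the description, not copied from the patch):

    /* sketch of skb_mac_gso_segment() consuming a returned vlan depth */
    #include <linux/netdevice.h>
    #include <linux/err.h>

    static struct sk_buff *mac_gso_segment_sketch(struct sk_buff *skb,
                                                  netdev_features_t features)
    {
            int vlan_depth = skb->mac_len;
            /* skb_network_protocol() walks any vlan headers and reports how
             * many bytes sit in front of the network header */
            __be16 type = skb_network_protocol(skb, &vlan_depth);

            if (unlikely(!type))
                    return ERR_PTR(-EINVAL);

            /* skip the real header length, not just skb->mac_len */
            __skb_pull(skb, vlan_depth);

            /* ... hand off to the protocol's gso_segment callback ... */
            return NULL;
    }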
2014-03-28net: net: add a core netdev->tx_dropped counterEric Dumazet1-0/+2
Dropping packets in __dev_queue_xmit() when transmit queue is stopped (NIC TX ring buffer full or BQL limit reached) currently outputs a syslog message. It would be better to get a precise count of such events available in netdevice stats so that monitoring tools can have a clue. This extends the work done in caf586e5f23ce ("net: add a core netdev->rx_dropped counter") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
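A hedged sketch of what such a per-device drop counter can look like (names follow the description; the actual patch may differ in detail):

    /* count drops in the xmit path and fold them into the device stats */
    #include <linux/netdevice.h>

    static void xmit_drop_sketch(struct sk_buff *skb, struct net_device *dev)
    {
            /* lock-free increment, usable from any context */
            atomic_long_inc(&dev->tx_dropped);
            kfree_skb(skb);
    }

    static void fold_tx_dropped_sketch(struct net_device *dev,
                                       struct rtnl_link_stats64 *storage)
    {
            /* exported alongside the other device counters */
            storage->tx_dropped += atomic_long_read(&dev->tx_dropped);
    }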
2014-03-27core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errorsZoltan Kiss1-7/+20
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modifies the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-27net/core: Use RCU_INIT_POINTER(x, NULL) in netpoll.cMonam Agarwal1-1/+1
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <[email protected]> Signed-off-by: David S. Miller <[email protected]>
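A hedged illustration of the substitution (struct and field names are made up for the example):

    #include <linux/rcupdate.h>

    struct bar;

    struct foo_sketch {
            struct bar __rcu *ptr;
    };

    static void clear_ptr_sketch(struct foo_sketch *f)
    {
            /* rcu_assign_pointer(f->ptr, NULL) would add a barrier to order
             * initialization of the pointed-to object -- pointless for NULL,
             * so the cheaper initializer is sufficient */
            RCU_INIT_POINTER(f->ptr, NULL);
    }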
2014-03-26net: Rename skb->rxhash to skb->hashTom Herbert3-12/+13
The packet hash can be considered a property of the packet, not just of the RX path. This patch changes the names of the rxhash and l4_rxhash skbuff fields to hash and l4_hash respectively. This includes changing uses of the fields in code which doesn't call the access functions. Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Mahesh Bandewar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-4/+6
Conflicts: Documentation/devicetree/bindings/net/micrel-ks8851.txt net/core/netpoll.c The net/core/netpoll.c conflict is a bug fix in 'net' happening to code which is completely removed in 'net-next'. In micrel-ks8851.txt we simply have overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2014-03-24netpoll: fix the skb check in pkt_is_nsLi RongQing1-1/+1
Neighbor Solicitation is an IPv6 protocol message, so we should check skb->protocol against ETH_P_IPV6. Signed-off-by: Li RongQing <[email protected]> Cc: WANG Cong <[email protected]> Signed-off-by: David S. Miller <[email protected]>
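A hedged sketch of the corrected check (the surrounding Neighbour Solicitation validation is elided):

    #include <linux/skbuff.h>
    #include <linux/if_ether.h>

    static bool pkt_is_ns_sketch(struct sk_buff *skb)
    {
            /* Neighbour Solicitation is carried over IPv6, not IPv4 */
            if (skb->protocol != htons(ETH_P_IPV6))
                    return false;

            /* ... further ICMPv6 / Neighbour Solicitation checks ... */
            return true;
    }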
2014-03-20rtnetlink: fix fdb notification flagsNicolas Dichtel1-4/+6
Commit 3ff661c38c84 ("net: rtnetlink notify events for FDB NTF_SELF adds and deletes") reuses the function nlmsg_populate_fdb_fill() to notify fdb events. But this function was used only for dump and thus was always setting the flag NLM_F_MULTI, which is wrong in case of a single notification. Libraries like libnl will wait forever for NLMSG_DONE. CC: Thomas Graf <[email protected]> Signed-off-by: Nicolas Dichtel <[email protected]> Acked-by: Thomas Graf <[email protected]> Signed-off-by: David S. Miller <[email protected]>
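A hedged sketch of the kind of change described: threading the caller's netlink flags through to nlmsg_put() instead of hard-coding NLM_F_MULTI (helper name taken from the commit text, signature approximated):

    #include <net/netlink.h>

    /* let callers choose the nlmsg flags: dumps pass NLM_F_MULTI, single
     * notifications pass 0 so libraries don't wait for NLMSG_DONE */
    static int populate_fdb_fill_sketch(struct sk_buff *skb, u32 portid,
                                        u32 seq, int type, unsigned int flags)
    {
            struct nlmsghdr *nlh;

            nlh = nlmsg_put(skb, portid, seq, type, 0, flags);
            if (!nlh)
                    return -EMSGSIZE;

            /* ... fill the fdb netlink attributes here ... */

            nlmsg_end(skb, nlh);
            return 0;
    }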
2014-03-20net/core/flow.c: Fix CPU hotplug callback registrationSrivatsa S. Bhat1-1/+7
Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below:

  get_online_cpus();

  for_each_online_cpu(cpu)
          init_cpu(cpu);

  register_cpu_notifier(&foobar_cpu_notifier);

  put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is:

  cpu_notifier_register_begin();

  for_each_online_cpu(cpu)
          init_cpu(cpu);

  /* Note the use of the double underscored version of the API */
  __register_cpu_notifier(&foobar_cpu_notifier);

  cpu_notifier_register_done();

Fix the code in net/core/flow.c by using this latter form of callback registration. Cc: Li RongQing <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Ingo Molnar <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Srivatsa S. Bhat <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-03-19cgroup: drop const from @buffer of cftype->write_string()Tejun Heo1-1/+1
cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]>
2014-03-17netpoll: Remove dead packet receive code (CONFIG_NETPOLL_TRAP)Eric W. Biederman2-529/+2
The netpoll packet receive code only becomes active if the netpoll rx_skb_hook is implemented, and there is not a single implementation of the netpoll rx_skb_hook in the kernel. All of the out of tree implementations I have found all call netpoll_poll which was removed from the kernel in 2011, so this change should not add any additional breakage. There are problems with the netpoll packet receive code. __netpoll_rx does not call dev_kfree_skb_irq or dev_kfree_skb_any in hard irq context. netpoll_neigh_reply leaks every skb it receives. Reception of packets does not work successfully on stacked devices (aka bonding, team, bridge, and vlans). Given that the netpoll packet receive code is buggy, there are no out of tree users that will be merged soon, and the code has not been used in tree for a decade, let's just remove it. Reverting this commit can serve as a starting point for anyone who wants to resurrect netpoll packet reception support. Acked-by: Eric Dumazet <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Move all receive processing under CONFIG_NETPOLL_TRAPEric W. Biederman1-17/+64
Make rx_skb_hook and rx in struct netpoll depend on CONFIG_NETPOLL_TRAP. Make rx_lock, rx_np, and neigh_tx in struct netpoll_info depend on CONFIG_NETPOLL_TRAP. Make the functions netpoll_rx_on, netpoll_rx, and netpoll_receive_skb no-ops when CONFIG_NETPOLL_TRAP is not set. Only build netpoll_neigh_reply, checksum_udp, service_neigh_queue, pkt_is_ns, and __netpoll_rx when CONFIG_NETPOLL_TRAP is defined. Add helper functions netpoll_trap_setup, netpoll_trap_setup_info, netpoll_trap_cleanup, and netpoll_trap_cleanup_info that initialize and clean up the struct netpoll and struct netpoll_info receive-specific fields when CONFIG_NETPOLL_TRAP is enabled and do nothing otherwise. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Consolidate neigh_tx processing in service_neigh_queueEric W. Biederman1-22/+16
Move the bond slave device neigh_tx handling into service_neigh_queue. In connection with neigh_tx processing, remove unnecessary tests of a NULL netpoll_info, as netpoll_poll_dev has already used, and thus verified the existence of, the netpoll_info. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Move netpoll_trap under CONFIG_NETPOLL_TRAPEric W. Biederman1-5/+9
Now that we no longer need to receive packets to safely drain the network drivers' receive queues, move netpoll_trap and netpoll_set_trap under CONFIG_NETPOLL_TRAP, making netpoll_trap and netpoll_set_trap no-op inline functions when CONFIG_NETPOLL_TRAP is not set. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Don't drop all received packets.Eric W. Biederman1-11/+6
Change the strategy of netpoll from dropping all packets received during netpoll_poll_dev to calling napi poll with a budget of 0 (to avoid processing the driver's rx queue), and to ignoring packets received with netif_rx (those will safely be placed on the backlog queue). All of the netpoll supporting drivers have been reviewed to ensure that either they use netif_rx, or that a budget of 0 is supported by their napi poll routine and a budget of 0 will not process the driver's rx queues. Not dropping packets makes NETPOLL_RX_DROP unnecessary so it is removed. npinfo->rx_flags is removed as rx_flags with just the NETPOLL_RX_ENABLED flag becomes just a redundant mirror of list_empty(&npinfo->rx_np). Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
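A hedged sketch of polling a driver with a zero budget so its rx ring is left untouched (helper name invented for the sketch):

    /* let the driver reap tx completions without processing rx packets */
    #include <linux/netdevice.h>

    static void poll_one_napi_sketch(struct napi_struct *napi)
    {
            int budget = 0;     /* 0: the poll routine must not touch rx */
            int work;

            if (!test_bit(NAPI_STATE_SCHED, &napi->state))
                    return;

            work = napi->poll(napi, budget);
            /* a correct driver never reports more work than it was given */
            WARN_ONCE(work > budget, "napi poll exceeded a zero budget\n");
    }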
2014-03-17netpoll: Add netpoll_rx_processingEric W. Biederman1-2/+2
Add a helper netpoll_rx_processing that reports when netpoll has receive side processing to perform. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Warn if more packets are processed than are budgetedEric W. Biederman1-0/+1
There is already a warning for this case in the normal netpoll path, but put a copy here in case the way netpoll calls the poll functions causes a different result. netpoll will shortly call the napi poll routine with a budget of 0 to avoid any rx packets being processed. As nothing does that today we may encounter drivers that have problems, so a netpoll specific warning seems desirable. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Visit all napi handlers in poll_napiEric W. Biederman1-3/+0
In poll_napi loop through all of the napi handlers even when the budget falls to 0 to ensure that we process all of the tx_queues, and so that we continue to call into drivers when our initial budget is 0. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: Pass budget into poll_napiEric W. Biederman1-3/+3
This moves the control logic to the top level in netpoll_poll_dev instead of having it dispersed throughout netpoll_poll_dev, poll_napi and poll_one_napi. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-17netpoll: move setting of NETPOLL_RX_DROP into netpoll_poll_devEric W. Biederman1-8/+8
Today netpoll depends on setting NETPOLL_RX_DROP before networking drivers receive packets in interrupt context so that the packets can be dropped. Move this setting into netpoll_poll_dev from poll_one_napi so that if ndo_poll_controller happens to receive packets we will drop the packets on the floor instead of letting the packets bounce through the networking stack and potentially cause problems. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-47/+58
Conflicts: drivers/net/usb/r8152.c drivers/net/xen-netback/netback.c Both the r8152 and netback conflicts were simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2014-03-13consolidate duplicate code in skb_checksum_setup() helpersJan Beulich1-90/+50
Realizing that the skb_maybe_pull_tail() calls in the IP-protocol specific portions of both helpers are terminal ones (i.e. no further pulls are expected), their maximum size to be pulled can be made match their minimal size needed, thus making the code identical and hence possible to be moved into another helper. Signed-off-by: Jan Beulich <[email protected]> Cc: Paul Durrant <[email protected]> Cc: David Miller <[email protected]> Cc: Eric Dumazet <[email protected]> Reviewed-by: Paul Durrant <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-12flowcache: Fix resource leaks on namespace exit.Steffen Klassert1-0/+19
We leak an active timer, the hotcpu notifier and all allocated resources when we exit a namespace. Fix this by introducing a flow_cache_fini() function where we release the resources before we exit. Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware") Reported-by: Jakub Kicinski <[email protected]> Tested-by: Jakub Kicinski <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Fan Du <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-12net: Convert uses of __constant_<foo> to <foo>Joe Perches2-12/+12
The use of __constant_<foo> has been unnecessary for quite a while now. Make these uses consistent with the rest of the kernel. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-11tcp: tcp_release_cb() should release socket ownershipEric Dumazet1-1/+4
Lars Persson reported the following deadlock:

 -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
 -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
 -002 |ip_local_deliver_finish(skb = 0x8BD527A0)
 -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
 -004 |netif_receive_skb(skb = 0x8BD527A0)
 -005 |elk_poll(napi = 0x8C770500, budget = 64)
 -006 |net_rx_action(?)
 -007 |__do_softirq()
 -008 |do_softirq()
 -009 |local_bh_enable()
 -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
 -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
 -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
 -013 |tcp_release_cb(sk = 0x8BE6B2A0)
 -014 |release_sock(sk = 0x8BE6B2A0)
 -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
 -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
 -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
 -018 |smb_send_kvec()
 -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
 -020 |cifs_call_async()
 -021 |cifs_async_writev(wdata = 0x87FD6580)
 -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
 -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
 -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
 -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
 -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
 -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
 -028 |bdi_writeback_workfn(work = 0x87E4A9CC)
 -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
 -030 |worker_thread(__worker = 0x8B045880)
 -031 |kthread(_create = 0x87CADD90)
 -032 |ret_from_kernel_thread(asm)

Bug occurs because __tcp_checksum_complete_user() enables BH, assuming it is running from softirq context. Lars' trace involved a NIC without RX checksum support but other points are problematic as well, like the prequeue stuff. Problem is triggered by a timer that found the socket being owned by user. tcp_release_cb() should call tcp_write_timer_handler() or tcp_delack_timer_handler() in the appropriate context: BH disabled and socket lock held, but 'owned' field cleared, as if they were running from timer handlers. Fixes: 6f458dfb4092 ("tcp: improve latencies of timer triggered events") Reported-by: Lars Persson <[email protected]> Tested-by: Lars Persson <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-11skbuff: skb_segment: orphan frags before copyingMichael S. Tsirkin1-0/+6
skb_segment copies frags around, so we need to copy them carefully to avoid accessing user memory after reporting completion to userspace through a callback. skb_segment doesn't normally happen on datapath: TSO needs to be disabled - so disabling zero copy in this case does not look like a big deal. Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Herbert Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
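A hedged sketch of the guard described above, placed where skb_segment() is about to reference frags from a source skb (error handling simplified; not the literal patch):

    /* make sure userspace zerocopy pages are not shared into segments */
    #include <linux/skbuff.h>

    static int copy_frags_sketch(struct sk_buff *nskb, struct sk_buff *frag_skb)
    {
            /* give frag_skb private copies of any userspace-backed pages
             * before their completion callback can be signalled */
            if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
                    return -ENOMEM;

            /* ... safe to reference frag_skb's frags from nskb now ... */
            return 0;
    }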
2014-03-11skbuff: skb_segment: s/fskb/list_skb/Michael S. Tsirkin1-13/+13
fskb is unrelated to frag: it's coming from frag_list. Rename it list_skb to avoid confusion. Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-11skbuff: skb_segment: s/skb/head_skb/Michael S. Tsirkin1-22/+24
rename local variable to make it easier to tell at a glance that we are dealing with a head skb. Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-11skbuff: skb_segment: s/skb_frag/frag/Michael S. Tsirkin1-7/+7
skb_frag can in fact point at either skb or fskb so rename it generally "frag". Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-11skbuff: skb_segment: s/frag/nskb_frag/Michael S. Tsirkin1-9/+9
frag points at nskb, so name it appropriately Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-03-10flowcache: restore a single flow_cache kmem_cacheEric Dumazet1-6/+8
It is not legal to create multiple kmem_caches with the same name. The flowcache can use a single kmem_cache; there is no need for a per-netns one. Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware") Reported-by: Jakub Kicinski <[email protected]> Tested-by: Jakub Kicinski <[email protected]> Tested-by: Fan Du <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
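A hedged sketch of the resulting allocation pattern (struct and names are illustrative, not the actual flowcache code):

    #include <linux/slab.h>

    struct flow_entry_sketch {
            unsigned long key;
            void *object;
    };

    /* one cache shared by all namespaces; creating a kmem_cache with the
     * same name once per netns is not allowed */
    static struct kmem_cache *flow_cachep_sketch;

    static int __init flow_cache_sketch_init(void)
    {
            flow_cachep_sketch = kmem_cache_create("flow_cache_sketch",
                                                   sizeof(struct flow_entry_sketch),
                                                   0, 0, NULL);
            return flow_cachep_sketch ? 0 : -ENOMEM;
    }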
2014-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-7/+4
Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2014-02-27neigh: recompute reachabletime before returning from neigh_periodic_work()Duan Jiong1-3/+3
If the number of entries in the neigh table is less than gc_thresh1, the function returns directly and the reachabletime is not recomputed, so the reachabletime can be guessed. Signed-off-by: Duan Jiong <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-02-27neigh: directly goto out after setting nud_state to NUD_FAILEDDuan Jiong1-0/+1
Because the following if conditions will not be matched. Signed-off-by: Duan Jiong <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-02-26neigh: probe application via netlink in NUD_PROBETimo Teräs1-4/+4
iproute2 arpd seems to expect this as there's code and comments to handle netlink probes with NUD_PROBE set. It is used to flush the arpd cached mappings. opennhrp instead turns off unicast probes (so it can handle all neighbour discovery). Without this change it will not see NUD_PROBE probes and cannot reconfirm the mapping. Thus currently the neigh entry will just fail and can cause a few packets to be dropped until broadcast discovery is restarted. Earlier discussion on the subject: http://marc.info/?t=139305877100001&r=1&w=2 Signed-off-by: Timo Teräs <[email protected]> Signed-off-by: David S. Miller <[email protected]>