aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2014-05-19netfilter: nf_tables: remove skb and nlh from context structurePablo Neira Ayuso1-4/+6
Instead of caching the original skbuff that contains the netlink messages, this stores the netlink message sequence number, the netlink portID and the report flag. This helps to prepare the introduction of the object release via call_rcu. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: use new transaction infrastructure to handle elementsPablo Neira Ayuso1-0/+10
Leave the set content in consistent state if we fail to load the batch. Use the new generic transaction infrastructure to achieve this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: use new transaction infrastructure to handle tablePablo Neira Ayuso1-0/+10
This patch speeds up rule-set updates and it also provides a way to revert updates and leave things in consistent state in case that the batch needs to be aborted. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: use new transaction infrastructure to handle chainPablo Neira Ayuso1-0/+17
This patch speeds up rule-set updates and it also introduces a way to revert chain updates if the batch is aborted. The idea is to store the changes in the transaction to apply that in the commit step. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: use new transaction infrastructure to handle setsPablo Neira Ayuso2-0/+18
This patch reworks the nf_tables API so set updates are included in the same batch that contains rule updates. This speeds up rule-set updates since we skip a dialog of four messages between kernel and user-space (two on each direction), from: 1) create the set and send netlink message to the kernel 2) process the response from the kernel that contains the allocated name. 3) add the set elements and send netlink message to the kernel. 4) process the response from the kernel (to check for errors). To: 1) add the set to the batch. 2) add the set elements to the batch. 3) add the rule that points to the set. 4) send batch to the kernel. This also introduces an internal set ID (NFTA_SET_ID) that is unique in the batch so set elements and rules can refer to new sets. Backward compatibility has been only retained in userspace, this means that new nft versions can talk to the kernel both in the new and the old fashion. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: add message type to transactionsPablo Neira Ayuso1-0/+2
The patch adds message type to the transaction to simplify the commit the and abort routines. Yet another step forward in the generalisation of the transaction infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: generalise transaction infrastructurePablo Neira Ayuso1-4/+11
This patch generalises the existing rule transaction infrastructure so it can be used to handle set, table and chain object transactions as well. The transaction provides a data area that stores private information depending on the transaction type. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19netfilter: nf_tables: deconstify table and chain in context structurePablo Neira Ayuso1-3/+3
The new transaction infrastructure updates the family, table and chain objects in the context structure, so let's deconstify them. While at it, move the context structure initialization routine to the top of the source file as it will be also used from the table and chain routines. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-04-24netfilter: nf_tables: Add meta expression key for bridge interface nameTomasz Bursztyka1-0/+4
NFT_META_BRI_IIFNAME to get packet input bridge interface name NFT_META_BRI_OIFNAME to get packet output bridge interface name Such meta key are accessible only through NFPROTO_BRIDGE family, on a dedicated nft meta module: nft_meta_bridge. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-04-23netfilter: nf_tables: Make meta expression core functions publicTomasz Bursztyka1-0/+36
This will be useful to create network family dedicated META expression as for NFPROTO_BRIDGE for instance. Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-04-02netfilter: nf_tables: implement proper set selectionPatrick McHardy2-0/+73
The current set selection simply choses the first set type that provides the requested features, which always results in the rbtree being chosen by virtue of being the first set in the list. What we actually want to do is choose the implementation that can provide the requested features and is optimal from either a performance or memory perspective depending on the characteristics of the elements and the preferences specified by the user. The elements are not known when creating a set. Even if we would provide them for anonymous (literal) sets, we'd still have standalone sets where the elements are not known in advance. We therefore need an abstract description of the data charcteristics. The kernel already knows the size of the key, this patch starts by introducing a nested set description which so far contains only the maximum amount of elements. Based on this the set implementations are changed to provide an estimate of the required amount of memory and the lookup complexity class. The set ops have a new callback ->estimate() that is invoked during set selection. It receives a structure containing the attributes known to the kernel and is supposed to populate a struct nft_set_estimate with the complexity class and, in case the size is known, the complete amount of memory required, or the amount of memory required per element otherwise. Based on the policy specified by the user (performance/memory, defaulting to performance) the kernel will then select the best suited implementation. Even if the set implementation would allow to add more than the specified maximum amount of elements, they are enforced since new implementations might not be able to add more than maximum based on which they were selected. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-04-01net: Add a test to see if a skb is freeable in irq contextEric W. Biederman1-0/+13
Currently netpoll and skb_release_head_state assume that a skb is freeable in hard irq context except when skb->destructor is set. The reality is far from this. So add a function skb_irq_freeable to compute the full test and in the process be the living documentation of what the requirements are of actually freeing a skb in hard irq context. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01net: ptp: move PTP classifier in its own fileDaniel Borkmann2-75/+22
This commit fixes a build error reported by Fengguang, that is triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set: ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined! The fix is to introduce its own file for the PTP BPF classifier, so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select it independently from each other. IXP4xx driver on ARM needs to select it as well since it does not seem to select PTP_1588_CLOCK or similar that would pull it in automatically. This also allows for hiding all of the internals of the BPF PTP program inside that file, and only exporting relevant API bits to drivers. This patch also adds a kdoc documentation of ptp_classify_raw() API to make it clear that it can return PTP_CLASS_* defines. Also, the BPF program has been translated into bpf_asm code, so that it can be more easily read and altered (extensively documented in [1]). In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg tools, so the commented program can simply be translated via `./bpf_asm -c prog` where prog is a file that contains the commented code. This makes it easily readable/verifiable and when there's a need to change something, jump offsets etc do not need to be replaced manually which can be very error prone. Instead, a newly translated version via bpf_asm can simply replace the old code. I have checked opcode diffs before/after and it's the very same filter. [1] Documentation/networking/filter.txt Fixes: 164d8c666521 ("net: ptp: do not reimplement PTP/BPF classifier") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Jiri Benc <jbenc@redhat.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01mac802154: make csma/cca parameters per-wpanPhoebe Buckheister2-1/+18
Commit 9b2777d6089bcd (ieee802154: add TX power control to wpan_phy) and following erroneously added CSMA and CCA parameters for 802.15.4 devices as PHY parameters, while they are actually MAC parameters and can differ for any two WPAN instances. Since it is now sensible to have multiple WPAN devices with differing CSMA/CCA parameters, make these parameters MAC parameters instead. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31Merge branch 'for-davem' of ↵David S. Miller1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== pull request: wireless-next 2014-03-31 Please accept this one last round of general wireless updates for the 3.15 merge window! For the Bluetooth bits, Gustavo says: "Here follow another set of patches to 3.15. This is mostly a bug fix pull request with the exception of one commit from Marcel which adds tracking to the current configured LE scan type parameter." Beyond that, notable bits include some final refactoring of rtl8180 and the addition of the rtl8187se driver, fixes for a number of problems identified by Dan Carpenter and his static analysis tools, and a handful of other bits here and there. Please let me know if there are problems! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net-gro: restore frag0 optimizationEric Dumazet1-5/+0
Main difference between napi_frags_skb() and napi_gro_receive() is that the later is called while ethernet header was already pulled by the NIC driver (eth_type_trans() was called before napi_gro_receive()) Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support") tried to remove this difference by calling eth_type_trans() from napi_frags_skb() instead of doing this later from napi_frags_finish() Goal was that napi_gro_complete() could call ptype->callbacks.gro_complete(skb, 0) (offset of first network header = 0) Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to point to their own header, for the current skb and ones held in gro_list Problem is this cleanup work defeated the frag0 optimization: It turns out the consecutive pskb_may_pull() calls are too expensive. This patch brings back the frag0 stuff in napi_frags_skb(). As all skb have their mac header in skb head, we no longer need skb_gro_mac_header() Reported-by: Michal Schmidt <mschmidt@redhat.com> Fixes: 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net-sysfs: expose number of carrier on/off changesdavid decotigny2-0/+4
This allows to monitor carrier on/off transitions and detect link flapping issues: - new /sys/class/net/X/carrier_changes - new rtnetlink IFLA_CARRIER_CHANGES (getlink) Tested: - grep . /sys/class/net/*/carrier_changes + ip link set dev X down/up + plug/unplug cable - updated iproute2: prints IFLA_CARRIER_CHANGES - iproute2 20121211-2 (debian): unchanged behavior Signed-off-by: David Decotigny <decot@googlers.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31ipv6: reuse rt6_need_strictWang Yufen1-0/+5
Move the whole rt6_need_strict as static inline into ip6_route.h, so that it can be reused Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: export NET_ADDR_* values to user-space APIFlorian Fainelli2-7/+6
NET_ADDR_* values are exported in the /sys/class/net/<iface>/addr_assign_type sysfs attributes, and as such constitutes an user-space ABI. Move the NET_ADDR_* definitions from include/linux/netdevice.h to include/uapi/linux/netdevice.h Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: Allow modules to use is_skb_forwardableVlad Yasevich1-0/+1
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31Merge branch 'master' of ↵John W. Linville1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
2014-03-31net: filter: rework/optimize internal BPF interpreter's instruction setAlexei Starovoitov2-11/+64
This patch replaces/reworks the kernel-internal BPF interpreter with an optimized BPF instruction set format that is modelled closer to mimic native instruction sets and is designed to be JITed with one to one mapping. Thus, the new interpreter is noticeably faster than the current implementation of sk_run_filter(); mainly for two reasons: 1. Fall-through jumps: BPF jump instructions are forced to go either 'true' or 'false' branch which causes branch-miss penalty. The new BPF jump instructions have only one branch and fall-through otherwise, which fits the CPU branch predictor logic better. `perf stat` shows drastic difference for branch-misses between the old and new code. 2. Jump-threaded implementation of interpreter vs switch statement: Instead of single table-jump at the top of 'switch' statement, gcc will now generate multiple table-jump instructions, which helps CPU branch predictor logic. Note that the verification of filters is still being done through sk_chk_filter() in classical BPF format, so filters from user- or kernel space are verified in the same way as we do now, and same restrictions/constraints hold as well. We reuse current BPF JIT compilers in a way that this upgrade would even be fine as is, but nevertheless allows for a successive upgrade of BPF JIT compilers to the new format. The internal instruction set migration is being done after the probing for JIT compilation, so in case JIT compilers are able to create a native opcode image, we're going to use that, and in all other cases we're doing a follow-up migration of the BPF program's instruction set, so that it can be transparently run in the new interpreter. In short, the *internal* format extends BPF in the following way (more details can be taken from the appended documentation): - Number of registers increase from 2 to 10 - Register width increases from 32-bit to 64-bit - Conditional jt/jf targets replaced with jt/fall-through - Adds signed > and >= insns - 16 4-byte stack slots for register spill-fill replaced with up to 512 bytes of multi-use stack space - Introduction of bpf_call insn and register passing convention for zero overhead calls from/to other kernel functions - Adds arithmetic right shift and endianness conversion insns - Adds atomic_add insn - Old tax/txa insns are replaced with 'mov dst,src' insn Performance of two BPF filters generated by libpcap resp. bpf_asm was measured on x86_64, i386 and arm32 (other libpcap programs have similar performance differences): fprog #1 is taken from Documentation/networking/filter.txt: tcpdump -i eth0 port 22 -dd fprog #2 is taken from 'man tcpdump': tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call, smaller is better: --x86_64-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 90 101 192 202 new BPF 31 71 47 97 old BPF jit 12 34 17 44 new BPF jit TBD --i386-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 107 136 227 252 new BPF 40 119 69 172 --arm32-- fprog #1 fprog #1 fprog #2 fprog #2 cache-hit cache-miss cache-hit cache-miss old BPF 202 300 475 540 new BPF 180 270 330 470 old BPF jit 26 182 37 202 new BPF jit TBD Thus, without changing any userland BPF filters, applications on top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf classifier, netfilter's xt_bpf, team driver's load-balancing mode, and many more will have better interpreter filtering performance. While we are replacing the internal BPF interpreter, we also need to convert seccomp BPF in the same step to make use of the new internal structure since it makes use of lower-level API details without being further decoupled through higher-level calls like sk_unattached_filter_{create,destroy}(), for example. Just as for normal socket filtering, also seccomp BPF experiences a time-to-verdict speedup: 05-sim-long_jumps.c of libseccomp was used as micro-benchmark: seccomp_rule_add_exact(ctx,... seccomp_rule_add_exact(ctx,... rc = seccomp_load(ctx); for (i = 0; i < 10000000; i++) syscall(199, 100); 'short filter' has 2 rules 'large filter' has 200 rules 'short filter' performance is slightly better on x86_64/i386/arm32 'large filter' is much faster on x86_64 and i386 and shows no difference on arm32 --x86_64-- short filter old BPF: 2.7 sec 39.12% bench libc-2.15.so [.] syscall 8.10% bench [kernel.kallsyms] [k] sk_run_filter 6.31% bench [kernel.kallsyms] [k] system_call 5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller 4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller 3.70% bench [kernel.kallsyms] [k] __secure_computing 3.67% bench [kernel.kallsyms] [k] lock_is_held 3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load new BPF: 2.58 sec 42.05% bench libc-2.15.so [.] syscall 6.91% bench [kernel.kallsyms] [k] system_call 6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller 6.07% bench [kernel.kallsyms] [k] __secure_computing 5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp --arm32-- short filter old BPF: 4.0 sec 39.92% bench [kernel.kallsyms] [k] vector_swi 16.60% bench [kernel.kallsyms] [k] sk_run_filter 14.66% bench libc-2.17.so [.] syscall 5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load 5.10% bench [kernel.kallsyms] [k] __secure_computing new BPF: 3.7 sec 35.93% bench [kernel.kallsyms] [k] vector_swi 21.89% bench libc-2.17.so [.] syscall 13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 6.25% bench [kernel.kallsyms] [k] __secure_computing 3.96% bench [kernel.kallsyms] [k] syscall_trace_exit --x86_64-- large filter old BPF: 8.6 seconds 73.38% bench [kernel.kallsyms] [k] sk_run_filter 10.70% bench libc-2.15.so [.] syscall 5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load 1.97% bench [kernel.kallsyms] [k] system_call new BPF: 5.7 seconds 66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 16.75% bench libc-2.15.so [.] syscall 3.31% bench [kernel.kallsyms] [k] system_call 2.88% bench [kernel.kallsyms] [k] __secure_computing --i386-- large filter old BPF: 5.4 sec new BPF: 3.8 sec --arm32-- large filter old BPF: 13.5 sec 73.88% bench [kernel.kallsyms] [k] sk_run_filter 10.29% bench [kernel.kallsyms] [k] vector_swi 6.46% bench libc-2.17.so [.] syscall 2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load 1.19% bench [kernel.kallsyms] [k] __secure_computing 0.87% bench [kernel.kallsyms] [k] sys_getuid new BPF: 13.5 sec 76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp 10.98% bench [kernel.kallsyms] [k] vector_swi 5.87% bench libc-2.17.so [.] syscall 1.77% bench [kernel.kallsyms] [k] __secure_computing 0.93% bench [kernel.kallsyms] [k] sys_getuid BPF filters generated by seccomp are very branchy, so the new internal BPF performance is better than the old one. Performance gains will be even higher when BPF JIT is committed for the new structure, which is planned in future work (as successive JIT migrations). BPF has also been stress-tested with trinity's BPF fuzzer. Joint work with Daniel Borkmann. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Kees Cook <keescook@chromium.org> Cc: Paul Moore <pmoore@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: linux-kernel@vger.kernel.org Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: isdn: use sk_unattached_filter apiDaniel Borkmann1-3/+2
Similarly as in ppp, we need to migrate the ISDN/PPP code to make use of the sk_unattached_filter api in order to decouple having direct filter structure access. By using sk_unattached_filter_{create,destroy}, we can allow for the possibility to jit compile filters for faster filter verdicts as well. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: isdn4linux@listserv.isdn4linux.de Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: ptp: do not reimplement PTP/BPF classifierDaniel Borkmann1-8/+2
There are currently pch_gbe, cpts, and ixp4xx_eth drivers that open-code and reimplement a BPF classifier for the PTP protocol. Since all of them effectively do the very same thing and load the very same PTP/BPF filter, we can just consolidate that code by introducing ptp_classify_raw() in the time-stamping core framework which can be used in drivers. As drivers get initialized after bootstrapping the core networking subsystem, they can make use of ptp_insns wrapped through ptp_classify_raw(), which allows to simplify and remove PTP classifier setup code in drivers. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Richard Cochran <richard.cochran@omicron.at> Cc: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: ptp: use sk_unattached_filter_create() for BPFDaniel Borkmann1-4/+0
This patch migrates an open-coded sk_run_filter() implementation with proper use of the BPF API, that is, sk_unattached_filter_create(). This migration is needed, as we will be internally transforming the filter to a different representation, and therefore needs to be decoupled. It is okay to do so as skb_timestamping_init() is called during initialization of the network stack in core initcall via sock_init(). This would effectively also allow for PTP filters to be jit compiled if bpf_jit_enable is set. For better readability, there are also some newlines introduced, also ptp_classify.h is only in kernel space. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Richard Cochran <richard.cochran@omicron.at> Cc: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: filter: move filter accounting to filter coreDaniel Borkmann2-40/+17
This patch basically does two things, i) removes the extern keyword from the include/linux/filter.h file to be more consistent with the rest of Joe's changes, and ii) moves filter accounting into the filter core framework. Filter accounting mainly done through sk_filter_{un,}charge() take care of the case when sockets are being cloned through sk_clone_lock() so that removal of the filter on one socket won't result in eviction as it's still referenced by the other. These functions actually belong to net/core/filter.c and not include/net/sock.h as we want to keep all that in a central place. It's also not in fast-path so uninlining them is fine and even allows us to get rd of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward declaration. Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: filter: keep original BPF program aroundDaniel Borkmann1-2/+13
In order to open up the possibility to internally transform a BPF program into an alternative and possibly non-trivial reversible representation, we need to keep the original BPF program around, so that it can be passed back to user space w/o the need of a complex decoder. The reason for that use case resides in commit a8fc92778080 ("sk-filter: Add ability to get socket filter program (v2)"), that is, the ability to retrieve the currently attached BPF filter from a given socket used mainly by the checkpoint-restore project, for example. Therefore, we add two helpers sk_{store,release}_orig_filter for taking care of that. In the sk_unattached_filter_create() case, there's no such possibility/requirement to retrieve a loaded BPF program. Therefore, we can spare us the work in that case. This approach will simplify and slightly speed up both, sk_get_filter() and sock_diag_put_filterinfo() handlers as we won't need to successively decode filters anymore through sk_decode_filter(). As we still need sk_decode_filter() later on, we're keeping it around. Joint work with Alexei Starovoitov. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31net: filter: add jited flag to indicate jit compiled filtersDaniel Borkmann1-1/+2
This patch adds a jited flag into sk_filter struct in order to indicate whether a filter is currently jited or not. The size of sk_filter is not being expanded as the 32 bit 'len' member allows upper bits to be reused since a filter can currently only grow as large as BPF_MAXINSNS. Therefore, there's enough room also for other in future needed flags to reuse 'len' field if necessary. The jited flag also allows for having alternative interpreter functions running as currently, we can only detect jit compiled filters by testing fp->bpf_func to not equal the address of sk_run_filter(). Joint work with Alexei Starovoitov. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-5/+14
Conflicts: drivers/net/ethernet/marvell/mvneta.c The mvneta.c conflict is a case of overlapping changes, a conversion to devm_ioremap_resource() vs. a conversion to netdev_alloc_pcpu_stats. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29netpoll: Respect NETIF_F_LLTXEric W. Biederman1-0/+5
Stop taking the transmit lock when a network device has specified NETIF_F_LLTX. If no locks needed to trasnmit a packet this is the ideal scenario for netpoll as all packets can be trasnmitted immediately. Even if some locks are needed in ndo_start_xmit skipping any unnecessary serialization is desirable for netpoll as it makes it more likely a debugging packet may be trasnmitted immediately instead of being deferred until later. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29netpoll: Rename netpoll_rx_enable/disable to netpoll_poll_disable/enableEric W. Biederman1-4/+4
The netpoll_rx_enable and netpoll_rx_disable functions have always controlled polling the network drivers transmit and receive queues. Rename them to netpoll_poll_enable and netpoll_poll_disable to make their functionality clear. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29netpoll: Remove gfp parameter from __netpoll_setupEric W. Biederman2-3/+2
The gfp parameter was added in: commit 47be03a28cc6c80e3aa2b3e8ed6d960ff0c5c0af Author: Amerigo Wang <amwang@redhat.com> Date: Fri Aug 10 01:24:37 2012 +0000 netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup() slave_enable_netpoll() and __netpoll_setup() may be called with read_lock() held, so should use GFP_ATOMIC to allocate memory. Eric suggested to pass gfp flags to __netpoll_setup(). Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> The reason for the gfp parameter was removed in: commit c4cdef9b7183159c23c7302aaf270d64c549f557 Author: dingtianhong <dingtianhong@huawei.com> Date: Tue Jul 23 15:25:27 2013 +0800 bonding: don't call slave_xxx_netpoll under spinlocks The slave_xxx_netpoll will call synchronize_rcu_bh(), so the function may schedule and sleep, it should't be called under spinlocks. bond_netpoll_setup() and bond_netpoll_cleanup() are always protected by rtnl lock, it is no need to take the read lock, as the slave list couldn't be changed outside rtnl lock. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Cc: Jay Vosburgh <fubar@us.ibm.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net> Nothing else that calls __netpoll_setup or ndo_netpoll_setup requires a gfp paramter, so remove the gfp parameter from both of these functions making the code clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28vlan: Warn the user if lowerdev has bad vlan features.Vlad Yasevich1-0/+7
Some drivers incorrectly assign vlan acceleration features to vlan_features thus causing issues for Q-in-Q vlan configurations. Warn the user of such cases. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28net: Account for all vlan headers in skb_mac_gso_segmentVlad Yasevich1-1/+1
skb_network_protocol() already accounts for multiple vlan headers that may be present in the skb. However, skb_mac_gso_segment() doesn't know anything about it and assumes that skb->mac_len is set correctly to skip all mac headers. That may not always be the case. If we are simply forwarding the packet (via bridge or macvtap), all vlan headers may not be accounted for. A simple solution is to allow skb_network_protocol to return the vlan depth it has calculated. This way skb_mac_gso_segment will correctly skip all mac headers. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28ipv6: move DAD and addrconf_verify processing to workqueueHannes Frederic Sowa1-1/+3
addrconf_join_solict and addrconf_join_anycast may cause actions which need rtnl locked, especially on first address creation. A new DAD state is introduced which defers processing of the initial DAD processing into a workqueue. To get rtnl lock we need to push the code paths which depend on those calls up to workqueues, specifically addrconf_verify and the DAD processing. (v2) addrconf_dad_failure needs to be queued up to the workqueue, too. This patch introduces a new DAD state and stop the DAD processing in the workqueue (this is because of the possible ipv6_del_addr processing which removes the solicited multicast address from the device). addrconf_verify_lock is removed, too. After the transition it is not needed any more. As we are not processing in bottom half anymore we need to be a bit more careful about disabling bottom half out when we lock spin_locks which are also used in bh. Relevant backtrace: [ 541.030090] RTNL: assertion failed at net/core/dev.c (4496) [ 541.031143] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.10.33-1-amd64-vyatta #1 [ 541.031145] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 541.031146] ffffffff8148a9f0 000000000000002f ffffffff813c98c1 ffff88007c4451f8 [ 541.031148] 0000000000000000 0000000000000000 ffffffff813d3540 ffff88007fc03d18 [ 541.031150] 0000880000000006 ffff88007c445000 ffffffffa0194160 0000000000000000 [ 541.031152] Call Trace: [ 541.031153] <IRQ> [<ffffffff8148a9f0>] ? dump_stack+0xd/0x17 [ 541.031180] [<ffffffff813c98c1>] ? __dev_set_promiscuity+0x101/0x180 [ 541.031183] [<ffffffff813d3540>] ? __hw_addr_create_ex+0x60/0xc0 [ 541.031185] [<ffffffff813cfe1a>] ? __dev_set_rx_mode+0xaa/0xc0 [ 541.031189] [<ffffffff813d3a81>] ? __dev_mc_add+0x61/0x90 [ 541.031198] [<ffffffffa01dcf9c>] ? igmp6_group_added+0xfc/0x1a0 [ipv6] [ 541.031208] [<ffffffff8111237b>] ? kmem_cache_alloc+0xcb/0xd0 [ 541.031212] [<ffffffffa01ddcd7>] ? ipv6_dev_mc_inc+0x267/0x300 [ipv6] [ 541.031216] [<ffffffffa01c2fae>] ? addrconf_join_solict+0x2e/0x40 [ipv6] [ 541.031219] [<ffffffffa01ba2e9>] ? ipv6_dev_ac_inc+0x159/0x1f0 [ipv6] [ 541.031223] [<ffffffffa01c0772>] ? addrconf_join_anycast+0x92/0xa0 [ipv6] [ 541.031226] [<ffffffffa01c311e>] ? __ipv6_ifa_notify+0x11e/0x1e0 [ipv6] [ 541.031229] [<ffffffffa01c3213>] ? ipv6_ifa_notify+0x33/0x50 [ipv6] [ 541.031233] [<ffffffffa01c36c8>] ? addrconf_dad_completed+0x28/0x100 [ipv6] [ 541.031241] [<ffffffff81075c1d>] ? task_cputime+0x2d/0x50 [ 541.031244] [<ffffffffa01c38d6>] ? addrconf_dad_timer+0x136/0x150 [ipv6] [ 541.031247] [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6] [ 541.031255] [<ffffffff8105313a>] ? call_timer_fn.isra.22+0x2a/0x90 [ 541.031258] [<ffffffffa01c37a0>] ? addrconf_dad_completed+0x100/0x100 [ipv6] Hunks and backtrace stolen from a patch by Stephen Hemminger. Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28net: net: add a core netdev->tx_dropped counterEric Dumazet1-3/+4
Dropping packets in __dev_queue_xmit() when transmit queue is stopped (NIC TX ring buffer full or BQL limit reached) currently outputs a syslog message. It would be better to get a precise count of such events available in netdevice stats so that monitoring tools can have a clue. This extends the work done in caf586e5f23ce ("net: add a core netdev->rx_dropped counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28net/mlx4: Implement vxlan ndo callsOr Gerlitz1-1/+1
Add implementation for the add/del vxlan port ndo calls, using the CONFIG_DEV firmware command. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28mlx4: Add support for CONFIG_DEV commandOr Gerlitz2-0/+3
Introduce the CONFIG_DEV firmware command which we will use to configure the UDP port assumed by the firmware for the VXLAN offloads. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errorsZoltan Kiss1-2/+2
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't orphan them. Also, it doesn't handle errors, so this patch takes care of that as well, and modify the callers accordingly. skb_tx_error() is also added to the callers so they will signal the failed delivery towards the creator of the skb. Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27ipv6: do not overwrite inetpeer metrics prematurelyMichal Kubeček2-3/+11
If an IPv6 host route with metrics exists, an attempt to add a new route for the same target with different metrics fails but rewrites the metrics anyway: 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1s 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1500 RTNETLINK answers: File exists 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 rto_min lock 1.5s This is caused by all IPv6 host routes using the metrics in their inetpeer (or the shared default). This also holds for the new route created in ip6_route_add() which shares the metrics with the already existing route and thus ip6_route_add() rewrites the metrics even if the new route ends up not being used at all. Another problem is that old metrics in inetpeer can reappear unexpectedly for a new route, e.g. 12sp0:~ # ip route add fec0::1 dev eth0 rto_min 1000 12sp0:~ # ip route del fec0::1 12sp0:~ # ip route add fec0::1 dev eth0 12sp0:~ # ip route change fec0::1 dev eth0 hoplimit 10 12sp0:~ # ip -6 route show fe80::/64 dev eth0 proto kernel metric 256 fec0::1 dev eth0 metric 1024 hoplimit 10 rto_min lock 1s Resolve the first problem by moving the setting of metrics down into fib6_add_rt2node() to the point we are sure we are inserting the new route into the tree. Second problem is addressed by introducing new flag DST_METRICS_FORCE_OVERWRITE which is set for a new host route in ip6_route_add() and makes ipv6_cow_metrics() always overwrite the metrics in inetpeer (even if they are not "new"); it is reset after that. v5: use a flag in _metrics member rather than one in flags v4: fix a typo making a condition always true (thanks to Hannes Frederic Sowa) v3: rewritten based on David Miller's idea to move setting the metrics (and allocation in non-host case) down to the point we already know the route is to be inserted. Also rebased to net-next as it is quite late in the cycle. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27usbnet: include wait queue head in device structureOliver Neukum1-1/+1
This fixes a race which happens by freeing an object on the stack. Quoting Julius: > The issue is > that it calls usbnet_terminate_urbs() before that, which temporarily > installs a waitqueue in dev->wait in order to be able to wait on the > tasklet to run and finish up some queues. The waiting itself looks > okay, but the access to 'dev->wait' is totally unprotected and can > race arbitrarily. I think in this case usbnet_bh() managed to succeed > it's dev->wait check just before usbnet_terminate_urbs() sets it back > to NULL. The latter then finishes and the waitqueue_t structure on its > stack gets overwritten by other functions halfway through the > wake_up() call in usbnet_bh(). The fix is to just not allocate the data structure on the stack. As dev->wait is abused as a flag it also takes a runtime PM change to fix this bug. Signed-off-by: Oliver Neukum <oneukum@suse.de> Reported-by: Grant Grundler <grundler@google.com> Tested-by: Grant Grundler <grundler@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27Merge branch 'for-upstream' of ↵John W. Linville1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
2014-03-26net: sxgbe: add basic framework for Samsung 10Gb ethernet driverSiva Reddy1-0/+54
This patch adds support for Samsung 10Gb ethernet driver(sxgbe). - sxgbe core initialization - Tx and Rx support - MDIO support - ISRs for Tx and Rx - ifconfig support to driver Signed-off-by: Siva Reddy Kallam <siva.kallam@samsung.com> Signed-off-by: Vipul Pandya <vipul.pandya@samsung.com> Signed-off-by: Girish K S <ks.giri@samsung.com> Neatening-by: Joe Perches <joe@perches.com> Signed-off-by: Byungho An <bh74.an@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26vlan: make a new function vlan_dev_vlan_proto() and exportdingtianhong1-0/+7
The vlan support 2 proto: 802.1q and 802.1ad, so make a new function called vlan_dev_vlan_proto() which could return the vlan proto for input dev. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26net: Rename skb->rxhash to skb->hashTom Herbert3-22/+22
The packet hash can be considered a property of the packet, not just on RX path. This patch changes name of rxhash and l4_rxhash skbuff fields to be hash and l4_hash respectively. This includes changing uses of the field in the code which don't call the access functions. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller6-19/+17
Conflicts: Documentation/devicetree/bindings/net/micrel-ks8851.txt net/core/netpoll.c The net/core/netpoll.c conflict is a bug fix in 'net' happening to code which is completely removed in 'net-next'. In micrel-ks8851.txt we simply have overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-25Merge branch 'for-davem' of ↵David S. Miller11-21/+82
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== Please pull this batch of wireless updates intended for 3.15! For the mac80211 bits, Johannes says: "This has a whole bunch of bugfixes for things that went into -next previously as well as some other bugfixes I didn't want to rush into 3.14 at this point. The rest of it is some cleanups and a few small features, the biggest of which is probably Janusz's regulatory DFS CAC time code." For the Bluetooth bits, Gustavo says: "One more pull request to 3.15. This is mostly and bug fix pull request, it contains several fixes and clean up all over the tree, plus some small new features." For the NFC bits, Samuel says: "This is the NFC pull request for 3.15. With this one we have: - Support for ISO 15693 a.k.a. NFC vicinity a.k.a. Type 5 tags. ISO 15693 are long range (1 - 2 meters) vicinity tags/cards. The kernel now supports those through the NFC netlink and digital APIs. - Support for TI's trf7970a chipset. This chipset relies on the NFC digital layer and the driver currently supports type 2, 4A and 5 tags. - Support for NXP's pn544 secure firmare download. The pn544 C3 chipsets relies on a different firmware download protocal than the C2 one. We now support both and use the right one depending on the version we detect at runtime. - Support for 4A tags from the NFC digital layer. - A bunch of cleanups and minor fixes from Axel Lin and Thierry Escande." For the iwlwifi bits, Emmanuel says: "We were sending a host command while the mutex wasn't held. This led to hard-to-catch races." And... "I have a fix for a "merge damage" which is not really a merge damage: it enables scheduled scan which has been disabled in wireless.git. Since you merged wireless.git into wireless-next.git, this can now be fixed in wireless-next.git. Besides this, Alex made a workaround for a hardware bug. This fix allows us to consume less power in S3. Arik and Eliad continue to work on D0i3 which is a run-time power saving feature. Eliad also contributes a few bits to the rate scaling logic to which Eyal adds his own contribution. Avri dives deep in the power code - newer firmware will allow to enable power save in newer scenarios. Johannes made a few clean-ups. I have the regular amount of BT Coex boring stuff. I disable uAPSD since we identified firmware bugs that cause packet loss. One thing that do stand out is the udev event that we now send when the FW asserts. I hope it will allow us to debug the FW more easily." Also included is one last iwlwifi pull for a build breakage fix... For the Atheros bits, Kalle says: "Michal now did some optimisations and was able to improve throughput by 100 Mbps on our MIPS based AP135 platform. Chun-Yeow added some workarounds to be able to better use ad-hoc mode. Ben improved log messages and added support for MSDU chaining. And, as usual, also some smaller fixes." Beyond that... Andrea Merello continues his rtl8180 refactoring, in preparation for a long-awaited rtl8187 driver. We get a new driver (rsi) for the RS9113 chip, from Fariya Fatima. And, of course, we get the usual round of updates for ath9k, brcmfmac, mwifiex, wil6210, etc. as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-24if_vlan: Call dev_kfree_skb_any instead of kfree_skb.Eric W. Biederman1-1/+1
Replace kfree_skb with dev_kfree_skb_any in vlan_insert_tag as vlan_insert_tag can be called from hard irq context (netpoll) and from other contexts. dev_kfree_skb_any is used as vlan_insert_tag only frees the skb if the skb can not be modified to insert a tag, in which case vlan_insert_tag drops the skb. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-8/+14
Pull networking fixes from David Miller: 1) OpenVswitch's lookup_datapath() returns error pointers, so don't check against NULL. From Jiri Pirko. 2) pfkey_compile_policy() code path tries to do a GFP_KERNEL allocation under RCU locks, fix by using GFP_ATOMIC when necessary. From Nikolay Aleksandrov. 3) phy_suspend() indirectly passes uninitialized data into the ethtool get wake-on-land implementations. Fix from Sebastian Hesselbarth. 4) CPSW driver unregisters CPTS twice, fix from Benedikt Spranger. 5) If SKB allocation of reply packet fails, vxlan's arp_reduce() defers a NULL pointer. Fix from David Stevens. 6) IPV6 neigh handling in vxlan doesn't validate the destination address properly, and it builds a packet with the src and dst reversed. Fix also from David Stevens. 7) Fix spinlock recursion during subscription failures in TIPC stack, from Erik Hugne. 8) Revert buggy conversion of davinci_emac to devm_request_irq, from Chrstian Riesch. 9) Wrong flags passed into forwarding database netlink notifications, from Nicolas Dichtel. 10) The netpoll neighbour soliciation handler checks wrong ethertype, needs to be ETH_P_IPV6 rather than ETH_P_ARP. Fix from Li RongQing. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits) tipc: fix spinlock recursion bug for failed subscriptions vxlan: fix nonfunctional neigh_reduce() net: davinci_emac: Fix rollback of emac_dev_open() net: davinci_emac: Replace devm_request_irq with request_irq netpoll: fix the skb check in pkt_is_ns net: micrel : ks8851-ml: add vdd-supply support ip6mr: fix mfc notification flags ipmr: fix mfc notification flags rtnetlink: fix fdb notification flags tcp: syncookies: do not use getnstimeofday() netlink: fix setsockopt in mmap examples in documentation openvswitch: Correctly report flow used times for first 5 minutes after boot. via-rhine: Disable device in error path ATHEROS-ATL1E: Convert iounmap to pci_iounmap vxlan: fix potential NULL dereference in arp_reduce() cnic: Update version to 2.5.20 and copyright year. cnic,bnx2i,bnx2fc: Fix inconsistent use of page size cnic: Use proper ulp_ops for per device operations. net: cdc_ncm: fix control message ordering ipv6: ip6_append_data_mtu do not handle the mtu of the second fragment properly ...
2014-03-24ipv4: remove ip_rt_dump from route.cLi RongQing1-1/+0
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>