path: root/include
2018-05-07  qed: Remove unused data member 'is_mf_default'.  (Sudarsana Reddy Kalluru; 1 file, -1/+0)
The data member 'is_mf_default' is not used by the qed/qede drivers; remove it. Signed-off-by: Sudarsana Reddy Kalluru <[email protected]> Signed-off-by: Ariel Elior <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-07  qed*: Refactor mf_mode to consist of bits.  (Sudarsana Reddy Kalluru; 1 file, -1/+1)
The `mf_mode' field indicates the multi-partitioning mode the device is configured to. This method doesn't scale very well: adding a new MF mode requires going over all the existing conditions and deciding whether those are needed for the new mode or not. The patch defines a set of bit-fields for modes which are derived according to the mode info shared by the MFW, and all the configuration is made according to those. To add a new mode, there is a single place where we need to go and choose which bits apply and which don't. Signed-off-by: Sudarsana Reddy Kalluru <[email protected]> Signed-off-by: Ariel Elior <[email protected]> Signed-off-by: David S. Miller <[email protected]>
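For illustration, a minimal standalone sketch of the bits-based approach; the flag names below are hypothetical, not the actual qed defines:

    /*
     * Illustrative only.  Each MF capability becomes one bit, derived
     * once from the mode info the MFW reports; every configuration site
     * then tests the bit it cares about instead of switching on a
     * monolithic mf_mode value.
     */
    #include <stdio.h>

    #define BIT(n)                (1UL << (n))
    #define QED_MF_OVLAN_CLSS     BIT(0)  /* outer-VLAN classification */
    #define QED_MF_LLH_MAC_CLSS   BIT(1)  /* MAC-based classification  */

    int main(void)
    {
        unsigned long mf_bits = QED_MF_LLH_MAC_CLSS; /* derived from MFW info */

        if (mf_bits & QED_MF_LLH_MAC_CLSS)
            printf("configure MAC-based classification\n");
        if (!(mf_bits & QED_MF_OVLAN_CLSS))
            printf("skip outer-VLAN classification\n");
        return 0;
    }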
2018-05-07  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller; 13 files, -378/+292)
Minor conflict, a CHECK was placed into an if() statement in net-next, whilst a newline was added to that CHECK call in 'net'. Thanks to Daniel for the merge resolution. Signed-off-by: David S. Miller <[email protected]>
2018-05-07  net: u64_stats_sync: Remove functions without user  (Anna-Maria Gleixner; 1 file, -14/+0)
Commit 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()") removed the only users of u64_stats_update_end/begin_raw() without removing the functions from the header file. Remove the no longer used functions. Cc: Eric Dumazet <[email protected]> Signed-off-by: Anna-Maria Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  (David S. Miller; 23 files, -224/+282)
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree. The more relevant updates in this batch are:

1) Add Maglev support to IPVS. Moreover, store the latest server weight in IPVS since this is needed by Maglev, patches from Inju Song.
2) Preparation works to add iptables flowtable support, patches from Felix Fietkau.
3) Hand flows back to the conntrack slow path in case a TCP RST/FIN packet is seen, via the new teardown state, also from Felix.
4) Add support for extended netlink error reporting for nf_tables.
5) Support for timeouts larger than 23 days in nf_tables, patch from Florian Westphal.
6) Always set an upper limit to dynamic sets, also from Florian.
7) Allow the number generator to make map lookups, from Laura Garcia.
8) Use hash_32() instead of open-coded hashing in IPVS, from Vicent Bernat.
9) Extend the ip6tables SRH match to support previous, next and last SID, from Ahmed Abdelsalam.
10) Move the Passive OS fingerprint code into nf_osf.c, from Fernando Fernandez.
11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.
12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables, from Taehee Yoo.
13) Unify the meta bridge with core nft_meta, then make nft_meta built-in. Make rt and exthdr built-in too, again from Florian.
14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================
Signed-off-by: David S. Miller <[email protected]>
2018-05-07  netfilter: ctnetlink: export nf_conntrack_max  (Florent Fourcot; 1 file, -0/+1)
The IPCTNL_MSG_CT_GET_STATS netlink command allows monitoring the current number of conntrack entries. However, if one wants to compare it with the maximum (and detect exhaustion), the only solution is currently to read the sysctl value. This patch adds the nf_conntrack_max value to the netlink message, and simplifies monitoring for applications built on the netlink API. Signed-off-by: Florent Fourcot <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-07  netfilter: extract Passive OS fingerprint infrastructure from xt_osf  (Fernando Fernandez Mancera; 3 files, -89/+134)
Add nf_osf_ttl() and nf_osf_match() into nf_osf.c to prepare for nf_tables support. Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-06  netfilter: nf_tables: Provide NFT_{RT,CT}_MAX for userspace  (Phil Sutter; 1 file, -0/+4)
These macros allow conveniently declaring arrays which use NFT_{RT,CT}_* values as indexes. Signed-off-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-06  netfilter: nf_nat: remove unused ct arg from lookup functions  (Florian Westphal; 1 file, -16/+8)
Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-06  netfilter: ip6t_srh: extend SRH matching for previous, next and last SID  (Ahmed Abdelsalam; 1 file, -2/+41)
The IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by an SR encapsulated packet. Each SID is encoded as an IPv6 prefix. When a firewall receives an SR encapsulated packet, it should be able to identify which node previously processed the packet (previous SID), which node is going to process the packet next (next SID), and which node is the last to process the packet (last SID), which represents the final destination of the packet in case of inline SR mode. An example use-case of these features could be a SID list that includes two firewalls. When the second firewall receives a packet, it can check whether the packet has been processed by the first firewall or not. Based on that check, it decides to apply all rules, apply just a subset of the rules, or skip all rules entirely and forward the packet to the next SID. This patch extends the SRH match to support matching the previous SID, next SID, and last SID. Signed-off-by: Ahmed Abdelsalam <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-06  netfilter: nft_numgen: add map lookups for numgen statements  (Laura Garcia Liebana; 1 file, -0/+4)
This patch includes a new attribute in the numgen structure to allow the lookup of an element based on the number generator as a key. For this purpose, different ops have been included to extend the current numgen inc functions. Currently this is only supported for numgen incremental operations, but it will be supported for random ones in a follow-up patch. Signed-off-by: Laura Garcia Liebana <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-04  net/ipv6: rename rt6_next to fib6_next  (David Ahern; 1 file, -3/+3)
This slipped through the cracks in the followup set to the fib6_info flip. Rename rt6_next to fib6_next. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-04  bpf: offload: allow offloaded programs to use perf event arrays  (Jakub Kicinski; 1 file, -0/+5)
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes. The map only holds glue to the perf ring, not actual data. Allow non-offloaded perf event arrays to be used in offloaded programs. The offload driver can extract the events from HW and put them in the map for user space to retrieve. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Quentin Monnet <[email protected]> Reviewed-by: Jiong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
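For illustration, a minimal XDP program pushing events into a perf event array; a sketch in modern libbpf syntax, not taken from the patch. With this change the same program can be HW-offloaded while the map stays on the host:

    /* Sketch only. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } events SEC(".maps");

    SEC("xdp")
    int xdp_report(struct xdp_md *ctx)
    {
        __u32 pkt_len = ctx->data_end - ctx->data;

        /* The perf array itself stays on the host; an offloaded
         * program only needs the glue to push events into it. */
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              &pkt_len, sizeof(pkt_len));
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";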
2018-05-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller; 28 files, -92/+225)
Overlapping changes in selftests Makefile. Signed-off-by: David S. Miller <[email protected]>
2018-05-04  bpf: centre subprog information fields  (Jiong Wang; 1 file, -3/+6)
It is better to centre all subprog information fields into one structure. This structure could later serve as a function node in the call graph. Signed-off-by: Jiong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-04  bpf: unify main prog and subprog  (Jiong Wang; 1 file, -1/+1)
Currently, the verifier treats the main prog and subprogs differently. All subprogs detected are kept in env->subprog_starts, while the main prog is not kept there. Instead, the main prog is implicitly defined as the prog starting at 0. There is actually no difference between the main prog and a subprog; it is better to unify them and register all progs detected into env->subprog_starts. This also helps simplify some code logic. Signed-off-by: Jiong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-05-03  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds; 3 files, -10/+7)
Pull networking fixes from David Miller:

1) Various sockmap fixes from John Fastabend (pinned map handling, blocking in recvmsg, double page put, error handling during redirect failures, etc.)
2) Fix dead code handling in x86-64 JIT, from Gianluca Borello.
3) Missing device put in RDS IB code, from Dag Moxnes.
4) Don't process fast open during repair mode in TCP, from Yuchung Cheng.
5) Move address/port comparison fixes in SCTP, from Xin Long.
6) Handle adding a bond slave's master into a bridge properly, from Hangbin Liu.
7) IPv6 multipath code can operate on uninitialized memory due to an assumption that the icmp header is in the linear SKB area. Fix from Eric Dumazet.
8) Don't invoke do_tcp_sendpages() recursively via TLS, from Dave Watson.
9) Fix memory leaks in x86-64 JIT, from Daniel Borkmann.
10) RDS leaks kernel memory to userspace, from Eric Dumazet.
11) DCCP can invoke a tasklet on a freed socket, take a refcount. Also from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits)
  dccp: fix tasklet usage
  smc: fix sendpage() call
  net/smc: handle unregistered buffers
  net/smc: call consolidation
  qed: fix spelling mistake: "offloded" -> "offloaded"
  net/mlx5e: fix spelling mistake: "loobpack" -> "loopback"
  tcp: restore autocorking
  rds: do not leak kernel memory to user land
  qmi_wwan: do not steal interfaces from class drivers
  ipv4: fix fnhe usage by non-cached routes
  bpf: sockmap, fix error handling in redirect failures
  bpf: sockmap, zero sg_size on error when buffer is released
  bpf: sockmap, fix scatterlist update on error path in send with apply
  net_sched: fq: take care of throttled flows before reuse
  ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"
  bpf, x64: fix memleak when not converging on calls
  bpf, x64: fix memleak when not converging after image
  net/smc: restrict non-blocking connect finish
  8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
  sctp: fix the issue that the cookie-ack with auth can't get processed
  ...
2018-05-03  bpf: add skb_load_bytes_relative helper  (Daniel Borkmann; 1 file, -1/+32)
This adds a small BPF helper similar to bpf_skb_load_bytes() that is able to load relative to mac/net header offset from the skb's linear data. Compared to bpf_skb_load_bytes(), it takes a fifth argument namely start_header, which is either BPF_HDR_START_MAC or BPF_HDR_START_NET. This allows for a more flexible alternative compared to LD_ABS/LD_IND with negative offset. It's enabled for tc BPF programs as well as sock filter program types where it's mainly useful in reuseport programs to ease access to lower header data. Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
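A minimal sketch of using the new helper in a socket filter; the helper and BPF_HDR_START_NET come from this patch, the surrounding program does not:

    /* Sketch only. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("socket")
    int demux(struct __sk_buff *skb)
    {
        __u8 buf[4];

        /* Load 4 bytes starting at the network header, independent
         * of how much headroom precedes it in the linear data. */
        if (bpf_skb_load_bytes_relative(skb, 0, buf, sizeof(buf),
                                        BPF_HDR_START_NET) < 0)
            return 0;            /* drop on failure */

        return skb->len;         /* keep the whole packet */
    }

    char _license[] SEC("license") = "GPL";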
2018-05-03  bpf: implement ld_abs/ld_ind in native bpf  (Daniel Borkmann; 2 files, -1/+5)
The main part of this work is to finally allow removal of LD_ABS and LD_IND from the BPF core by reimplementing them through native eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF, and keeping them around in native eBPF caused way more trouble than it was actually worth. To list just some of the security issues in the past:

  * fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
  * 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
  * e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
  * 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
  * 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
  * 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")

For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy these days due to their limitations and the more efficient/flexible alternatives that have been developed over time, such as direct packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a register, the load happens in host endianness, and their exception handling can yield unexpected behavior. The latter is explained in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by div/mod by 0 exception") along with similar cases of exceptions we had. In native eBPF, more recent program types disable LD_ABS/LD_IND altogether through may_access_skb() in the verifier, and given the limitations in terms of exception handling, they are also disabled in programs that use BPF to BPF calls.

In terms of cBPF, LD_ABS/LD_IND is used in networking programs to access packet data. It is not used in seccomp-BPF, but in programs that use it for socket filtering or reuseport demuxing with cBPF. This is mostly relevant for applications that have not yet migrated to native eBPF.

The main complexity and source of bugs in LD_ABS/LD_IND comes from their implementation in the various JITs. Most of them keep the model around from cBPF times by implementing a fastpath written in asm. They typically use two CPU registers, hidden from the BPF program, for caching the skb's headlen (skb->len - skb->data_len) and skb->data. Throughout the JIT phase this requires keeping track of whether LD_ABS/LD_IND are used and, if so, the two registers need to be recached each time a BPF helper would change the underlying packet data in the native eBPF case. At least in the eBPF case, available CPU registers are rare, and the additional exit path out of the asm-written JIT helper also makes it inflexible, since not all parts of the JIT are in control from plain C. A LD_ABS/LD_IND implementation in eBPF therefore allows us to significantly reduce the complexity in JITs with comparable performance results for them, e.g.:

           test_bpf   tcpdump port 22   tcpdump complex
  x64
  - before  15 21 10   14 19 18
  - after    7 10 10    7 10 15
  arm64
  - before  40 91 92   40 91 151
  - after   51 64 73   51 62 113

For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter() and cache the skb's headlen and data in the cBPF prologue. BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just used as a local temporary variable. This also allows shrinking the image on x86_64 slightly for seccomp programs, since mapping to %rsi is not an ereg. In the callee-saved R8 and R9 we now track skb data and headlen, respectively. For normal prologue emission in the JITs this does not add any extra instructions since R8 and R9 are pushed to the stack in any case from the eBPF side.

cBPF uses the convert_bpf_ld_abs() emitter, which probes the fast path inline already and falls back to the bpf_skb_load_helper_{8,16,32}() helpers relying on the cached skb data and headlen as well. R8 and R9 never need to be reloaded due to bpf_helper_changes_pkt_data(), since all skb access in cBPF is read-only. Then, for the case of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls the bpf_skb_load_helper_{8,16,32}_no_cache() helpers unconditionally, and neither caches skb data and headlen nor has an inlined fast path. The reason for the latter is that native eBPF does not have any extra registers available anyway, but even if there were, it avoids any reload of skb data and headlen in the first place. Additionally, for negative offsets, we provide an alternative bpf_skb_load_bytes_relative() helper in eBPF which operates similarly to bpf_skb_load_bytes() and allows for more flexibility. Tested myself on x64, arm64, s390x; tested by Sandipan on ppc64.

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier  (Daniel Borkmann; 1 file, -2/+0)
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. The reason is that the eBPF tests from the test_bpf module do not go via the BPF verifier, and therefore any instruction rewrites from the verifier cannot take place. Therefore, move them into test_verifier, which runs out of user space, so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming patches. It will have the same effect since runtime tests are also performed from there. This also allows us to finally unexport bpf_skb_vlan_{push,pop}_proto and keep it internal to the core kernel. Additionally, also add further cBPF LD_ABS/LD_IND test coverage into the test_bpf.ko suite. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: statistics support  (Magnus Karlsson; 1 file, -0/+7)
In this commit, a new getsockopt is added: XDP_STATISTICS. This is used to obtain stats from the sockets.
v2: getsockopt now returns size of stats structure.
Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
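For illustration, a user-space sketch of the new getsockopt; the struct xdp_statistics field names follow the AF_XDP uapi, though the exact layout at this commit may differ:

    /* Sketch only. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>

    static void print_xsk_stats(int xsk_fd)
    {
        struct xdp_statistics stats;
        socklen_t optlen = sizeof(stats);   /* v2: kernel returns actual size */

        if (getsockopt(xsk_fd, SOL_XDP, XDP_STATISTICS,
                       &stats, &optlen) == 0)
            printf("rx_dropped: %llu rx_invalid_descs: %llu\n",
                   (unsigned long long)stats.rx_dropped,
                   (unsigned long long)stats.rx_invalid_descs);
    }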
2018-05-03  dev: packet: make packet_direct_xmit a common function  (Magnus Karlsson; 1 file, -0/+1)
The new dev_direct_xmit will be used by AF_XDP in later commits. Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: add Tx queue setup and mmap support  (Magnus Karlsson; 2 files, -0/+3)
Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate a queue, where the user process can pass frames to be transmitted by the kernel. The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset. Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: add umem completion queue support and mmap  (Magnus Karlsson; 1 file, -0/+2)
Here, we add another setsockopt for registered user memory (umem) called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the process can ask the kernel to allocate a queue (ring buffer) and also mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process. The queue is used to explicitly pass ownership of umem frames from the kernel to user process. This will be used by the TX path to tell user space that a certain frame has been transmitted and user space can use it for something else, if it wishes. Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: wire up XDP_SKB side of AF_XDP  (Björn Töpel; 1 file, -1/+1)
This commit wires up the xskmap to the XDP_SKB layer. Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP  (Björn Töpel; 4 files, -0/+36)
The xskmap is yet another BPF map, very much inspired by dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application adds AF_XDP sockets into the map, and by using the bpf_redirect_map helper, an XDP program can redirect XDP frames to an AF_XDP socket. Note that a socket that is bound to a certain ifindex/queue index will *only* accept XDP frames from that netdev/queue index. If an XDP program tries to redirect from a netdev/queue index other than what the socket is bound to, the frame will not be received on the socket. A socket can reside in multiple maps.
v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.
Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
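A sketch of an XDP program redirecting into the new map type; illustrative, written in modern libbpf map syntax, with bpf_redirect_map and BPF_MAP_TYPE_XSKMAP as introduced by this series:

    /* Sketch only. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } xsks SEC(".maps");

    SEC("xdp")
    int xdp_to_xsk(struct xdp_md *ctx)
    {
        /* Key by Rx queue index: the bound socket only accepts
         * frames from its own netdev/queue anyway. */
        return bpf_redirect_map(&xsks, ctx->rx_queue_index, 0);
    }

    char _license[] SEC("license") = "GPL";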
2018-05-03  xsk: add Rx receive functions and poll support  (Björn Töpel; 2 files, -0/+23)
Here the actual receive functions of AF_XDP are implemented; in a later commit they will be called from the XDP layers. There's one set of functions for the XDP_DRV side and another for XDP_SKB (generic). A new XDP API, xdp_return_buff, is also introduced: it is analogous to xdp_return_frame, but acts upon a struct xdp_buff. The API will be used by AF_XDP in future commits. Support for the poll syscall is also implemented.
v2: xskq_validate_id did not update cons_tail. The entries variable was calculated twice in xskq_nb_avail. Squashed xdp_return_buff commit.
Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: add support for bind for Rx  (Magnus Karlsson; 2 files, -0/+12)
Here, the bind syscall is added. Binding an AF_XDP socket means associating the socket with an umem, a netdev and a queue index. This can be done in two ways. The first way is creating a "socket from scratch": create the umem using the XDP_UMEM_REG setsockopt and an associated fill queue with XDP_UMEM_FILL_QUEUE, create the Rx queue using the XDP_RX_QUEUE setsockopt, then call bind passing an ifindex and queue index ("channel" in ethtool speak). The second way to bind a socket is simply skipping the umem/netdev/queue index and passing another, already set up AF_XDP socket. The new socket will then have the same umem/netdev/queue index as the parent, so it will share the same umem. You must also set the flags field in the socket address to XDP_SHARED_UMEM.
v2: Use PTR_ERR instead of passing the error variable explicitly.
Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
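For illustration, a user-space sketch of the bind step; struct sockaddr_xdp field names follow the AF_XDP uapi, and the function is not taken from the patch:

    /* Sketch only. */
    #include <string.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/if_xdp.h>

    static int xsk_bind(int xsk_fd, const char *ifname, __u32 queue_id)
    {
        struct sockaddr_xdp addr;

        memset(&addr, 0, sizeof(addr));
        addr.sxdp_family   = AF_XDP;
        addr.sxdp_ifindex  = if_nametoindex(ifname);
        addr.sxdp_queue_id = queue_id;      /* "channel" in ethtool speak */
        /* For the shared case instead:
         *   addr.sxdp_flags = XDP_SHARED_UMEM;
         *   addr.sxdp_shared_umem_fd = parent_fd; */

        return bind(xsk_fd, (struct sockaddr *)&addr, sizeof(addr));
    }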
2018-05-03  xsk: add Rx queue setup and mmap support  (Björn Töpel; 2 files, -0/+20)
Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate a queue, where the kernel can pass completed Rx frames to the user process. The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset. Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  xsk: add umem fill queue support and mmap  (Magnus Karlsson; 1 file, -0/+15)
Here, we add another setsockopt for registered user memory (umem) called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can ask the kernel to allocate a queue (ring buffer) and also mmap it (XDP_UMEM_PGOFF_FILL_QUEUE) into the process. The queue is used to explicitly pass ownership of umem frames from the user process to the kernel. These frames will in a later patch be filled in with Rx packet data by the kernel.
v2: Fixed potential crash in xsk_mmap.
Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
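A user-space sketch of the fill queue setup described above. The option names are the ones in this patch (later kernels renamed them to XDP_UMEM_FILL_RING/XDP_UMEM_PGOFF_FILL_RING), and the 4096 below is a placeholder, since the real size follows from the ring layout for ndescs descriptors:

    /* Sketch only. */
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_xdp.h>

    static void *setup_fill_queue(int xsk_fd, int ndescs)
    {
        /* ask the kernel to allocate a fill ring of ndescs entries */
        if (setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_FILL_QUEUE,
                       &ndescs, sizeof(ndescs)) < 0)
            return NULL;

        /* map it into the process at the dedicated page offset */
        return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_SHARED, xsk_fd, XDP_UMEM_PGOFF_FILL_QUEUE);
    }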
2018-05-03  xsk: add user memory registration support sockopt  (Björn Töpel; 2 files, -0/+65)
In this commit the base structure of the AF_XDP address family is set up. Further, we introduce the ability to register a window of user memory to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory window is viewed by an AF_XDP socket as a set of equally large frames. After a user memory registration, all frames are "owned" by the user application, and not the kernel.
v2: More robust checks on umem creation and unaccount on error. Call set_page_dirty_lock on cleanup. Simplified xdp_umem_reg.
Co-authored-by: Magnus Karlsson <[email protected]> Signed-off-by: Magnus Karlsson <[email protected]> Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
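For illustration, a user-space sketch of the registration call; struct xdp_umem_reg fields follow the AF_XDP uapi of this era (later kernels renamed frame_size/frame_headroom to chunk_size/headroom):

    /* Sketch only. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>

    #define NUM_FRAMES 1024
    #define FRAME_SIZE 2048

    static int reg_umem(int xsk_fd)
    {
        struct xdp_umem_reg req;
        void *buf;

        if (posix_memalign(&buf, getpagesize(), NUM_FRAMES * FRAME_SIZE))
            return -1;

        req.addr           = (__u64)(unsigned long)buf; /* window start  */
        req.len            = NUM_FRAMES * FRAME_SIZE;   /* window length */
        req.frame_size     = FRAME_SIZE;    /* equally sized frames */
        req.frame_headroom = 0;

        return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG,
                          &req, sizeof(req));
    }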
2018-05-03  net: initial AF_XDP skeleton  (Björn Töpel; 1 file, -1/+4)
Buildable skeleton of AF_XDP without any functionality. Just what it takes to register a new address family. Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2018-05-03  switchdev: Add fdb.added_by_user to switchdev notifications  (Petr Machata; 1 file, -0/+1)
The following patch enables sending notifications also for events on FDB entries that weren't added by the user. Give the drivers the information necessary to distinguish between the two origins of FDB entries. To maintain the current behavior, have switchdev-implementing drivers bail out on notifications about non-user-added FDB entries. In the case of the mlxsw driver, allow a call to mlxsw_sp_span_respin() so that SPAN over bridge catches up with the changed FDB. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Acked-by: Ivan Vecera <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds; 1 file, -3/+11)
Pull tracing fixes from Steven Rostedt:
"Various fixes in tracing:
 - Tracepoints should not give warning on OOM failures
 - Use special field for function pointer in trace event
 - Fix igrab issues in uprobes
 - Fixes to the new histogram triggers"

* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Do not warn on ENOMEM
  tracing: Add field modifier parsing hist error for hist triggers
  tracing: Add field parsing hist error for hist triggers
  tracing: Restore proper field flag printing when displaying triggers
  tracing: initcall: Ordered comparison of function pointers
  tracing: Remove igrab() iput() call from uprobes.c
  tracing: Fix bad use of igrab in trace_uprobe.c
2018-05-02  ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"  (Ido Schimmel; 1 file, -1/+2)
This reverts commit edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6"). Eric reported a division by zero in rt6_multipath_rebalance() which is caused by above commit that considers identical local routes to be siblings. The division by zero happens because a nexthop weight is not set for local routes. Revert the commit as it does not fix a bug and has side effects.
To reproduce:
  # ip -6 address add 2001:db8::1/64 dev dummy0
  # ip -6 address add 2001:db8::1/64 dev dummy1
Fixes: edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6") Signed-off-by: Ido Schimmel <[email protected]> Reported-by: Eric Dumazet <[email protected]> Tested-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  Revert "vhost: make msg padding explicit"  (Michael S. Tsirkin; 1 file, -1/+0)
This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406. Unfortunately the padding will break 32 bit userspace. Ouch. Need to add some compat code, revert for now. Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/tls: Don't recursively call push_record during tls_write_space callbacks  (Dave Watson; 1 file, -0/+1)
It is reported that in some cases, write_space may be called in do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:
  [ 660.468802] ? do_tcp_sendpages+0x8d/0x580
  [ 660.468826] ? tls_push_sg+0x74/0x130 [tls]
  [ 660.468852] ? tls_push_record+0x24a/0x390 [tls]
  [ 660.468880] ? tls_write_space+0x6a/0x80 [tls]
  ...
tls_push_sg already does a loop over all sending sg's, so ignore any tls_write_space notifications until we are done sending. We then have to call the previous write_space to wake up poll() waiters after we are done with the send loop. Reported-by: Andre Tomt <[email protected]> Signed-off-by: Dave Watson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  tcp: send in-queue bytes in cmsg upon read  (Soheil Hassas Yeganeh; 2 files, -1/+4)
Applications with many concurrent connections, high variance in receive queue length and tight memory bounds cannot allocate worst-case buffer size to drain sockets. Knowing the length of the receive queue, applications can optimize how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly available through ioctl(FIONREAD/SIOCINQ) and can be approximated using getsockopt(MEMINFO) (rmem_alloc includes skb overheads in addition to application data). But both of these options add an extra syscall per recvmsg. Moreover, ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket option is set, recvmsg() relays the number of bytes available on the socket for reading to the application via the TCP_CM_INQ control message. Calculate the number of bytes after releasing the socket lock to include the processed backlog, if any. To avoid an extra branch in the hot path of recvmsg() for this new control message, move all cmsg processing inside an existing branch for processing receive timestamps. Since the socket lock is not held when calculating the size of the receive queue, TCP_INQ is a hint. For example, it can overestimate the queue size by one byte if a FIN is received. With this method, applications can start reading from the socket using a small buffer, and then use larger buffers based on the remaining data when needed.

V3 change-log: As suggested by David Miller, added loads with barrier to check whether we have multiple threads calling recvmsg in parallel. When that happens we lock the socket to calculate inq.
V4 change-log: Removed inline from a static function.

Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Suggested-by: David Miller <[email protected]> Signed-off-by: David S. Miller <[email protected]>
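A user-space sketch of the new option; TCP_INQ and TCP_CM_INQ are as described above, and the fallback defines cover libc headers that predate the option (TCP_INQ is 36 in the Linux uapi):

    /* Sketch only. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_INQ
    #define TCP_INQ    36
    #define TCP_CM_INQ TCP_INQ
    #endif

    static ssize_t recv_with_inq(int fd, char *buf, size_t len)
    {
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm;
        ssize_t n = recvmsg(fd, &msg, 0);

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
            if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ) {
                int inq;

                memcpy(&inq, CMSG_DATA(cm), sizeof(inq));
                printf("still queued: %d bytes\n", inq); /* a hint, per above */
            }
        return n;
    }

    /* Enable once after connect()/accept():
     *   int one = 1;
     *   setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one)); */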
2018-05-01  connector: add parent pid and tgid to coredump and exit events  (Stefan Strogin; 1 file, -0/+4)
The intention is to get notified of process failures as soon as possible, before a possible core dump (which could take very long), e.g. in some process-manager. Coredump and exit process events are perfect for such use cases (see 2b5faa4c553f "connector: Added coredumping event to the process connector"). The problem is that for now the process-manager cannot know the parent of a dying process using connectors. This could be useful if the process-manager should monitor for failures only the children of certain parents, so we could filter the coredump and exit events by parent process and/or thread ID. Add the parent pid and tgid to coredump and exit process connector event data. Signed-off-by: Stefan Strogin <[email protected]> Acked-by: Evgeniy Polyakov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
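The coredump event payload in the uapi then carries the parent alongside the process, as sketched below; exit events gain the same two fields:

    /* Shape per uapi cn_proc.h after this patch. */
    #include <linux/types.h>

    struct coredump_proc_event {
        __kernel_pid_t process_pid;
        __kernel_pid_t process_tgid;
        __kernel_pid_t parent_pid;   /* new */
        __kernel_pid_t parent_tgid;  /* new */
    };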
2018-05-01  net: core: Inline netdev_features_size_check()  (Florian Fainelli; 1 file, -6/+0)
We do not require this inline function to be used in multiple different locations, just inline it where it gets used in register_netdevice(). Suggested-by: David Miller <[email protected]> Suggested-by: Stephen Hemminger <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  ipv6: Allow non-gateway ECMP for IPv6  (Thomas Winter; 1 file, -2/+1)
It is valid to have static routes where the nexthop is an interface, not an address, such as tunnels. For IPv4 it was possible to use ECMP on these routes but not for IPv6. Signed-off-by: Thomas Winter <[email protected]> Cc: David Ahern <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Alexey Kuznetsov <[email protected]> Cc: Hideaki YOSHIFUJI <[email protected]> Acked-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  vhost: make msg padding explicit  (Michael S. Tsirkin; 1 file, -0/+1)
There's a 32 bit hole just after type. It's best to give it a name; this way the compiler is forced to initialize it with the rest of the structure. Reported-by: Kevin Easton <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  sctp: allow sctp_init_cause to return errors  (Marcelo Ricardo Leitner; 1 file, -1/+1)
And do so if the skb doesn't have enough space for the payload. This is a preparation for the next patch. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/mlx5: Accel, Add TLS tx offload interface  (Ilya Lesokhin; 2 files, -16/+77)
Add routines for manipulating TLS TX offload contexts. In Innova TLS, TLS contexts are added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Add an implementation for Innova TLS (FPGA-based) hardware. These routines will be used by the TLS offload support in a later patch.

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs to work directly with mlx5_core rather than Innova FPGA or other mlx5 acceleration providers. In the future, when IPSec/TLS or any other acceleration gets integrated into the ConnectX chip, the mlx5/accel layer will provide the integrated acceleration, rather than the Innova one.

Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Acked-by: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/tls: Add generic NIC offload infrastructure  (Ilya Lesokhin; 1 file, -2/+67)
This patch adds a generic infrastructure to offload TLS crypto to a network device. It enables the kernel TLS socket to skip encryption and authentication operations on the transmit side of the data path, leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the TCP layer just like the SW KTLS implementation, using the same API. TCP segmentation is mostly unaffected. Currently the only exception is that we prevent mixed SKBs where only part of the payload requires offload. In the future we are likely to add a similar restriction following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows:

1. The offloaded implementation builds "plaintext TLS records"; those records contain plaintext instead of ciphertext, and placeholder bytes instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence number to TLS records. Thus, given a TCP SKB sent from a NIC offloaded TLS socket, we can use the TLS NIC offload infrastructure to obtain enough context to encrypt the payload of the SKB. A TLS record is released when the last byte of the record is ack'ed; this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload implementations. However, it is currently written with the implementation below in mind:

The NIC assumes that packets from each offloaded stream are sent as plaintext and in-order. It keeps track of the TLS records in the TCP stream. When a packet marked for offload is transmitted, the NIC encrypts the payload in-place and puts authentication tags in the relevant placeholders.

The responsibility for handling out-of-order packets (i.e. TCP retransmission, qdisc drops) falls on the netdev driver. The netdev driver keeps track of the expected TCP SN from the NIC's perspective. If the next packet to transmit matches the expected TCP SN, the driver advances the expected TCP SN and transmits the packet with a TLS offload indication. If the next packet to transmit does not match the expected TCP SN, the driver calls the TLS layer to obtain the TLS record that includes the TCP sequence number of the packet for transmission. Using this TLS record, the driver posts a work entry on the transmit queue to reconstruct the NIC TLS state required for the offload of the out-of-order packet. It updates the expected TCP SN accordingly and transmits the now in-order packet. The same queue is used for packet transmission and TLS context reconstruction to avoid the need for flushing the transmit queue before issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
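A pseudo-C sketch of the driver-side bookkeeping described above; every helper and field name here is hypothetical, only the flow follows the text:

    /* Pseudo-C sketch, not actual driver code. */
    static void drv_tx_tls_skb(struct drv_txq *txq, struct sk_buff *skb)
    {
        u32 seq = ntohl(tcp_hdr(skb)->seq);

        if (seq != txq->expected_tcp_sn) {
            /* Out-of-order (retransmit, qdisc drop): look up the TLS
             * record covering this sequence number and post a work
             * entry on the *same* queue to rebuild NIC TLS state, so
             * no queue flush is needed before the request. */
            struct tls_record *rec = drv_tls_record_for_seq(txq, seq);

            drv_post_tls_resync_work(txq, rec, seq);
        }
        txq->expected_tcp_sn = seq + drv_tcp_payload_len(skb);
        drv_xmit_with_tls_indication(txq, skb);
    }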
2018-05-01  net/tls: Split conf to rx + tx  (Boris Pismenny; 1 file, -18/+33)
In TLS inline crypto, we can have one direction in software and another in hardware. Thus, we split the TLS configuration into separate structures for receive and transmit. Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Add TLS TX offload features  (Ilya Lesokhin; 1 file, -0/+2)
This patch adds a netdev feature to configure TLS TX offloads. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Add TLS offload netdev ops  (Ilya Lesokhin; 1 file, -0/+24)
Add new netdev ops to add and delete TLS contexts. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
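The approximate shape of the new ops is sketched below; this is hedged, so consult the patch for the exact signatures at this commit:

    /* Approximate shape only. */
    struct tlsdev_ops {
        int  (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                            enum tls_offload_ctx_dir direction,
                            struct tls_crypto_info *crypto_info,
                            u32 start_offload_tcp_sn);
        void (*tls_dev_del)(struct net_device *netdev,
                            struct tls_context *ctx,
                            enum tls_offload_ctx_dir direction);
    };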
2018-05-01  net: Add Software fallback infrastructure for socket dependent offloads  (Ilya Lesokhin; 1 file, -0/+21)
With socket dependent offloads we rely on the netdev to transform the transmitted packets before sending them to the wire. When a packet from an offloaded socket is rerouted to a different device we need to detect it and do the transformation in software. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Rename and export copy_skb_header  (Ilya Lesokhin; 1 file, -0/+1)
copy_skb_header is renamed to skb_copy_header and exported. Exposing this function gives more flexibility in copying SKBs. skb_copy and skb_copy_expand do not give enough control over which parts are copied. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>