aboutsummaryrefslogtreecommitdiff
path: root/include/uapi
AgeCommit message (Collapse)AuthorFilesLines
2018-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+55
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-08-13 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add driver XDP support for veth. This can be used in conjunction with redirect of another XDP program e.g. sitting on NIC so the xdp_frame can be forwarded to the peer veth directly without modification, from Toshiaki. 2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT in order to provide more control and visibility on where a SO_REUSEPORT sk should be located, and the latter enables to directly select a sk from the bpf map. This also enables map-in-map for application migration use cases, from Martin. 3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id of cgroup v2 that is the ancestor of the cgroup associated with the skb at the ancestor_level, from Andrey. 4) Implement BPF fs map pretty-print support based on BTF data for regular hash table and LRU map, from Yonghong. 5) Decouple the ability to attach BTF for a map from the key and value pretty-printer in BPF fs, and enable further support of BTF for maps for percpu and LPM trie, from Daniel. 6) Implement a better BPF sample of using XDP's CPU redirect feature for load balancing SKB processing to remote CPU. The sample implements the same XDP load balancing as Suricata does which is symmetric hash based on IP and L4 protocol, from Jesper. 7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s critical path as it is ensured that the allocator is present, from Björn. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-08-13ipv6: Add icmp_echo_ignore_all support for ICMPv6Virgile Jarry1-1/+2
Preventing the kernel from responding to ICMP Echo Requests messages can be useful in several ways. The sysctl parameter 'icmp_echo_ignore_all' can be used to prevent the kernel from responding to IPv4 ICMP echo requests. For IPv6 pings, such a sysctl kernel parameter did not exist. Add the ability to prevent the kernel from responding to IPv6 ICMP echo requests through the use of the following sysctl parameter : /proc/sys/net/ipv6/icmp/echo_ignore_all. Update the documentation to reflect this change. Signed-off-by: Virgile Jarry <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-13Merge tag 'asoc-v4.19' of ↵Takashi Iwai5-8/+8
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Updates for v4.19 A fairly big update, including quite a bit of core activity this time around (which is good to see) along with a fairly large set of new drivers. - A new snd_pcm_stop_xrun() helper which is now used in several drivers. - Support for providing name prefixes to generic component nodes. - Quite a few fixes for DPCM as it gains a bit wider use and more robust testing. - Generalization of the DIO2125 support to a simple amplifier driver. - Accessory detection support for the audio graph card. - DT support for PXA AC'97 devices. - Quirks for a number of new x86 systems. - Support for AM Logic Meson, Everest ES7154, Intel systems with RT5682, Qualcomm QDSP6 and WCD9335, Realtek RT5682 and TI TAS5707.
2018-08-13bpf: Introduce bpf_skb_ancestor_cgroup_id helperAndrey Ignatov1-1/+20
== Problem description == It's useful to be able to identify cgroup associated with skb in TC so that a policy can be applied to this skb, and existing bpf_skb_cgroup_id helper can help with this. Though in real life cgroup hierarchy and hierarchy to apply a policy to don't map 1:1. It's often the case that there is a container and corresponding cgroup, but there are many more sub-cgroups inside container, e.g. because it's delegated to containerized application to control resources for its subsystems, or to separate application inside container from infra that belongs to containerization system (e.g. sshd). At the same time it may be useful to apply a policy to container as a whole. If multiple containers like this are run on a host (what is often the case) and many of them have sub-cgroups, it may not be possible to apply per-container policy in TC with existing helpers such as bpf_skb_under_cgroup or bpf_skb_cgroup_id: * bpf_skb_cgroup_id will return id of immediate cgroup associated with skb, i.e. if it's a sub-cgroup inside container, it can't be used to identify container's cgroup; * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale, i.e. if there are N containers on a host and a policy has to be applied to M of them (0 <= M <= N), it'd require M calls to bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild & load new BPF program. == Solution == The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be used to get id of cgroup v2 that is an ancestor of cgroup associated with skb at specified level of cgroup hierarchy. That way admin can place all containers on one level of cgroup hierarchy (what is a good practice in general and already used in many configurations) and identify specific cgroup on this level no matter what sub-cgroup skb is associated with. E.g. if there is a cgroup hierarchy: root/ root/container1/ root/container1/app11/ root/container1/app11/sub-app-a/ root/container1/app12/ root/container2/ root/container2/app21/ root/container2/app22/ root/container2/app22/sub-app-b/ , then having skb associated with root/container1/app11/sub-app-a/ it's possible to get ancestor at level 1, what is container1 and apply policy for this container, or apply another policy if it's container2. Policies can be kept e.g. in a hash map where key is a container cgroup id and value is an action. Levels where container cgroups are created are usually known in advance whether cgroup hierarchy inside container may be hard to predict especially in case when its creation is delegated to containerized application. == Implementation details == The helper gets ancestor by walking parents up to specified level. Another option would be to get different kind of "id" from cgroup->ancestor_ids[level] and use it with idr_find() to get struct cgroup for ancestor. But that would require radix lookup what doesn't seem to be better (at least it's not obviously better). Format of return value of the new helper is same as that of bpf_skb_cgroup_id. Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-08-11bcache: style fix to add a blank line after declarationsColy Li1-0/+2
Signed-off-by: Coly Li <[email protected]> Reviewed-by: Shenghui Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-11bcache: style fix to replace 'unsigned' by 'unsigned int'Coly Li1-3/+3
This patch fixes warning reported by checkpatch.pl by replacing 'unsigned' with 'unsigned int'. Signed-off-by: Coly Li <[email protected]> Reviewed-by: Shenghui Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-11l2tp: remove pppol2tp_session_ioctl()Guillaume Nault1-1/+1
pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS on session sockets. We just need to copy the stats and set ->session_id. As a side effect of sharing session and tunnel code, ->using_ipsec is properly set even when the request was made using a session socket. Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-11bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORTMartin KaFai Lau1-1/+35
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN. BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern" to store the bpf context instead of using the skb->cb[48]. At the SO_REUSEPORT sk lookup time, it is in the middle of transiting from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this point, it is not always clear where the bpf context can be appended in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not clear if the lower layer is only ipv4 and ipv6 in the future and will it not touch the cb[] again before transiting to the upper layer. For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB instead of IP[6]CB and it may still modify the cb[] after calling the udp[46]_lib_lookup_skb(). Because of the above reason, if sk->cb is used for the bpf ctx, saving-and-restoring is needed and likely the whole 48 bytes cb[] has to be saved and restored. Instead of saving, setting and restoring the cb[], this patch opts to create a new "struct sk_reuseport_kern" and setting the needed values in there. The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)" will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol specific usage at this point and it is also inline with the current sock_reuseport.c implementation (i.e. no protocol specific requirement). In "struct sk_reuseport_md", this patch exposes data/data_end/len with semantic similar to other existing usages. Together with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()", the bpf prog can peek anywhere in the skb. The "bind_inany" tells the bpf prog that the reuseport group is bind-ed to a local INANY address which cannot be learned from skb. The new "bind_inany" is added to "struct sock_reuseport" which will be used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order to avoid repeating the "bind INANY" test on "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can only be properly initialized when a "sk->sk_reuseport" enabled sk is adding to a hashtable (i.e. during "reuseport_alloc()" and "reuseport_add_sock()"). The new "sk_select_reuseport()" is the main helper that the bpf prog will use to select a SO_REUSEPORT sk. It is the only function that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in the earlier patch, the validity of a selected sk is checked in run time in "sk_select_reuseport()". Doing the check in verification time is difficult and inflexible (consider the map-in-map use case). The runtime check is to compare the selected sk's reuseport_id with the reuseport_id that we want. This helper will return -EXXX if the selected sk cannot serve the incoming request (e.g. reuseport_id not match). The bpf prog can decide if it wants to do SK_DROP as its discretion. When the bpf prog returns SK_PASS, the kernel will check if a valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL"). If it does , it will use the selected sk. If not, the kernel will select one from "reuse->socks[]" (as before this patch). The SK_DROP and SK_PASS handling logic will be in the next patch. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-08-11bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAYMartin KaFai Lau1-0/+1
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. To unleash the full potential of a bpf prog, it is essential for the userspace to be capable of directly setting up a bpf map which can then be consumed by the bpf prog to make decision. In this case, decide which SO_REUSEPORT sk to serve the incoming request. By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control and visibility on where a SO_REUSEPORT sk should be located in a bpf map. The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that the bpf prog can directly select a sk from the bpf map. That will raise the programmability of the bpf prog attached to a reuseport group (a group of sk serving the same IP:PORT). For example, in UDP, the bpf prog can peek into the payload (e.g. through the "data" pointer introduced in the later patch) to learn the application level's connection information and then decide which sk to pick from a bpf map. The userspace can tightly couple the sk's location in a bpf map with the application logic in generating the UDP payload's connection information. This connection info contact/API stays within the userspace. Also, when used with map-in-map, the userspace can switch the old-server-process's inner map to a new-server-process's inner map in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)". The bpf prog will then direct incoming requests to the new process instead of the old process. The old process can finish draining the pending requests (e.g. by "accept()") before closing the old-fds. [Note that deleting a fd from a bpf map does not necessary mean the fd is closed] During map_update_elem(), Only SO_REUSEPORT sk (i.e. which has already been added to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is "bind()" for UDP or "bind()+listen()" for TCP. These conditions are ensured in "reuseport_array_update_check()". A SO_REUSEPORT sk can only be added once to a map (i.e. the same sk cannot be added twice even to the same map). SO_REUSEPORT already allows another sk to be created for the same IP:PORT. There is no need to re-create a similar usage in the BPF side. When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"), it will notify the bpf map to remove it from the map also. It is done through "bpf_sk_reuseport_detach()" and it will only be called if >=1 of the "reuse->sock[]" has ever been added to a bpf map. The map_update()/map_delete() has to be in-sync with the "reuse->socks[]". Hence, the same "reuseport_lock" used by "reuse->socks[]" has to be used here also. Care has been taken to ensure the lock is only acquired when the adding sk passes some strict tests. and freeing the map does not require the reuseport_lock. The reuseport_array will also support lookup from the syscall side. It will return a sock_gen_cookie(). The sock_gen_cookie() is on-demand (i.e. a sk's cookie is not generated until the very first map_lookup_elem()). The lookup cookie is 64bits but it goes against the logical userspace expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also). It may catch user in surprise if we enforce value_size=8 while userspace still pass a 32bits fd during update. Supporting different value_size between lookup and update seems unintuitive also. We also need to consider what if other existing fd based maps want to return 64bits value from syscall's lookup in the future. Hence, reuseport_array supports both value_size 4 and 8, and assuming user will usually use value_size=4. The syscall's lookup will return ENOSPC on value_size=4. It will will only return 64bits value from sock_gen_cookie() when user consciously choose value_size=8 (as a signal that lookup is desired) which then requires a 64bits value in both lookup and update. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-08-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller3-1/+27
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains netfilter updates for your net-next tree: 1) Expose NFT_OSF_MAXGENRELEN maximum OS name length from the new OS passive fingerprint matching extension, from Fernando Fernandez. 2) Add extension to support for fine grain conntrack timeout policies from nf_tables. As preparation works, this patchset moves nf_ct_untimeout() to nf_conntrack_timeout and it also decouples the timeout policy from the ctnl_timeout object, most work done by Harsha Sharma. 3) Enable connection tracking when conntrack helper is in place. 4) Missing enumeration in uapi header when splitting original xt_osf to nfnetlink_osf, also from Fernando. 5) Fix a sparse warning due to incorrect typing in the nf_osf_find(), from Wei Yongjun. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-08-08netfilter: nfnetlink_osf: add missing enum in nfnetlink_osf uapi headerFernando Fernandez Mancera2-0/+13
xt_osf_window_size_options was originally part of include/uapi/linux/netfilter/xt_osf.h, restore it. Fixes: bfb15f2a95cb ("netfilter: extract Passive OS fingerprint infrastructure from xt_osf") Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-07net/sched: allow flower to match tunnel optionsPieter Jansen van Vuuren1-0/+26
Allow matching on options in Geneve tunnel headers. This makes use of existing tunnel metadata support. The options can be described in the form CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit hexadecimal value and DATA as a variable length hexadecimal value. e.g. # ip link add name geneve0 type geneve dstport 0 external # tc qdisc add dev geneve0 ingress # tc filter add dev geneve0 protocol ip parent ffff: \ flower \ enc_src_ip 10.0.99.192 \ enc_dst_ip 10.0.99.193 \ enc_key_id 11 \ geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \ ip_proto udp \ action mirred egress redirect dev eth1 This patch adds support for matching Geneve options in the order supplied by the user. This leads to an efficient implementation in the software datapath (and in our opinion hardware datapaths that offload this feature). It is also compatible with Geneve options matching provided by the Open vSwitch kernel datapath which is relevant here as the Flower classifier may be used as a mechanism to program flows into hardware as a form of Open vSwitch datapath offload (sometimes referred to as OVS-TC). The netlink Kernel/Userspace API may be extended, for example by adding a flag, if other matching options are desired, for example matching given options in any order. This would require an implementation in the TC software datapath. And be done in a way that drivers that facilitate offload of the Flower classifier can reject or accept such flows based on hardware datapath capabilities. This approach was discussed and agreed on at Netconf 2017 in Seoul. Signed-off-by: Simon Horman <[email protected]> Signed-off-by: Pieter Jansen van Vuuren <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-07ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKEFlorian Fainelli1-1/+4
Add the ability to specify through ethtool::rxnfc that a rule location is special and will be used to participate in Wake-on-LAN, by e.g: having a specific pattern be matched. When this is the case, fs->ring_cookie must be set to the special value RX_CLS_FLOW_WAKE. We also define an additional ethtool::wolinfo flag: WAKE_FILTER which can be used to configure an Ethernet adapter to allow Wake-on-LAN using previously programmed filters. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+40
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-08-07 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add cgroup local storage for BPF programs, which provides a fast accessible memory for storing various per-cgroup data like number of transmitted packets, etc, from Roman. 2) Support bpf_get_socket_cookie() BPF helper in several more program types that have a full socket available, from Andrey. 3) Significantly improve the performance of perf events which are reported from BPF offload. Also convert a couple of BPF AF_XDP samples overto use libbpf, both from Jakub. 4) seg6local LWT provides the End.DT6 action, which allows to decapsulate an outer IPv6 header containing a Segment Routing Header. Adds this action now to the seg6local BPF interface, from Mathieu. 5) Do not mark dst register as unbounded in MOV64 instruction when both src and dst register are the same, from Arthur. 6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier instructions on arm64 for the AF_XDP sample code, from Brian. 7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts over from Python 2 to Python 3, from Jeremy. 8) Enable BTF build flags to the BPF sample code Makefile, from Taeung. 9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee. 10) Several improvements to the README.rst from the BPF documentation to make it more consistent with RST format, from Tobin. 11) Replace all occurrences of strerror() by calls to strerror_r() in libbpf and fix a FORTIFY_SOURCE build error along with it, from Thomas. 12) Fix a bug in bpftool's get_btf() function to correctly propagate an error via PTR_ERR(), from Yue. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-08-07netfilter: nft_ct: add ct timeout supportHarsha Sharma1-1/+13
This patch allows to add, list and delete connection tracking timeout policies via nft objref infrastructure and assigning these timeout via nft rule. %./libnftnl/examples/nft-ct-timeout-add ip raw cttime tcp Ruleset: table ip raw { ct timeout cttime { protocol tcp; policy = {established: 111, close: 13 } } chain output { type filter hook output priority -300; policy accept; ct timeout set "cttime" } } %./libnftnl/examples/nft-rule-ct-timeout-add ip raw output cttime %conntrack -E [NEW] tcp 6 111 ESTABLISHED src=172.16.19.128 dst=172.16.19.1 sport=22 dport=41360 [UNREPLIED] src=172.16.19.1 dst=172.16.19.128 sport=41360 dport=22 %nft delete rule ip raw output handle <handle> %./libnftnl/examples/nft-ct-timeout-del ip raw cttime Joint work with Pablo Neira. Signed-off-by: Harsha Sharma <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-07netfilter: nft_osf: use NFT_OSF_MAXGENRELEN instead of IFNAMSIZFernando Fernandez Mancera1-0/+1
As no "genre" on pf.os exceed 16 bytes of length, we reduce NFT_OSF_MAXGENRELEN parameter to 16 bytes and use it instead of IFNAMSIZ. Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-06vhost: switch to use new message formatJason Wang1-0/+18
We use to have message like: struct vhost_msg { int type; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; Unfortunately, there will be a hole of 32bit in 64bit machine because of the alignment. This leads a different formats between 32bit API and 64bit API. What's more it will break 32bit program running on 64bit machine. So fixing this by introducing a new message type with an explicit 32bit reserved field after type like: struct vhost_msg_v2 { __u32 type; __u32 reserved; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; We will have a consistent ABI after switching to use this. To enable this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2). Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-06KVM: X86: Implement PV IPIs in linux guestWanpeng Li1-0/+1
Implement paravirtual apic hooks to enable PV IPIs for KVM if the "send IPI" hypercall is available. The hypercall lets a guest send IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06KVM: X86: Implement "send IPI" hypercallWanpeng Li1-0/+1
Using hypercall to send IPIs by one vmexit instead of one by one for xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster mode. Intel guest can enter x2apic cluster mode when interrupt remmaping is enabled in qemu, however, latest AMD EPYC still just supports xapic mode which can get great improvement by Exit-less IPIs. This patchset lets a guest send multicast IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141): x2apic cluster mode, vanilla Dry-run: 0, 2392199 ns Self-IPI: 6907514, 15027589 ns Normal IPI: 223910476, 251301666 ns Broadcast IPI: 0, 9282161150 ns Broadcast lock: 0, 8812934104 ns x2apic cluster mode, pv-ipi Dry-run: 0, 2449341 ns Self-IPI: 6720360, 15028732 ns Normal IPI: 228643307, 255708477 ns Broadcast IPI: 0, 7572293590 ns => 22% performance boost Broadcast lock: 0, 8316124651 ns x2apic physical mode, vanilla Dry-run: 0, 3135933 ns Self-IPI: 8572670, 17901757 ns Normal IPI: 226444334, 255421709 ns Broadcast IPI: 0, 19845070887 ns Broadcast lock: 0, 19827383656 ns x2apic physical mode, pv-ipi Dry-run: 0, 2446381 ns Self-IPI: 6788217, 15021056 ns Normal IPI: 219454441, 249583458 ns Broadcast IPI: 0, 7806540019 ns => 154% performance boost Broadcast lock: 0, 9143618799 ns Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06kvm: nVMX: Introduce KVM_CAP_NESTED_STATEJim Mattson1-0/+4
For nested virtualization L0 KVM is managing a bit of state for L2 guests, this state can not be captured through the currently available IOCTLs. In fact the state captured through all of these IOCTLs is usually a mix of L1 and L2 state. It is also dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM that is in VMX operation. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Jim Mattson <[email protected]> [karahmed@ - rename structs and functions and make them ready for AMD and address previous comments. - handle nested.smm state. - rebase & a bit of refactoring. - Merge 7/8 and 8/8 into one patch. ] Signed-off-by: KarimAllah Ahmed <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-08-06Merge tag 'v4.18-rc6' into HEADPaolo Bonzini7-109/+93
Pull bug fixes into the KVM development tree to avoid nasty conflicts.
2018-08-06aio: implement IOCB_CMD_POLLChristoph Hellwig1-4/+2
Simple one-shot poll through the io_submit() interface. To poll for a file descriptor the application should submit an iocb of type IOCB_CMD_POLL. It will poll the fd for the events specified in the the first 32 bits of the aio_buf field of the iocb. Unlike poll or epoll without EPOLLONESHOT this interface always works in one shot mode, that is once the iocb is completed, it will have to be resubmitted. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Avi Kivity <[email protected]>
2018-08-05Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe5-101/+63
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <[email protected]>
2018-08-05ip: discard IPv4 datagrams with overlapping segments.Peter Oskolkov1-0/+1
This behavior is required in IPv6, and there is little need to tolerate overlapping fragments in IPv4. This change simplifies the code and eliminates potential DDoS attack vectors. Tested: ran ip_defrag selftest (not yet available uptream). Suggested-by: David S. Miller <[email protected]> Signed-off-by: Peter Oskolkov <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Florian Westphal <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller4-10/+128
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree: 1) Support for transparent proxying for nf_tables, from Mate Eckl. 2) Patchset to add OS passive fingerprint recognition for nf_tables, from Fernando Fernandez. This takes common code from xt_osf and place it into the new nfnetlink_osf module for codebase sharing. 3) Lightweight tunneling support for nf_tables. 4) meta and lookup are likely going to be used in rulesets, make them direct calls. From Florian Westphal. A bunch of incremental updates: 5) use PTR_ERR_OR_ZERO() from nft_numgen, from YueHaibing. 6) Use kvmalloc_array() to allocate hashtables, from Li RongQing. 7) Explicit dependencies between nfnetlink_cttimeout and conntrack timeout extensions, from Harsha Sharma. 8) Simplify NLM_F_CREATE handling in nf_tables. 9) Removed unused variable in the get element command, from YueHaibing. 10) Expose bridge hook priorities through uapi, from Mate Eckl. And a few fixes for previous Netfilter batch for net-next: 11) Use per-netns mutex from flowtable event, from Florian Westphal. 12) Remove explicit dependency on iptables CT target from conntrack zones, from Florian. 13) Fix use-after-free in rmmod nf_conntrack path, also from Florian. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-08-04ethtool: Remove trailing semicolon for static inlineFlorian Fainelli1-2/+2
Android's header sanitization tool chokes on static inline functions having a trailing semicolon, leading to an incorrectly parsed header file. While the tool should obviously be fixed, also fix the header files for the two affected functions: ethtool_get_flow_spec_ring() and ethtool_get_flow_spec_ring_vf(). Fixes: 8cf6f497de40 ("ethtool: Add helper routines to pass vf to rx_flow_spec") Reporetd-by: Blair Prescott <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-03netfilter: bridge: Expose nf_tables bridge hook priorities through uapiMáté Eckl1-0/+11
Netfilter exposes standard hook priorities in case of ipv4, ipv6 and arp but not in case of bridge. This patch exposes the hook priority values of the bridge family (which are different from the formerly mentioned) via uapi so that they can be used by user-space applications just like the others. Signed-off-by: Máté Eckl <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-03netfilter: nf_tables: match on tunnel metadataPablo Neira Ayuso1-0/+15
This patch allows us to match on the tunnel metadata that is available of the packet. We can use this to validate if the packet comes from/goes to tunnel and the corresponding tunnel ID. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-03netfilter: nf_tables: add tunnel supportPablo Neira Ayuso1-1/+68
This patch implements the tunnel object type that can be used to configure tunnels via metadata template through the existing lightweight API from the ingress path. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-03l2tp: ignore L2TP_ATTR_MTUGuillaume Nault1-1/+1
This attribute's handling is broken. It can only be used when creating Ethernet pseudo-wires, in which case its value can be used as the initial MTU for the l2tpeth device. However, when handling update requests, L2TP_ATTR_MTU only modifies session->mtu. This value is never propagated to the l2tpeth device. Dump requests also return the value of session->mtu, which is not synchronised anymore with the device MTU. The same problem occurs if the device MTU is properly updated using the generic IFLA_MTU attribute. In this case, session->mtu is not updated, and L2TP_ATTR_MTU will report an invalid value again when dumping the session. It does not seem worthwhile to complexify l2tp_eth.c to synchronise session->mtu with the device MTU. Even the ip-l2tp manpage advises to use 'ip link' to initialise the MTU of l2tpeth devices (iproute2 does not handle L2TP_ATTR_MTU at all anyway). So let's just ignore it entirely. Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-03netfilter: nfnetlink_osf: rename nf_osf header file to nfnetlink_osfFernando Fernandez Mancera2-1/+1
The first client of the nf_osf.h userspace header is nft_osf, coming in this batch, rename it to nfnetlink_osf.h as there are no userspace clients for this yet, hence this looks consistent with other nfnetlink subsystem. Suggested-by: Jan Engelhardt <[email protected]> Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-03netfilter: nf_osf: move nf_osf_fingers to non-uapi header fileFernando Fernandez Mancera1-2/+0
All warnings (new ones prefixed by >>): >> ./usr/include/linux/netfilter/nf_osf.h:73: userspace cannot reference function or variable defined in the kernel Fixes: f9324952088f ("netfilter: nfnetlink_osf: extract nfnetlink_subsystem code from xt_osf.c") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-08-02iw_cxgb4: RDMA write with immediate supportPotnuri Bharat Teja1-1/+2
Adds iw_cxgb4 functionality to support RDMA_WRITE_WITH_IMMEDATE opcode. Signed-off-by: Potnuri Bharat Teja <[email protected]> Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-02RDMA/hns: Support flush cqe for hip08 in kernel spaceYixian Liu1-0/+1
According to IB protocol, there are some cases that work requests must return the flush error completion status through the completion queue. Due to hardware limitation, the driver needs to assist the flush process. This patch adds the support of flush cqe for hip08 in the cases that needed, such as poll cqe, post send, post recv and aeqe handle. The patch also considered the compatibility between kernel and user space. Signed-off-by: Yixian Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-03bpf: introduce the bpf_get_local_storage() helper functionRoman Gushchin1-1/+20
The bpf_get_local_storage() helper function is used to get a pointer to the bpf local storage from a bpf program. It takes a pointer to a storage map and flags as arguments. Right now it accepts only cgroup storage maps, and flags argument has to be 0. Further it can be extended to support other types of local storage: e.g. thread local storage etc. Signed-off-by: Roman Gushchin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-08-03bpf: introduce cgroup storage mapsRoman Gushchin1-0/+6
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps: a special type of maps which are implementing the cgroup storage. >From the userspace point of view it's almost a generic hash map with the (cgroup inode id, attachment type) pair used as a key. The only difference is that some operations are restricted: 1) a user can't create new entries, 2) a user can't remove existing entries. The lookup from userspace is o(log(n)). Signed-off-by: Roman Gushchin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-08-02Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+2
The BTF conflicts were simple overlapping changes. The virtio_net conflict was an overlap of a fix of statistics counter, happening alongisde a move over to a bonafide statistics structure rather than counting value on the stack. Signed-off-by: David S. Miller <[email protected]>
2018-08-02media: v4l: Add new 10-bit packed grayscale formatTodor Tomov1-0/+1
The new format will be called V4L2_PIX_FMT_Y10P. It is similar to the V4L2_PIX_FMT_SBGGR10P family formats but V4L2_PIX_FMT_Y10P is a grayscale format. Signed-off-by: Todor Tomov <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2018-08-02media: v4l: Add new 2X8 10-bit grayscale media bus codeTodor Tomov1-1/+2
The code will be called MEDIA_BUS_FMT_Y10_2X8_PADHI_LE. It is similar to MEDIA_BUS_FMT_SBGGR10_2X8_PADHI_LE but MEDIA_BUS_FMT_Y10_2X8_PADHI_LE describes grayscale data. Signed-off-by: Todor Tomov <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2018-08-02media: doc-rst: Add packed Bayer raw14 pixel formatsSakari Ailus1-0/+5
These formats are compressed 14-bit raw bayer formats with four different pixel orders. They are similar to 10-bit variants. The formats added by this patch are V4L2_PIX_FMT_SBGGR14P V4L2_PIX_FMT_SGBRG14P V4L2_PIX_FMT_SGRBG14P V4L2_PIX_FMT_SRGGB14P Signed-off-by: Sakari Ailus <[email protected]> Acked-by: Hans Verkuil <[email protected]> Signed-off-by: Todor Tomov <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2018-08-01tcp: add stat of data packet reordering eventsWei Wang1-0/+2
Introduce a new TCP stats to record the number of reordering events seen and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Application can use this stats to track the frequency of the reordering events in addition to the existing reordering stats which tracks the magnitude of the latest reordering event. Note: this new stats tracks reordering events triggered by ACKs, which could often be fewer than the actual number of packets being delivered out-of-order. Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-01tcp: add dsack blocks received statsWei Wang1-0/+2
Introduce a new TCP stat to record the number of DSACK blocks received (RFC4989 tcpEStatsStackDSACKDups) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-01tcp: add data bytes retransmitted statsWei Wang1-0/+2
Introduce a new TCP stat to record the number of bytes retransmitted (RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-01tcp: add data bytes sent statsWei Wang1-1/+3
Introduce a new TCP stat to record the number of bytes sent (RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info (TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS). Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-08-01Merge tag 'drm-msm-next-2018-07-30' of ↵Dave Airlie1-0/+13
git://people.freedesktop.org/~robclark/linux into drm-next A bit larger this time around, due to introduction of "dpu1" support for the display controller in sdm845 and beyond. This has been on list and undergoing refactoring since Feb (going from ~110kloc to ~30kloc), and all my review complaints have been addressed, so I'd be happy to see this upstream so further feature work can procede on top of upstream. Also includes the gpu coredump support, which should be useful for debugging gpu crashes. And various other misc fixes and such. Signed-off-by: Dave Airlie <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGv-8y3zguY0Mj1vh=o+vrv_bJ8AwZ96wBXYPvMeQT2XcA@mail.gmail.com
2018-07-31macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driverFinn Thain1-1/+1
Now that the PowerMac via-pmu driver supports m68k PowerBooks, switch over to that driver and remove the via-pmu68k driver. Tested-by: Stan Johnson <[email protected]> Signed-off-by: Finn Thain <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-07-31macintosh/via-pmu68k: Don't load driver on unsupported hardwareFinn Thain1-1/+1
Don't load the via-pmu68k driver on early PowerBooks. The M50753 PMU device found in those models was never supported by this driver. Attempting to load the driver usually causes a boot hang. Signed-off-by: Finn Thain <[email protected]> Reviewed-by: Michael Schmitz <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-07-31bpf: Support bpf_get_socket_cookie in more prog typesAndrey Ignatov1-0/+14
bpf_get_socket_cookie() helper can be used to identify skb that correspond to the same socket. Though socket cookie can be useful in many other use-cases where socket is available in program context. Specifically BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS programs can benefit from it so that one of them can augment a value in a map prepared earlier by other program for the same socket. The patch adds support to call bpf_get_socket_cookie() from BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS. It doesn't introduce new helpers. Instead it reuses same helper name bpf_get_socket_cookie() but adds support to this helper to accept `struct bpf_sock_addr` and `struct bpf_sock_ops`. Documentation in bpf.h is changed in a way that should not break automatic generation of markdown. Signed-off-by: Andrey Ignatov <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-07-30KVM: s390: Add huge page enablement controlJanosch Frank1-0/+1
General KVM huge page support on s390 has to be enabled via the kvm.hpage module parameter. Either nested or hpage can be enabled, as we currently do not support vSIE for huge backed guests. Once the vSIE support is added we will either drop the parameter or enable it as default. For a guest the feature has to be enabled through the new KVM_CAP_S390_HPAGE_1M capability and the hpage module parameter. Enabling it means that cmm can't be enabled for the vm and disables pfmf and storage key interpretation. This is due to the fact that in some cases, in upcoming patches, we have to split huge pages in the guest mapping to be able to set more granular memory protection on 4k pages. These split pages have fake page tables that are not visible to the Linux memory management which subsequently will not manage its PGSTEs, while the SIE will. Disabling these features lets us manage PGSTE data in a consistent matter and solve that problem. Signed-off-by: Janosch Frank <[email protected]> Reviewed-by: David Hildenbrand <[email protected]>
2018-07-30media: dvb/audio.h: get rid of unused APIsMauro Carvalho Chehab1-37/+0
There are a number of other ioctls that aren't used anywhere inside the Kernel tree. Get rid of them. Signed-off-by: Mauro Carvalho Chehab <[email protected]>