aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2019-08-31net: tls: export protocol version, cipher, tx_conf/rx_conf to socket diagDavide Caratti1-0/+17
When an application configures kernel TLS on top of a TCP socket, it's now possible for inet_diag_handler() to collect information regarding the protocol version, the cipher type and TX / RX configuration, in case INET_DIAG_INFO is requested. Signed-off-by: Davide Caratti <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-31tcp: ulp: add functions to dump ulp-specific informationDavide Caratti1-0/+3
currently, only getsockopt(TCP_ULP) can be invoked to know if a ULP is on top of a TCP socket. Extend idiag_get_aux() and idiag_get_aux_size(), introduced by commit b37e88407c1d ("inet_diag: allow protocols to provide additional data"), to report the ULP name and other information that can be made available by the ULP through optional functions. Users having CAP_NET_ADMIN privileges will then be able to retrieve this information through inet_diag_handler, if they specify INET_DIAG_INFO in the request. Signed-off-by: Davide Caratti <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-31net/tls: use RCU protection on icsk->icsk_ulp_dataJakub Kicinski2-3/+8
We need to make sure context does not get freed while diag code is interrogating it. Free struct tls_context with kfree_rcu(). We add the __rcu annotation directly in icsk, and cast it away in the datapath accessor. Presumably all ULPs will do a similar thing. Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-30udp: Remove unlikely() from IS_ERR*() conditionDenis Efremov1-1/+1
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses unlikely() internally. Signed-off-by: Denis Efremov <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Joe Perches <[email protected]> Cc: Andrew Morton <[email protected]> Cc: [email protected] Signed-off-by: David S. Miller <[email protected]>
2019-08-31xsk: add support to allow unaligned chunk placementKevin Laatz1-4/+71
Currently, addresses are chunk size aligned. This means, we are very restricted in terms of where we can place chunk within the umem. For example, if we have a chunk size of 2k, then our chunks can only be placed at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0). This patch introduces the ability to use unaligned chunks. With these changes, we are no longer bound to having to place chunks at a 2k (or whatever your chunk size is) interval. Since we are no longer dealing with aligned chunks, they can now cross page boundaries. Checks for page contiguity have been added in order to keep track of which pages are followed by a physically contiguous page. Signed-off-by: Kevin Laatz <[email protected]> Signed-off-by: Ciara Loftus <[email protected]> Signed-off-by: Bruce Richardson <[email protected]> Acked-by: Jonathan Lemon <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-30cfg80211: add local BSS receive time to survey informationFelix Fietkau1-0/+4
This is useful for checking how much airtime is being used up by other transmissions on the channel, e.g. by calculating (time_rx - time_bss_rx) or (time_busy - time_bss_rx - time_tx) Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2019-08-28net: sched: act_sample: fix psample group handling on overwriteVlad Buslov1-0/+1
Action sample doesn't properly handle psample_group pointer in overwrite case. Following issues need to be fixed: - In tcf_sample_init() function RCU_INIT_POINTER() is used to set s->psample_group, even though we neither setting the pointer to NULL, nor preventing concurrent readers from accessing the pointer in some way. Use rcu_swap_protected() instead to safely reset the pointer. - Old value of s->psample_group is not released or deallocated in any way, which results resource leak. Use psample_group_put() on non-NULL value obtained with rcu_swap_protected(). - The function psample_group_put() that released reference to struct psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu grace period when deallocating it. Extend struct psample_group with rcu head and use kfree_rcu when freeing it. Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-28ipv6: shrink struct ipv6_mc_socklistEric Dumazet1-1/+1
Remove two holes on 64bit arches, to bring the size to one cache line exactly. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-27sctp: make ecn flag per netns and endpointXin Long2-1/+5
This patch is to add ecn flag for both netns_sctp and sctp_endpoint, net->sctp.ecn_enable is set 1 by default, and ep->ecn_enable will be initialized with net->sctp.ecn_enable. asoc->peer.ecn_capable will be set during negotiation only when ep->ecn_enable is set on both sides. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-27net: dsa: remove bitmap operationsVivien Didelot1-3/+0
The bitmap operations were introduced to simplify the switch drivers in the future, since most of them could implement the common VLAN and MDB operations (add, del, dump) with simple functions taking all target ports at once, and thus limiting the number of hardware accesses. Programming an MDB or VLAN this way in a single operation would clearly simplify the drivers a lot but would require a new get-set interface in DSA. The usage of such bitmap from the stack also raised concerned in the past, leading to the dynamic allocation of a new ds->_bitmap member in the dsa_switch structure. So let's get rid of them for now. This commit nicely wraps the ds->ops->port_{mdb,vlan}_{prepare,add} switch operations into new dsa_switch_{mdb,vlan}_{prepare,add} variants not using any bitmap argument anymore. New dsa_switch_{mdb,vlan}_match helpers have been introduced to make clear which local port of a switch must be programmed with the target object. While the targeted user port is an obvious candidate, the DSA links must also be programmed, as well as the CPU port for VLANs. While at it, also remove local variables that are only used once. Signed-off-by: Vivien Didelot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-27net_sched: fix a NULL pointer deref in ipt actionCong Wang1-1/+3
The net pointer in struct xt_tgdtor_param is not explicitly initialized therefore is still NULL when dereferencing it. So we have to find a way to pass the correct net pointer to ipt_destroy_target(). The best way I find is just saving the net pointer inside the per netns struct tcf_idrinfo, which could make this patch smaller. Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset") Reported-and-tested-by: [email protected] Cc: Jamal Hadi Salim <[email protected]> Cc: Jiri Pirko <[email protected]> Signed-off-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller4-9/+3
Minor conflict in r8169, bug fix had two versions in net and net-next, take the net-next hunks. Signed-off-by: David S. Miller <[email protected]>
2019-08-27netfilter: nft_dynset: support for element deletionAnder Juaristi1-1/+9
This patch implements the delete operation from the ruleset. It implements a new delete() function in nft_set_rhash. It is simpler to use than the already existing remove(), because it only takes the set and the key as arguments, whereas remove() expects a full nft_set_elem structure. Signed-off-by: Ander Juaristi <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-08-26net: sched: copy tunnel info when setting flow_action entry->tunnelVlad Buslov1-0/+17
In order to remove dependency on rtnl lock, modify tc_setup_flow_action() to copy tunnel info, instead of just saving pointer to tunnel_key action tunnel info. This is necessary to prevent concurrent action overwrite from releasing tunnel info while it is being used by rtnl-unlocked driver. Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info with all its options to dynamically allocated memory block. Modify tc_cleanup_flow_action() to free dynamically allocated tunnel info. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: take reference to action dev before calling offloadsVlad Buslov1-0/+2
In order to remove dependency on rtnl lock when calling hardware offload API, take reference to action mirred dev when initializing flow_action structure in tc_setup_flow_action(). Implement function tc_cleanup_flow_action(), use it to release the device after hardware offload API is done using it. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: take rtnl lock in tc_setup_flow_action()Vlad Buslov1-1/+1
In order to allow using new flow_action infrastructure from unlocked classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held' argument. Take rtnl lock before accessing tc_action data. This is necessary to protect from concurrent action replace. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: add API for registering unlocked offload block callbacksVlad Buslov2-0/+2
Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow registering and unregistering block hardware offload callbacks that do not require caller to hold rtnl lock. Extend tcf_block with additional lockeddevcnt counter that is incremented for each non-unlocked driver callback attached to device. This counter is necessary to conditionally obtain rtnl lock before calling hardware callbacks in following patches. Register mlx5 tc block offload callbacks as "unlocked". Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: notify classifier on successful offload add/deleteVlad Buslov1-0/+4
To remove dependency on rtnl lock, extend classifier ops with new ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while holding cb_lock every time filter if successfully added to or deleted from hardware. Implement the new API in flower classifier. Use it to manage hw_filters list under cb_lock protection, instead of relying on rtnl lock to synchronize with concurrent fl_reoffload() call. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: refactor block offloads counter usageVlad Buslov2-32/+16
Without rtnl lock protection filters can no longer safely manage block offloads counter themselves. Refactor cls API to protect block offloadcnt with tcf_block->cb_lock that is already used to protect driver callback list and nooffloaddevcnt counter. The counter can be modified by concurrent tasks by new functions that execute block callbacks (which is safe with previous patch that changed its type to atomic_t), however, block bind/unbind code that checks the counter value takes cb_lock in write mode to exclude any concurrent modifications. This approach prevents race conditions between bind/unbind and callback execution code but allows for concurrency for tc rule update path. Move block offload counter, filter in hardware counter and filter flags management from classifiers into cls hardware offloads API. Make functions tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API private. Implement following new cls API to be used instead: tc_setup_cb_add() - non-destructive filter add. If filter that wasn't already in hardware is successfully offloaded, increment block offloads counter, set filter in hardware counter and flag. On failure, previously offloaded filter is considered to be intact and offloads counter is not decremented. tc_setup_cb_replace() - destructive filter replace. Release existing filter block offload counter and reset its in hardware counter and flag. Set new filter in hardware counter and flag. On failure, previously offloaded filter is considered to be destroyed and offload counter is decremented. tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block offloads counter. tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and call tc_cls_offload_cnt_update() if cb() didn't return an error. Refactor all offload-capable classifiers to atomically offload filters to hardware, change block offload counter, and set filter in hardware counter and flag by means of the new cls API functions. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: change tcf block offload counter type to atomic_tVlad Buslov1-3/+4
As a preparation for running proto ops functions without rtnl lock, change offload counter type to atomic. This is necessary to allow updating the counter by multiple concurrent users when offloading filters to hardware from unlocked classifiers. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26net: sched: protect block offload-related fields with rw_semaphoreVlad Buslov1-0/+2
In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock' rwsem and use it to protect flow_block->cb_list and related counters from concurrent modification. The lock is taken in read mode for read-only traversal of cb_list in tc_setup_cb_call() and write mode in all other cases. This approach ensures that: - cb_list is not changed concurrently while filters is being offloaded on block. - block->nooffloaddevcnt is checked while holding the lock in read mode, but is only changed by bind/unbind code when holding the cb_lock in write mode to prevent concurrent modification. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-26netfilter: nf_tables: Introduce new 64-bit helper register functionsAnder Juaristi1-7/+18
Introduce new helper functions to load/store 64-bit values onto/from registers: - nft_reg_store64 - nft_reg_load64 This commit also re-orders all these helpers from smallest to largest target bit size. Signed-off-by: Ander Juaristi <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-08-25nexthop: Fix nexthop_num_path for blackhole nexthopsDavid Ahern1-6/+0
Donald reported this sequence: ip next add id 1 blackhole ip next add id 2 blackhole ip ro add 1.1.1.1/32 nhid 1 ip ro add 1.1.1.2/32 nhid 2 would cause a crash. Backtrace is: [ 151.302790] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 151.304043] CPU: 1 PID: 277 Comm: ip Not tainted 5.3.0-rc5+ #37 [ 151.305078] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 [ 151.306526] RIP: 0010:fib_add_nexthop+0x8b/0x2aa [ 151.307343] Code: 35 f7 81 48 8d 14 01 c7 02 f1 f1 f1 f1 c7 42 04 01 f4 f4 f4 48 89 f2 48 c1 ea 03 65 48 8b 0c 25 28 00 00 00 48 89 4d d0 31 c9 <80> 3c 02 00 74 08 48 89 f7 e8 1a e8 53 ff be 08 00 00 00 4c 89 e7 [ 151.310549] RSP: 0018:ffff888116c27340 EFLAGS: 00010246 [ 151.311469] RAX: dffffc0000000000 RBX: ffff8881154ece00 RCX: 0000000000000000 [ 151.312713] RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffff888115649b40 [ 151.313968] RBP: ffff888116c273d8 R08: ffffed10221e3757 R09: ffff888110f1bab8 [ 151.315212] R10: 0000000000000001 R11: ffff888110f1bab3 R12: ffff888115649b40 [ 151.316456] R13: 0000000000000020 R14: ffff888116c273b0 R15: ffff888115649b40 [ 151.317707] FS: 00007f60b4d8d800(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000 [ 151.319113] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.320119] CR2: 0000555671ffdc00 CR3: 00000001136ba005 CR4: 0000000000020ee0 [ 151.321367] Call Trace: [ 151.321820] ? fib_nexthop_info+0x635/0x635 [ 151.322572] fib_dump_info+0xaa4/0xde0 [ 151.323247] ? fib_create_info+0x2431/0x2431 [ 151.324008] ? napi_alloc_frag+0x2a/0x2a [ 151.324711] rtmsg_fib+0x2c4/0x3be [ 151.325339] fib_table_insert+0xe2f/0xeee ... fib_dump_info incorrectly has nhs = 0 for blackhole nexthops, so it believes the nexthop object is a multipath group (nhs != 1) and ends up down the nexthop_mpath_fill_node() path which is wrong for a blackhole. The blackhole check in nexthop_num_path is leftover from early days of the blackhole implementation which did not initialize the device. In the end the design was simpler (fewer special case checks) to set the device to loopback in nh_info, so the check in nexthop_num_path should have been removed. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Reported-by: Donald Sharp <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-24net: route dump netlink NLM_F_MULTI flag missingJohn Fastabend1-1/+1
An excerpt from netlink(7) man page, In multipart messages (multiple nlmsghdr headers with associated payload in one byte stream) the first and all following headers have the NLM_F_MULTI flag set, except for the last header which has the type NLMSG_DONE. but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a FIB dump. The result is user space applications following above man page excerpt may get confused and may stop parsing msg believing something went wrong. In the golang netlink lib [0] the library logic stops parsing believing the message is not a multipart message. Found this running Cilium[1] against net-next while adding a feature to auto-detect routes. I noticed with multiple route tables we no longer could detect the default routes on net tree kernels because the library logic was not returning them. Fix this by handling the fib_dump_info_fnhe() case the same way the fib_dump_info() handles it by passing the flags argument through the call chain and adding a flags argument to rt_fill_info(). Tested with Cilium stack and auto-detection of routes works again. Also annotated libs to dump netlink msgs and inspected NLM_F_MULTI and NLMSG_DONE flags look correct after this. Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same as is done for fib_dump_info() so this looks correct to me. [0] https://github.com/vishvananda/netlink/ [1] https://github.com/cilium/ Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested") Signed-off-by: John Fastabend <[email protected]> Reviewed-by: Stefano Brivio <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-21net/mlx5e: Add tc flower tracepointsDmytro Linkin1-0/+1
Implemented following tracepoints: 1. Configure flower (mlx5e_configure_flower) 2. Delete flower (mlx5e_delete_flower) 3. Stats flower (mlx5e_stats_flower) Usage example: ># cd /sys/kernel/debug/tracing ># echo mlx5:mlx5e_configure_flower >> set_event ># cat trace ... tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT Added corresponding documentation in Documentation/networking/device-driver/mellanox/mlx5.rst Signed-off-by: Dmytro Linkin <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2019-08-21trivial: netns: fix typo in 'struct net.passive' descriptionMike Rapoport1-1/+1
Replace 'decided' with 'decide' so that comment would be /* To decide when the network namespace should be freed. */ Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-21nl80211: Add support for EDMG channelsAlexei Avshalom Lazar1-2/+84
802.11ay specification defines Enhanced Directional Multi-Gigabit (EDMG) STA and AP which allow channel bonding of 2 channels and more. Introduce new NL attributes that are needed for enabling and configuring EDMG support. Two new attributes are used by kernel to publish driver's EDMG capabilities to the userspace: NL80211_BAND_ATTR_EDMG_CHANNELS - bitmap field that indicates the 2.16 GHz channel(s) that are supported by the driver. When this attribute is not set it means driver does not support EDMG. NL80211_BAND_ATTR_EDMG_BW_CONFIG - represent the channel bandwidth configurations supported by the driver. Additional two new attributes are used by the userspace for connect command and for AP configuration: NL80211_ATTR_WIPHY_EDMG_CHANNELS NL80211_ATTR_WIPHY_EDMG_BW_CONFIG New rate info flag - RATE_INFO_FLAGS_EDMG, can be reported from driver and used for bitrate calculation that will take into account EDMG according to the 802.11ay specification. Signed-off-by: Alexei Avshalom Lazar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2019-08-21cfg80211: Support assoc-at timer in sta-infoBen Greear1-0/+2
Report timestamp of when sta became associated. This is the boottime clock, units are nano-seconds. Signed-off-by: Ben Greear <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2019-08-19sctp: add sctp_auth_init and sctp_auth_freeXin Long1-0/+2
This patch is to factor out sctp_auth_init and sctp_auth_free functions, and sctp_auth_init will also be used in the next patch for SCTP_AUTH_SUPPORTED sockopt. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-19sctp: add asconf_enable in struct sctp_endpointXin Long1-0/+1
This patch is to make addip/asconf flag per endpoint, and its value is initialized by the per netns flag, net->sctp.addip_enable. It also replaces the checks of net->sctp.addip_enable with ep->asconf_enable in some places. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-19ipv6: Fix return value of ipv6_mc_may_pull() for malformed packetsStefano Brivio1-1/+1
Commit ba5ea614622d ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls") replaces direct calls to pskb_may_pull() in br_ipv6_multicast_mld2_report() with calls to ipv6_mc_may_pull(), that returns -EINVAL on buffers too short to be valid IPv6 packets, while maintaining the previous handling of the return code. This leads to the direct opposite of the intended effect: if the packet is malformed, -EINVAL evaluates as true, and we'll happily proceed with the processing. Return 0 if the packet is too short, in the same way as this was fixed for IPv4 by commit 083b78a9ed64 ("ip: fix ip_mc_may_pull() return value"). I don't have a reproducer for this, unlike the one referred to by the IPv4 commit, but this is clearly broken. Fixes: ba5ea614622d ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls") Signed-off-by: Stefano Brivio <[email protected]> Acked-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller8-9/+23
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <[email protected]>
2019-08-18netfilter: nf_tables: map basechain priority to hardware priorityPablo Neira Ayuso1-0/+2
This patch adds initial support for offloading basechains using the priority range from 1 to 65535. This is restricting the netfilter priority range to 16-bit integer since this is what most drivers assume so far from tc. It should be possible to extend this range of supported priorities later on once drivers are updated to support for 32-bit integer priorities. Signed-off-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-18net: sched: use major priority number as hardware priorityPablo Neira Ayuso1-1/+1
tc transparently maps the software priority number to hardware. Update it to pass the major priority which is what most drivers expect. Update drivers too so they do not need to lshift the priority field of the flow_cls_common_offload object. The stmmac driver is an exception, since this code assumes the tc software priority is fine, therefore, lshift it just to be conservative. Signed-off-by: Pablo Neira Ayuso <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17xsk: remove AF_XDP socket from map when the socket is releasedBjörn Töpel1-0/+18
When an AF_XDP socket is released/closed the XSKMAP still holds a reference to the socket in a "released" state. The socket will still use the netdev queue resource, and block newly created sockets from attaching to that queue, but no user application can access the fill/complete/rx/tx queues. This results in that all applications need to explicitly clear the map entry from the old "zombie state" socket. This should be done automatically. In this patch, the sockets tracks, and have a reference to, which maps it resides in. When the socket is released, it will remove itself from all maps. Suggested-by: Bruce Richardson <[email protected]> Signed-off-by: Björn Töpel <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-17bpf: support cloning sk storage on accept()Stanislav Fomichev1-0/+10
Add new helper bpf_sk_storage_clone which optionally clones sk storage and call it from sk_clone_lock. Cc: Martin KaFai Lau <[email protected]> Cc: Yonghong Song <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-17xsk: add support for need_wakeup flag in AF_XDP ringsMagnus Karlsson1-1/+32
This commit adds support for a new flag called need_wakeup in the AF_XDP Tx and fill rings. When this flag is set, it means that the application has to explicitly wake up the kernel Rx (for the bit in the fill ring) or kernel Tx (for bit in the Tx ring) processing by issuing a syscall. Poll() can wake up both depending on the flags submitted and sendto() will wake up tx processing only. The main reason for introducing this new flag is to be able to efficiently support the case when application and driver is executing on the same core. Previously, the driver was just busy-spinning on the fill ring if it ran out of buffers in the HW and there were none on the fill ring. This approach works when the application is running on another core as it can replenish the fill ring while the driver is busy-spinning. Though, this is a lousy approach if both of them are running on the same core as the probability of the fill ring getting more entries when the driver is busy-spinning is zero. With this new feature the driver now sets the need_wakeup flag and returns to the application. The application can then replenish the fill queue and then explicitly wake up the Rx processing in the kernel using the syscall poll(). For Tx, the flag is only set to one if the driver has no outstanding Tx completion interrupts. If it has some, the flag is zero as it will be woken up by a completion interrupt anyway. As a nice side effect, this new flag also improves the performance of the case where application and driver are running on two different cores as it reduces the number of syscalls to the kernel. The kernel tells user space if it needs to be woken up by a syscall, and this eliminates many of the syscalls. This flag needs some simple driver support. If the driver does not support this, the Rx flag is always zero and the Tx flag is always one. This makes any application relying on this feature default to the old behaviour of not requiring any syscalls in the Rx path and always having to call sendto() in the Tx path. For backwards compatibility reasons, this feature has to be explicitly turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend that you always turn it on as it so far always have had a positive performance impact. The name and inspiration of the flag has been taken from io_uring by Jens Axboe. Details about this feature in io_uring can be found in http://kernel.dk/io_uring.pdf, section 8.3. Signed-off-by: Magnus Karlsson <[email protected]> Acked-by: Jonathan Lemon <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-17Documentation: Add devlink-trap documentationIdo Schimmel1-0/+6
Add initial documentation of the devlink-trap mechanism, explaining the background, motivation and the semantics of the interface. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17devlink: Add generic packet traps and groupsIdo Schimmel1-0/+40
Add generic packet traps and groups that can report dropped packets as well as exceptions such as TTL error. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17devlink: Add packet trap infrastructureIdo Schimmel1-0/+129
Add the basic packet trap infrastructure that allows device drivers to register their supported packet traps and trap groups with devlink. Each driver is expected to provide basic information about each supported trap, such as name and ID, but also the supported metadata types that will accompany each packet trapped via the trap. The currently supported metadata type is just the input port, but more will be added in the future. For example, output port and traffic class. Trap groups allow users to set the action of all member traps. In addition, users can retrieve per-group statistics in case per-trap statistics are too narrow. In the future, the trap group object can be extended with more attributes, such as policer settings which will limit the amount of traffic generated by member traps towards the CPU. Beside registering their packet traps with devlink, drivers are also expected to report trapped packets to devlink along with relevant metadata. devlink will maintain packets and bytes statistics for each packet trap and will potentially report the trapped packet with its metadata to user space via drop monitor netlink channel. The interface towards the drivers is simple and allows devlink to set the action of the trap. Currently, only two actions are supported: 'trap' and 'drop'. When set to 'trap', the device is expected to provide the sole copy of the packet to the driver which will pass it to devlink. When set to 'drop', the device is expected to drop the packet and not send a copy to the driver. In the future, more actions can be added, such as 'mirror'. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Add basic infrastructure for hardware dropsIdo Schimmel1-0/+33
Export a function that can be invoked in order to report packets that were dropped by the underlying hardware along with metadata. Subsequent patches will add support for the different alert modes. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17Merge branch 'for-upstream' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2019-08-17 Here's a set of Bluetooth fixes for the 5.3-rc series: - Multiple fixes for Qualcomm (btqca & hci_qca) drivers - Minimum encryption key size debugfs setting (this is required for Bluetooth Qualification) - Fix hidp_send_message() to have a meaningful return value ==================== Signed-off-by: David S. Miller <[email protected]>
2019-08-17Bluetooth: Add debug setting for changing minimum encryption key sizeMarcel Holtmann1-0/+1
For testing and qualification purposes it is useful to allow changing the minimum encryption key size value that the host stack is going to enforce. This adds a new debugfs setting min_encrypt_key_size to achieve this functionality. Signed-off-by: Marcel Holtmann <[email protected]> Signed-off-by: Johan Hedberg <[email protected]>
2019-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-2/+7
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This patchset contains Netfilter fixes for net: 1) Extend selftest to cover flowtable with ipsec, from Florian Westphal. 2) Fix interaction of ipsec with flowtable, also from Florian. 3) User-after-free with bound set to rule that fails to load. 4) Adjust state and timeout for flows that expire. 5) Timeout update race with flows in teardown state. 6) Ensure conntrack id hash calculation use invariants as input, from Dirk Morris. 7) Do not push flows into flowtable for TCP fin/rst packets. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-08-13netlink: Fix nlmsg_parse as a wrapper for strict message parsingDavid Ahern1-3/+2
Eric reported a syzbot warning: BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510 CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294 nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510 rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543 rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311 __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439 __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297 entry_SYSCALL_64_after_hwframe+0x63/0xe7 The root cause is nlmsg_parse calling __nla_parse which means the header struct size is not checked. nlmsg_parse should be a wrapper around __nlmsg_parse with NL_VALIDATE_STRICT for the validate argument very much like nlmsg_parse_deprecated is for NL_VALIDATE_LIBERAL. Fixes: 3de6440354465 ("netlink: re-add parse/validate functions in strict mode") Reported-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David Ahern <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski27-4/+121
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Rename mss field to mss_option field in synproxy, from Fernando Mancera. 2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce. 3) More strict validation of IPVS sysctl values, from Junwei Hu. 4) Remove unnecessary spaces after on the right hand side of assignments, from yangxingwu. 5) Add offload support for bitwise operation. 6) Extend the nft_offload_reg structure to store immediate date. 7) Collapse several ip_set header files into ip_set.h, from Jeremy Sowden. 8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y, from Jeremy Sowden. 9) Fix several sparse warnings due to missing prototypes, from Valdis Kletnieks. 10) Use static lock initialiser to ensure connlabel spinlock is initialized on boot time to fix sched/act_ct.c, patch from Florian Westphal. ==================== Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-0/+10
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net-next* tree. There is a small merge conflict in libbpf (Cc Andrii so he's in the loop as well): for (i = 1; i <= btf__get_nr_types(btf); i++) { t = (struct btf_type *)btf__type_by_id(btf, i); if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); <<<<<<< HEAD /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t+1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && kind == BTF_KIND_DATASEC) { ======= t->size = sizeof(int); *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32); } else if (!has_datasec && btf_is_datasec(t)) { >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8 /* replace DATASEC with STRUCT */ Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers"), so we need to pick the sanitation fixup as well as use the new btf_is_datasec() helper and the whitespace cleanup. Looks like the following: [...] if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && btf_is_datasec(t)) { /* replace DATASEC with STRUCT */ [...] The main changes are: 1) Addition of core parts of compile once - run everywhere (co-re) effort, that is, relocation of fields offsets in libbpf as well as exposure of kernel's own BTF via sysfs and loading through libbpf, from Andrii. More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2 and http://vger.kernel.org/lpc-bpf2018.html#session-2 2) Enable passing input flags to the BPF flow dissector to customize parsing and allowing it to stop early similar to the C based one, from Stanislav. 3) Add a BPF helper function that allows generating SYN cookies from XDP and tc BPF, from Petar. 4) Add devmap hash-based map type for more flexibility in device lookup for redirects, from Toke. 5) Improvements to XDP forwarding sample code now utilizing recently enabled devmap lookups, from Jesper. 6) Add support for reporting the effective cgroup progs in bpftool, from Jakub and Takshak. 7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter. 8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan. 9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei. 10) Add perf event output helper also for other skb-based program types, from Allan. 11) Fix a co-re related compilation error in selftests, from Yonghong. ==================== Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-13netfilter: add missing IS_ENABLED(CONFIG_NETFILTER) checks to some header-files.Jeremy Sowden10-0/+36
linux/netfilter.h defines a number of struct and inline function definitions which are only available is CONFIG_NETFILTER is enabled. These structs and functions are used in declarations and definitions in other header-files. Added preprocessor checks to make sure these headers will compile if CONFIG_NETFILTER is disabled. Signed-off-by: Jeremy Sowden <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-08-13netfilter: add missing IS_ENABLED(CONFIG_NF_CONNTRACK) checks to some ↵Jeremy Sowden4-0/+31
header-files. struct nf_conn contains a "struct nf_conntrack ct_general" member and struct net contains a "struct netns_ct ct" member which are both only defined in CONFIG_NF_CONNTRACK is enabled. These members are used in a number of inline functions defined in other header-files. Added preprocessor checks to make sure the headers will compile if CONFIG_NF_CONNTRACK is disabled. Signed-off-by: Jeremy Sowden <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-08-13netfilter: add missing IS_ENABLED(CONFIG_NF_TABLES) check to header-file.Jeremy Sowden1-0/+4
nf_tables.h defines an API comprising several inline functions and macros that depend on the nft member of struct net. However, this is only defined is CONFIG_NF_TABLES is enabled. Added preprocessor checks to ensure that nf_tables.h will compile if CONFIG_NF_TABLES is disabled. Signed-off-by: Jeremy Sowden <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>