aboutsummaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2019-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2-6/+16
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <[email protected]>
2019-08-17bpf: support cloning sk storage on accept()Stanislav Fomichev2-6/+107
Add new helper bpf_sk_storage_clone which optionally clones sk storage and call it from sk_clone_lock. Cc: Martin KaFai Lau <[email protected]> Cc: Yonghong Song <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-17net: Don't call XDP_SETUP_PROG when nothing is changedMaxim Mikityanskiy1-2/+13
Don't uninstall an XDP program when none is installed, and don't install an XDP program that has the same ID as the one already installed. dev_change_xdp_fd doesn't perform any checks in case it uninstalls an XDP program. It means that the driver's ndo_bpf can be called with XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This case happens if the user runs `ip link set eth0 xdp off` when there is no XDP program attached. The symmetrical case is possible when the user tries to set the program that is already set. The drivers typically perform some heavy operations on XDP_SETUP_PROG, so they all have to handle these cases internally to return early if they happen. This patch puts this check into the kernel code, so that all drivers will benefit from it. Signed-off-by: Maxim Mikityanskiy <[email protected]> Acked-by: Jonathan Lemon <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-17devlink: Add generic packet traps and groupsIdo Schimmel1-0/+12
Add generic packet traps and groups that can report dropped packets as well as exceptions such as TTL error. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17devlink: Add packet trap infrastructureIdo Schimmel1-5/+1063
Add the basic packet trap infrastructure that allows device drivers to register their supported packet traps and trap groups with devlink. Each driver is expected to provide basic information about each supported trap, such as name and ID, but also the supported metadata types that will accompany each packet trapped via the trap. The currently supported metadata type is just the input port, but more will be added in the future. For example, output port and traffic class. Trap groups allow users to set the action of all member traps. In addition, users can retrieve per-group statistics in case per-trap statistics are too narrow. In the future, the trap group object can be extended with more attributes, such as policer settings which will limit the amount of traffic generated by member traps towards the CPU. Beside registering their packet traps with devlink, drivers are also expected to report trapped packets to devlink along with relevant metadata. devlink will maintain packets and bytes statistics for each packet trap and will potentially report the trapped packet with its metadata to user space via drop monitor netlink channel. The interface towards the drivers is simple and allows devlink to set the action of the trap. Currently, only two actions are supported: 'trap' and 'drop'. When set to 'trap', the device is expected to provide the sole copy of the packet to the driver which will pass it to devlink. When set to 'drop', the device is expected to drop the packet and not send a copy to the driver. In the future, more actions can be added, such as 'mirror'. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Allow user to start monitoring hardware dropsIdo Schimmel1-3/+121
Drop monitor has start and stop commands, but so far these were only used to start and stop monitoring of software drops. Now that drop monitor can also monitor hardware drops, we should allow the user to control these as well. Do that by adding SW and HW flags to these commands. If no flag is specified, then only start / stop monitoring software drops. This is done in order to maintain backward-compatibility with existing user space applications. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Add support for summary alert mode for hardware dropsIdo Schimmel1-2/+193
In summary alert mode a notification is sent with a list of recent drop reasons and a count of how many packets were dropped due to this reason. To avoid expensive operations in the context in which packets are dropped, each CPU holds an array whose number of entries is the maximum number of drop reasons that can be encoded in the netlink notification. Each entry stores the drop reason and a count. When a packet is dropped the array is traversed and a new entry is created or the count of an existing entry is incremented. Later, in process context, the array is replaced with a newly allocated copy and the old array is encoded in a netlink notification. To avoid breaking user space, the notification includes the ancillary header, which is 'struct net_dm_alert_msg' with number of entries set to '0'. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Add support for packet alert mode for hardware dropsIdo Schimmel1-4/+300
In a similar fashion to software drops, extend drop monitor to send netlink events when packets are dropped by the underlying hardware. The main difference is that instead of encoding the program counter (PC) from which kfree_skb() was called in the netlink message, we encode the hardware trap name. The two are mostly equivalent since they should both help the user understand why the packet was dropped. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Consider all monitoring states before performing configurationIdo Schimmel1-2/+7
The drop monitor configuration (e.g., alert mode) is global, but user will be able to enable monitoring of only software or hardware drops. Therefore, ensure that monitoring of both software and hardware drops are disabled before allowing drop monitor configuration to take place. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Add basic infrastructure for hardware dropsIdo Schimmel1-0/+28
Export a function that can be invoked in order to report packets that were dropped by the underlying hardware along with metadata. Subsequent patches will add support for the different alert modes. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Initialize hardware per-CPU dataIdo Schimmel1-2/+23
Like software drops, hardware drops also need the same type of per-CPU data. Therefore, initialize it during module initialization and de-initialize it during module exit. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-17drop_monitor: Move per-CPU data init/fini to separate functionsIdo Schimmel1-17/+36
Currently drop monitor only reports software drops to user space, but subsequent patches are going to add support for hardware drops. Like software drops, the per-CPU data of hardware drops needs to be initialized and de-initialized upon module initialization and exit. To avoid code duplication, break this code into separate functions, so that these could be re-used for hardware drops. No functional changes intended. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-15page_pool: fix logic in __page_pool_get_cachedJonathan Lemon1-23/+16
__page_pool_get_cached() will return NULL when the ring is empty, even if there are pages present in the lookaside cache. It is also possible to refill the cache, and then return a NULL page. Restructure the logic so eliminate both cases. Signed-off-by: Jonathan Lemon <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Ilias Apalodimas <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-15page_pool: remove unnecessary variable initYunsheng Lin1-1/+1
Remove variable initializations in functions that are followed by assignments before use Signed-off-by: Yunsheng Lin <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-13net: devlink: remove redundant rtnl lock assertVlad Buslov1-3/+2
It is enough for caller of devlink_compat_switch_id_get() to hold the net device to guarantee that devlink port is not destroyed concurrently. Remove rtnl lock assertion and modify comment to warn user that they must hold either rtnl lock or reference to net device. This is necessary to accommodate future implementation of rtnl-unlocked TC offloads driver callbacks. Signed-off-by: Vlad Buslov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2-4/+105
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net-next* tree. There is a small merge conflict in libbpf (Cc Andrii so he's in the loop as well): for (i = 1; i <= btf__get_nr_types(btf); i++) { t = (struct btf_type *)btf__type_by_id(btf, i); if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); <<<<<<< HEAD /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t+1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && kind == BTF_KIND_DATASEC) { ======= t->size = sizeof(int); *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32); } else if (!has_datasec && btf_is_datasec(t)) { >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8 /* replace DATASEC with STRUCT */ Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers"), so we need to pick the sanitation fixup as well as use the new btf_is_datasec() helper and the whitespace cleanup. Looks like the following: [...] if (!has_datasec && btf_is_var(t)) { /* replace VAR with INT */ t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0); /* * using size = 1 is the safest choice, 4 will be too * big and cause kernel BTF validation failure if * original variable took less than 4 bytes */ t->size = 1; *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8); } else if (!has_datasec && btf_is_datasec(t)) { /* replace DATASEC with STRUCT */ [...] The main changes are: 1) Addition of core parts of compile once - run everywhere (co-re) effort, that is, relocation of fields offsets in libbpf as well as exposure of kernel's own BTF via sysfs and loading through libbpf, from Andrii. More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2 and http://vger.kernel.org/lpc-bpf2018.html#session-2 2) Enable passing input flags to the BPF flow dissector to customize parsing and allowing it to stop early similar to the C based one, from Stanislav. 3) Add a BPF helper function that allows generating SYN cookies from XDP and tc BPF, from Petar. 4) Add devmap hash-based map type for more flexibility in device lookup for redirects, from Toke. 5) Improvements to XDP forwarding sample code now utilizing recently enabled devmap lookups, from Jesper. 6) Add support for reporting the effective cgroup progs in bpftool, from Jakub and Takshak. 7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter. 8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan. 9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei. 10) Add perf event output helper also for other skb-based program types, from Allan. 11) Fix a co-re related compilation error in selftests, from Yonghong. ==================== Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-13devlink: send notifications for deleted snapshots on region destroyJiri Pirko1-11/+12
Currently the notifications for deleted snapshots are sent only in case user deletes a snapshot manually. Send the notifications in case region is destroyed too. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2019-08-11drop_monitor: Expose tail drop counterIdo Schimmel1-0/+101
Previous patch made the length of the per-CPU skb drop list configurable. Expose a counter that shows how many packets could not be enqueued to this list. This allows users determine the desired queue length. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Make drop queue length configurableIdo Schimmel1-3/+16
In packet alert mode, each CPU holds a list of dropped skbs that need to be processed in process context and sent to user space. To avoid exhausting the system's memory the maximum length of this queue is currently set to 1000. Allow users to tune the length of this queue according to their needs. The configured length is reported to user space when drop monitor configuration is queried. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Add a command to query current configurationIdo Schimmel1-0/+48
Users should be able to query the current configuration of drop monitor before they start using it. Add a command to query the existing configuration which currently consists of alert mode and packet truncation length. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Allow truncation of dropped packetsIdo Schimmel1-0/+19
When sending dropped packets to user space it is not always necessary to copy the entire packet as usually only the headers are of interest. Allow user to specify the truncation length and add the original length of the packet as additional metadata to the netlink message. By default no truncation is performed. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Add packet alert modeIdo Schimmel1-2/+278
So far drop monitor supported only one alert mode in which a summary of locations in which packets were recently dropped was sent to user space. This alert mode is sufficient in order to understand that packets were dropped, but lacks information to perform a more detailed analysis. Add a new alert mode in which the dropped packet itself is passed to user space along with metadata: The drop location (as program counter and resolved symbol), ingress netdevice and drop timestamp. More metadata can be added in the future. To avoid performing expensive operations in the context in which kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and queued on per-CPU skb drop list. Then, in process context the netlink message is allocated, prepared and finally sent to user space. The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting the system's memory. Subsequent patches will make this limit configurable and also add a counter that indicates how many skbs were tail dropped. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Add alert mode operationsIdo Schimmel1-6/+32
The next patch is going to add another alert mode in which the dropped packet is notified to user space, instead of only a summary of recent drops. Abstract the differences between the modes by adding alert mode operations. The operations are selected based on the currently configured mode and associated with the probes and the work item just before tracing starts. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Require CAP_NET_ADMIN for drop monitor configurationIdo Schimmel1-0/+1
Currently, the configure command does not do anything but return an error. Subsequent patches will enable the command to change various configuration options such as alert mode and packet truncation. Similar to other netlink-based configuration channels, make sure only users with the CAP_NET_ADMIN capability set can execute this command. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Reset per-CPU data before starting to traceIdo Schimmel1-3/+7
The function reset_per_cpu_data() allocates and prepares a new skb for the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is stored in the per-CPU 'data' variable and the old is returned. The function is invoked during module initialization and from the workqueue, before an alert is sent. This means that it is possible to receive an alert with stale data, if we stopped tracing when the hysteresis timer ('data->send_timer') was pending. Instead of invoking the function during module initialization, invoke it just before we start tracing and ensure we get a fresh skb. This also allows us to remove the calls to initialize the timer and the work item from the module initialization path, since both could have been triggered by the error paths of reset_per_cpu_data(). Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Initialize timer and work item upon tracing enableIdo Schimmel1-5/+19
The timer and work item are currently initialized once during module init, but subsequent patches will need to associate different functions with the work item, based on the configured alert mode. Allow subsequent patches to make that change by initializing and de-initializing these objects during tracing enable and disable. This also guarantees that once the request to disable tracing returns, no more netlink notifications will be generated. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-11drop_monitor: Split tracing enable / disable to different functionsIdo Schimmel1-28/+51
Subsequent patches will need to enable / disable tracing based on the configured alerting mode. Reduce the nesting level and prepare for the introduction of this functionality by splitting the tracing enable / disable operations into two different functions. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-09sock: make cookie generation global instead of per netnsDaniel Borkmann1-1/+2
Generating and retrieving socket cookies are a useful feature that is exposed to BPF for various program types through bpf_get_socket_cookie() helper. The fact that the cookie counter is per netns is quite a limitation for BPF in practice in particular for programs in host namespace that use socket cookies as part of a map lookup key since they will be causing socket cookie collisions e.g. when attached to BPF cgroup hooks or cls_bpf on tc egress in host namespace handling container traffic from veth or ipvlan devices with peer in different netns. Change the counter to be global instead. Socket cookie consumers must assume the value as opqaue in any case. Not every socket must have a cookie generated and knowledge of the counter value itself does not provide much value either way hence conversion to global is fine. Signed-off-by: Daniel Borkmann <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Martynas Pumputis <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-09devlink: remove pointless data_len arg from region snapshot createJiri Pirko1-6/+3
The size of the snapshot has to be the same as the size of the region, therefore no need to pass it again during snapshot creation. Remove the arg and use region->size instead. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-08net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski1-5/+14
sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Willem de Bruijn <[email protected]> Reviewed-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-08flow_offload: support get multi-subsystem blockwenxu1-13/+38
It provide a callback list to find the blocks of tc and nft subsystems Signed-off-by: wenxu <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-08flow_offload: move tc indirect block to flow offloadwenxu1-0/+215
move tc indirect block to flow_offload and rename it to flow indirect block.The nf_tables can use the indr block architecture. Signed-off-by: wenxu <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-08net: use listified RX for handling GRO_NORMAL skbsEdward Cree2-3/+49
When GRO decides not to coalesce a packet, in napi_frags_finish(), instead of passing it to the stack immediately, place it on a list in the napi struct. Then, at flush time (napi_complete_done(), napi_poll(), or napi_busy_loop()), call netif_receive_skb_list_internal() on the list. We'd like to do that in napi_gro_flush(), but it's not called if !napi->gro_bitmask, so we have to do it in the callers instead. (There are a handful of drivers that call napi_gro_flush() themselves, but it's not clear why, or whether this will affect them.) Because a full 64 packets is an inefficiently large batch, also consume the list whenever it exceeds gro_normal_batch, a new net/core sysctl that defaults to 8. Signed-off-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller4-15/+31
Just minor overlapping changes in the conflicts here. Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Use pre_doit / post_doit hooksIdo Schimmel1-7/+17
Each operation from user space should be protected by the global drop monitor mutex. Use the pre_doit / post_doit hooks to take / release the lock instead of doing it explicitly in each function. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Add extack supportIdo Schimmel1-3/+7
Add various extack messages to make drop_monitor more user friendly. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Avoid multiple blank linesIdo Schimmel1-2/+0
Remove multiple blank lines which are visually annoying and useless. This suppresses the "Please don't use multiple blank lines" checkpatch messages. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Document scope of spinlockIdo Schimmel1-1/+1
While 'per_cpu_dm_data' is a per-CPU variable, its 'skb' and 'send_timer' fields can be accessed concurrently by the CPU sending the netlink notification to user space from the workqueue and the CPU tracing kfree_skb(). This spinlock is meant to protect against that. Document its scope and suppress the checkpatch message "spinlock_t definition without comment". Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Rename and document scope of mutexIdo Schimmel1-7/+13
The 'trace_state_mutex' does not only protect the global 'trace_state' variable, but also the global 'hw_stats_list'. Subsequent patches are going add more operations from user space to drop_monitor and these all need to be mutually exclusive. Rename 'trace_state_mutex' to the more fitting 'net_dm_mutex' name and document its scope. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-06drop_monitor: Use correct error codeIdo Schimmel1-2/+2
The error code 'ENOTSUPP' is reserved for use with NFS. Use 'EOPNOTSUPP' instead. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-05net: fix bpf_xdp_adjust_head regression for generic-XDPJesper Dangaard Brouer1-5/+10
When generic-XDP was moved to a later processing step by commit 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") a regression was introduced when using bpf_xdp_adjust_head. The issue is that after this commit the skb->network_header is now changed prior to calling generic XDP and not after. Thus, if the header is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header also need to be updated again. Fix by calling skb_reset_network_header(). Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.") Reported-by: Brandon Cazander <[email protected]> Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-08-01hrtimer/treewide: Use hrtimer_sleeper_start_expires()Thomas Gleixner1-1/+1
hrtimer_sleepers will gain a scheduling class dependent treatment on PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make that possible. Signed-off-by: Thomas Gleixner <[email protected]>
2019-08-01hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() callsSebastian Andrzej Siewior1-3/+1
hrtimer_init_sleeper() calls require prior initialisation of the hrtimer object which is embedded into the hrtimer_sleeper. Combine the initialization and spare a function call. Fixup all call sites. This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper specific initializations of the embedded hrtimer without modifying any of the call sites. No functional change. [ anna-maria: Minor cleanups ] [ tglx: Adopted to the removal of the task argument of hrtimer_init_sleeper() and trivial polishing. Folded a fix from Stephen Rothwell for the vsoc code ] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Anna-Maria Gleixner <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-07-30bpf: add bpf_tcp_gen_syncookie helperPetar Penkov1-0/+73
This helper function allows BPF programs to try to generate SYN cookies, given a reference to a listener socket. The function works from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a socket in both cases. Signed-off-by: Petar Penkov <[email protected]> Suggested-by: Eric Dumazet <[email protected]> Reviewed-by: Lorenz Bauer <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-07-30hrtimer: Remove task argument from hrtimer_init_sleeper()Thomas Gleixner1-1/+1
All callers hand in 'current' and that's the only task pointer which actually makes sense. Remove the task argument and set current in the function. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-07-30net: Use skb_frag_off accessorsJonathan Lemon4-31/+33
Use accessor functions for skb fragment's page_offset instead of direct references, in preparation for bvec conversion. Signed-off-by: Jonathan Lemon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-07-29xdp: Add devmap_hash map type for looking up devices by hashed indexToke Høiland-Jørgensen1-2/+7
A common pattern when using xdp_redirect_map() is to create a device map where the lookup key is simply ifindex. Because device maps are arrays, this leaves holes in the map, and the map has to be sized to fit the largest ifindex, regardless of how many devices actually are actually needed in the map. This patch adds a second type of device map where the key is looked up using a hashmap, instead of being used as an array index. This allows maps to be densely packed, so they can be smaller. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-07-29net: fix ifindex collision during namespace removalJiri Pirko1-0/+2
Commit aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.") introduced a possibility to hit a BUG in case device is returning back to init_net and two following conditions are met: 1) dev->ifindex value is used in a name of another "dev%d" device in init_net. 2) dev->name is used by another device in init_net. Under real life circumstances this is hard to get. Therefore this has been present happily for over 10 years. To reproduce: $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff 3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff $ ip netns add ns1 $ ip -n ns1 link add dummy1ns1 type dummy $ ip -n ns1 link add dummy2ns1 type dummy $ ip link set enp0s2 netns ns1 $ ip -n ns1 link set enp0s2 name dummy0 [ 100.858894] virtio_net virtio0 dummy0: renamed from enp0s2 $ ip link add dev4 type dummy $ ip -n ns1 a 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff 3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff 4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff $ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff 4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff $ ip netns del ns1 [ 158.717795] default_device_exit: failed to move dummy0 to init_net: -17 [ 158.719316] ------------[ cut here ]------------ [ 158.720591] kernel BUG at net/core/dev.c:9824! [ 158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI [ 158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18 [ 158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 [ 158.727508] Workqueue: netns cleanup_net [ 158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f [ 158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e [ 158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282 [ 158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000 [ 158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64 [ 158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c [ 158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000 [ 158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72 [ 158.750638] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000 [ 158.752944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0 [ 158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 158.762758] Call Trace: [ 158.763882] ? dev_change_net_namespace+0xbb0/0xbb0 [ 158.766148] ? devlink_nl_cmd_set_doit+0x520/0x520 [ 158.768034] ? dev_change_net_namespace+0xbb0/0xbb0 [ 158.769870] ops_exit_list.isra.0+0xa8/0x150 [ 158.771544] cleanup_net+0x446/0x8f0 [ 158.772945] ? unregister_pernet_operations+0x4a0/0x4a0 [ 158.775294] process_one_work+0xa1a/0x1740 [ 158.776896] ? pwq_dec_nr_in_flight+0x310/0x310 [ 158.779143] ? do_raw_spin_lock+0x11b/0x280 [ 158.780848] worker_thread+0x9e/0x1060 [ 158.782500] ? process_one_work+0x1740/0x1740 [ 158.784454] kthread+0x31b/0x420 [ 158.786082] ? __kthread_create_on_node+0x3f0/0x3f0 [ 158.788286] ret_from_fork+0x3a/0x50 [ 158.789871] ---[ end trace defd6c657c71f936 ]--- [ 158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f [ 158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e [ 158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282 [ 158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000 [ 158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64 [ 158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c [ 158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000 [ 158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72 [ 158.829899] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000 [ 158.834923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0 [ 158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fix this by checking if a device with the same name exists in init_net and fallback to original code - dev%d to allocate name - in case it does. This was found using syzkaller. Fixes: aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.") Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-07-27net: neigh: remove redundant assignment to variable bucketColin Ian King1-1/+1
The variable bucket is being initialized with a value that is never read and it is being updated later with a new value in a following for-loop. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-07-25bpf/flow_dissector: support ipv6 flow_label and ↵Stanislav Fomichev1-0/+9
BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL Add support for exporting ipv6 flow label via bpf_flow_keys. Export flow label from bpf_flow.c and also return early when BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed. Acked-by: Petar Penkov <[email protected]> Acked-by: Willem de Bruijn <[email protected]> Acked-by: Song Liu <[email protected]> Cc: Song Liu <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Petar Penkov <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>