aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-12-03Merge tag 'net-5.10-rc7' of ↵Linus Torvalds59-199/+496
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc7, including fixes from bpf, netfilter, wireless drivers, wireless mesh and can. Current release - regressions: - mt76: usb: fix crash on device removal Current release - always broken: - xsk: Fix umem cleanup from wrong context in socket destruct Previous release - regressions: - net: ip6_gre: set dev->hard_header_len when using header_ops - ipv4: Fix TOS mask in inet_rtm_getroute() - net, xsk: Avoid taking multiple skbuff references Previous release - always broken: - net/x25: prevent a couple of overflows - netfilter: ipset: prevent uninit-value in hash_ip6_add - geneve: pull IP header before ECN decapsulation - mpls: ensure LSE is pullable in TC and openvswitch paths - vxlan: respect needed_headroom of lower device - batman-adv: Consider fragmentation for needed packet headroom - can: drivers: don't count arbitration loss as an error - netfilter: bridge: reset skb->pkt_type after POST_ROUTING traversal - inet_ecn: Fix endianness of checksum update when setting ECT(1) - ibmvnic: fix various corner cases around reset handling - net/mlx5: fix rejecting unsupported Connect-X6DX SW steering - net/mlx5: Enforce HW TX csum offload with kTLS" * tag 'net-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steering net/mlx5e: kTLS, Enforce HW TX csum offload with kTLS net: mlx5e: fix fs_tcp.c build when IPV6 is not enabled net/mlx5: Fix wrong address reclaim when command interface is down net/sched: act_mpls: ensure LSE is pullable before reading it net: openvswitch: ensure LSE is pullable before reading it net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl net: mvpp2: Fix error return code in mvpp2_open() chelsio/chtls: fix a double free in chtls_setkey() rtw88: debug: Fix uninitialized memory in debugfs code vxlan: fix error return code in __vxlan_dev_create() net: pasemi: fix error return code in pasemi_mac_open() cxgb3: fix error return code in t3_sge_alloc_qset() net/x25: prevent a couple of overflows dpaa_eth: copy timestamp fields to new skb in A-050385 workaround net: ip6_gre: set dev->hard_header_len when using header_ops mt76: usb: fix crash on device removal iwlwifi: pcie: add some missing entries for AX210 iwlwifi: pcie: invert values of NO_160 device config entries iwlwifi: pcie: add one missing entry for AX210 ...
2020-12-03Merge branch 'mptcp-reject-invalid-mp_join-requests-right-away'Jakub Kicinski15-50/+91
Florian Westphal says: ==================== mptcp: reject invalid mp_join requests right away At the moment MPTCP can detect an invalid join request (invalid token, max number of subflows reached, and so on) right away but cannot reject the connection until the 3WHS has completed. Instead the connection will complete and the subflow is reset afterwards. To send the reset most information is already available, but we don't have good spot where the reset could be sent: 1. The ->init_req callback is too early and also doesn't allow to return an error that could be used to inform the TCP stack that the SYN should be dropped. 2. The ->route_req callback lacks the skb needed to send a reset. 3. The ->send_synack callback is the best fit from the available hooks, but its called after the request socket has been inserted into the queue already. This means we'd have to remove it again right away. From a technical point of view, the second hook would be best: 1. Its before insertion into listener queue. 2. If it returns NULL TCP will drop the packet for us. Problem is that we'd have to pass the skb to the function just for MPTCP. Paolo suggested to merge init_req and route_req callbacks instead: This makes all info available to MPTCP -- a return value of NULL drops the packet and MPTCP can send the reset if needed. Because 'route_req' has a 'const struct sock *', this means either removal of const qualifier, or a bit of code churn to pass 'const' in security land. This does the latter; I did not find any spots that need write access to struct sock. To recap, the two alternatives are: 1. Solve it entirely in MPTCP: use the ->send_synack callback to unlink the request socket from the listener & drop it. 2. Avoid 'security' churn by removing the const qualifier. ==================== Link: https://lore.kernel.org/r/20201130153631.21872-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03mptcp: emit tcp reset when a join request failsFlorian Westphal1-11/+36
RFC 8684 says: If the token is unknown or the host wants to refuse subflow establishment (for example, due to a limit on the number of subflows it will permit), the receiver will send back a reset (RST) signal, analogous to an unknown port in TCP, containing an MP_TCPRST option (Section 3.6) with an "MPTCP specific error" reason code. mptcp-next doesn't support MP_TCPRST yet, this can be added in another change. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03tcp: merge 'init_req' and 'route_req' functionsFlorian Westphal5-28/+44
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send a TCP reset if the token in a MP_JOIN request is unknown. At this time we don't do this, the 3whs completes and the 'new subflow' is reset afterwards. There are two ways to allow MPTCP to send the reset. 1. override 'send_synack' callback and emit the rst from there. The drawback is that the request socket gets inserted into the listeners queue just to get removed again right away. 2. Send the reset from the 'route_req' function instead. This avoids the 'add&remove request socket', but route_req lacks the skb that is required to send the TCP reset. Instead of just adding the skb to that function for MPTCP sake alone, Paolo suggested to merge init_req and route_req functions. This saves one indirection from syn processing path and provides the skb to the merged function at the same time. 'send reset on unknown mptcp join token' is added in next patch. Suggested-by: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03security: add const qualifier to struct sock in various placesFlorian Westphal10-15/+15
A followup change to tcp_request_sock_op would have to drop the 'const' qualifier from the 'route_req' function as the 'security_inet_conn_request' call is moved there - and that function expects a 'struct sock *'. However, it turns out its also possible to add a const qualifier to security_inet_conn_request instead. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03samples/bpf: Fix spelling mistake "recieving" -> "receiving"Colin Ian King1-1/+1
There is a spelling mistake in an error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20201203114452.1060017-1-colin.king@canonical.com
2020-12-03bpf: Fix cold build of test_progs-no_alu32Brendan Jackman1-1/+1
This object lives inside the trunner output dir, i.e. tools/testing/selftests/bpf/no_alu32/btf_data.o At some point it gets copied into the parent directory during another part of the build, but that doesn't happen when building test_progs-no_alu32 from clean. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20201203120850.859170-1-jackmanb@google.com
2020-12-03libbpf: Cap retries in sys_bpf_prog_loadStanislav Fomichev1-1/+2
I've seen a situation, where a process that's under pprof constantly generates SIGPROF which prevents program loading indefinitely. The right thing to do probably is to disable signals in the upper layers while loading, but it still would be nice to get some error from libbpf instead of an endless loop. Let's add some small retry limit to the program loading: try loading the program 5 (arbitrary) times and give up. v2: * 10 -> 5 retires (Andrii Nakryiko) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202231332.3923644-1-sdf@google.com
2020-12-03Merge tag 's390-5.10-6' of ↵Linus Torvalds3-21/+13
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: "One commit is fixing lockdep irq state tracing which broke with -rc6. The other one fixes logical vs physical CPU address mixup in our PCI code. Summary: - fix lockdep irq state tracing - fix logical vs physical CPU address confusion in PCI code" * tag 's390-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: fix irq state tracing s390/pci: fix CPU address in MSI for directed IRQ
2020-12-03libbpf: Sanitise map names before pinningToke Høiland-Jørgensen1-0/+12
When we added sanitising of map names before loading programs to libbpf, we still allowed periods in the name. While the kernel will accept these for the map names themselves, they are not allowed in file names when pinning maps. This means that bpf_object__pin_maps() will fail if called on an object that contains internal maps (such as sections .rodata). Fix this by replacing periods with underscores when constructing map pin paths. This only affects the paths generated by libbpf when bpf_object__pin_maps() is called with a path argument. Any pin paths set by bpf_map__set_pin_path() are unaffected, and it will still be up to the caller to avoid invalid characters in those. Fixes: 113e6b7e15e2 ("libbpf: Sanitise internal map names so they are not rejected by the kernel") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203093306.107676-1-toke@redhat.com
2020-12-03Merge tag '9p-for-5.10-rc7' of git://github.com/martinetd/linuxLinus Torvalds1-0/+12
Pull 9p fixes from Dominique Martinet: "Restore splice functionality for 9p" * tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux: fs: 9p: add generic splice_write file operation fs: 9p: add generic splice_read file operations
2020-12-03libbpf: Fail early when loading programs with unspecified typeAndrei Matei1-1/+14
Before this patch, a program with unspecified type (BPF_PROG_TYPE_UNSPEC) would be passed to the BPF syscall, only to have the kernel reject it with an opaque invalid argument error. This patch makes libbpf reject such programs with a nicer error message - in particular libbpf now tries to diagnose bad ELF section names at both open time and load time. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203043410.59699-1-andreimatei1@gmail.com
2020-12-03Merge branch 'Fixes for ima selftest'Andrii Nakryiko2-44/+64
KP Singh says: ==================== From: KP Singh <kpsingh@google.com> # v3 -> v4 * Fix typos. * Update commit message for the indentation patch. * Added Andrii's acks. # v2 -> v3 * Added missing tags. * Indentation fixes + some other fixes suggested by Andrii. * Re-indent file to tabs. The selftest for the bpf_ima_inode_hash helper uses a shell script to setup the system for ima. While this worked without an issue on recent desktop distros, it failed on environments with stripped out shells like busybox which is also used by the bpf CI. This series fixes the assumptions made on the availablity of certain command line switches and the expectation that securityfs being mounted by default. It also adds the missing kernel config dependencies in tools/testing/selftests/bpf and, lastly, changes the indentation of ima_setup.sh to use tabs. ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2020-12-03selftests/bpf: Indent ima_setup.sh with tabs.KP Singh1-54/+54
The file was formatted with spaces instead of tabs and went unnoticed as checkpatch.pl did not complain (probably because this is a shell script). Re-indent it with tabs to be consistent with other scripts. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-5-kpsingh@chromium.org
2020-12-03selftests/bpf: Add config dependency on BLK_DEV_LOOPKP Singh1-0/+1
The ima selftest restricts its scope to a test filesystem image mounted on a loop device and prevents permanent ima policy changes for the whole system. Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-4-kpsingh@chromium.org
2020-12-03selftests/bpf: Ensure securityfs mount before writing ima policyKP Singh1-0/+15
SecurityFS may not be mounted even if it is enabled in the kernel config. So, check if the mount exists in /proc/mounts by parsing the file and, if not, mount it on /sys/kernel/security. Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-3-kpsingh@chromium.org
2020-12-03selftests/bpf: Update ima_setup.sh for busyboxKP Singh1-4/+8
losetup on busybox does not output the name of loop device on using -f with --show. It also doesn't support -j to find the loop devices for a given backing file. losetup is updated to use "-a" which is available on busybox. blkid does not support options (-s and -o) to only display the uuid, so parse the output instead. Not all environments have mkfs.ext4, the test requires a loop device with a backing image file which could formatted with any filesystem. Update to using mkfs.ext2 which is available on busybox. Fixes: 34b82d3ac105 ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201203191437.666737-2-kpsingh@chromium.org
2020-12-03Merge branch 'mlx5-fixes-2020-12-01'Jakub Kicinski7-10/+51
Saeed Mahameed says: ==================== mlx5 fixes 2020-12-01 This series introduces some fixes to mlx5 driver. ==================== Link: https://lore.kernel.org/r/20201203043946.235385-1-saeedm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steeringYevgeny Kliteynik4-1/+15
STEs format for Connect-X5 and Connect-X6DX different. Currently, on Connext-X6DX the SW steering would break at some point when building STEs w/o giving a proper error message. Fix this by checking the STE format of the current device when initializing domain: add mlx5_ifc definitions for Connect-X6DX SW steering, read FW capability to get the current format version, and check this version when domain is being created. Fixes: 26d688e33f88 ("net/mlx5: DR, Add Steering entry (STE) utilities") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net/mlx5e: kTLS, Enforce HW TX csum offload with kTLSTariq Toukan1-7/+15
Checksum calculation cannot be done in SW for TX kTLS HW offloaded packets. Offload it to the device, disregard the declared state of the TX csum offload feature. Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: mlx5e: fix fs_tcp.c build when IPV6 is not enabledRandy Dunlap1-0/+2
Fix build when CONFIG_IPV6 is not enabled by making a function be built conditionally. Fixes these build errors and warnings: ../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c: In function 'accel_fs_tcp_set_ipv6_flow': ../include/net/sock.h:380:34: error: 'struct sock_common' has no member named 'skc_v6_daddr'; did you mean 'skc_daddr'? 380 | #define sk_v6_daddr __sk_common.skc_v6_daddr | ^~~~~~~~~~~~ ../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c:55:14: note: in expansion of macro 'sk_v6_daddr' 55 | &sk->sk_v6_daddr, 16); | ^~~~~~~~~~~ At top level: ../drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c:47:13: warning: 'accel_fs_tcp_set_ipv6_flow' defined but not used [-Wunused-function] 47 | static void accel_fs_tcp_set_ipv6_flow(struct mlx5_flow_spec *spec, struct sock *sk) Fixes: 5229a96e59ec ("net/mlx5e: Accel, Expose flow steering API for rules add/del") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net/mlx5: Fix wrong address reclaim when command interface is downEran Ben Elisha1-2/+19
When command interface is down, driver to reclaim all 4K page chucks that were hold by the Firmeware. Fix a bug for 64K page size systems, where driver repeatedly released only the first chunk of the page. Define helper function to fill 4K chunks for a given Firmware pages. Iterate over all unreleased Firmware pages and call the hepler per each. Fixes: 5adff6a08862 ("net/mlx5: Fix incorrect page count when in internal error") Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net/sched: act_mpls: ensure LSE is pullable before reading itDavide Caratti1-0/+3
when 'act_mpls' is used to mangle the LSE, the current value is read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is contained in the skb "linear" area. Found by code inspection. v2: - use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: openvswitch: ensure LSE is pullable before reading itDavide Caratti1-0/+3
when openvswitch is configured to mangle the LSE, the current value is read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is contained in the skb "linear" area. Found by code inspection. Fixes: d27cf5c59a12 ("net: core: add MPLS update core helper and use in OvS") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: skbuff: ensure LSE is pullable before decrementing the MPLS ttlDavide Caratti1-0/+3
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in the skb "linear" area. Fix this calling pskb_may_pull() before reading the current ttl. Found by code inspection. Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03Merge tag 'wireless-drivers-2020-12-03' of ↵Jakub Kicinski5-13/+19
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for v5.10 Second, and most likely final, set of fixes for v5.10. Small fixes and PCI id addtions. iwlwifi * PCI id additions mt76 * fix a kernel crash during device removal rtw88 * fix uninitialized memory in debugfs code * tag 'wireless-drivers-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers: rtw88: debug: Fix uninitialized memory in debugfs code mt76: usb: fix crash on device removal iwlwifi: pcie: add some missing entries for AX210 iwlwifi: pcie: invert values of NO_160 device config entries iwlwifi: pcie: add one missing entry for AX210 iwlwifi: update MAINTAINERS entry ==================== Link: https://lore.kernel.org/r/20201203183408.EE88AC43461@smtp.codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: mvpp2: Fix error return code in mvpp2_open()Wang Hai1-0/+1
Fix to return negative error code -ENOENT from invalid configuration error handling case instead of 0, as done elsewhere in this function. Fixes: 4bb043262878 ("net: mvpp2: phylink support") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20201203141806.37966-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03chelsio/chtls: fix a double free in chtls_setkey()Dan Carpenter1-0/+1
The "skb" is freed by the transmit code in cxgb4_ofld_send() and we shouldn't use it again. But in the current code, if we hit an error later on in the function then the clean up code will call kfree_skb(skb) and so it causes a double free. Set the "skb" to NULL and that makes the kfree_skb() a no-op. Fixes: d25f2f71f653 ("crypto: chtls - Program the TLS session Key") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/X8ilb6PtBRLWiSHp@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03Merge branch 'libbpf: add support for privileged/unprivileged control ↵Alexei Starovoitov7-15/+427
separation' Mariusz Dudek says: ==================== From: Mariusz Dudek <mariuszx.dudek@intel.com> This patch series adds support for separation of eBPF program load and xsk socket creation. In for example a Kubernetes environment you can have an AF_XDP CNI or daemonset that is responsible for launching pods that execute an application using AF_XDP sockets. It is desirable that the pod runs with as low privileges as possible, CAP_NET_RAW in this case, and that all operations that require privileges are contained in the CNI or daemonset. In this case, you have to be able separate ePBF program load from xsk socket creation. Currently, this will not work with the xsk_socket__create APIs because you need to have CAP_NET_ADMIN privileges to load eBPF program and CAP_SYS_ADMIN privileges to create update xsk_bpf_maps. To be exact xsk_set_bpf_maps does not need those privileges but it takes the prog_fd and xsks_map_fd and those are known only to process that was loading eBPF program. The api bpf_prog_get_fd_by_id that looks up the fd of the prog using an prog_id and bpf_map_get_fd_by_id that looks for xsks_map_fd usinb map_id both requires CAP_SYS_ADMIN. With this patch, the pod can be run with CAP_NET_RAW capability only. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. Without this patch in case of insufficient rights ENOPERM is returned by xsk_socket__create. To resolve this privileges issue two new APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_sokcet__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Usage example: int xsk_setup_xdp_prog(int ifindex, int *xsks_map_fd) int xsk_socket__update_xskmap(struct xsk_socket *xsk, int xsks_map_fd); Inserts AF_XDP socket "fd" into the xskmap. The first patch introduces the new APIs. The second patch provides a new sample applications working as control and modification to existing xdpsock application to work with less privileges. This patch set is based on bpf-next commit 97306be45fbe (Merge branch 'switch to memcg-based memory accounting') Since v6 - rebase on 97306be45fbe to resolve RLIMIT conflicts Since v5 - fixed sample/bpf/xdpsock_user.c to resolve merge conflicts Since v4 - sample/bpf/Makefile issues fixed Since v3: - force_set_map flag removed - leaking of xsk struct fixed - unified function error returning policy implemented Since v2: - new APIs moved itto LIBBPF_0.3.0 section - struct bpf_prog_cfg_opts removed - loading own eBPF program via xsk_setup_xdp_prog functionality removed Since v1: - struct bpf_prog_cfg improved for backward/forward compatibility - API xsk_update_xskmap renamed to xsk_socket__update_xskmap - commit message formatting fixed ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-03samples/bpf: Sample application for eBPF load and socket creation splitMariusz Dudek4-6/+337
Introduce a sample program to demonstrate the control and data plane split. For the control plane part a new program called xdpsock_ctrl_proc is introduced. For the data plane part, some code was added to xdpsock_user.c to act as the data plane entity. Application xdpsock_ctrl_proc works as control entity with sudo privileges (CAP_SYS_ADMIN and CAP_NET_ADMIN are sufficient) and the extended xdpsock as data plane entity with CAP_NET_RAW capability only. Usage example: sudo ./samples/bpf/xdpsock_ctrl_proc -i <interface> sudo ./samples/bpf/xdpsock -i <interface> -q <queue_id> -n <interval> -N -l -R Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-3-mariuszx.dudek@intel.com
2020-12-03libbpf: Separate XDP program load with xsk socket creationMariusz Dudek3-9/+90
Add support for separation of eBPF program load and xsk socket creation. This is needed for use-case when you want to privide as little privileges as possible to the data plane application that will handle xsk socket creation and incoming traffic. With this patch the data entity container can be run with only CAP_NET_RAW capability to fulfill its purpose of creating xsk socket and handling packages. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. To resolve privileges issue two APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_socket__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-2-mariuszx.dudek@intel.com
2020-12-03tools/resolve_btfids: Fix some error messagesBrendan Jackman1-3/+3
Add missing newlines and fix polarity of strerror argument. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20201203102234.648540-1-jackmanb@google.com
2020-12-03selftests/bpf: Copy file using read/write in local storage testStanislav Fomichev1-10/+18
Splice (copy_file_range) doesn't work on all filesystems. I'm running test kernels on top of my read-only disk image and it uses plan9 under the hood. This prevents test_local_storage from successfully passing. There is really no technical reason to use splice, so lets do old-school read/write to copy file; this should work in all environments. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202174947.3621989-1-sdf@google.com
2020-12-03Merge branch 'bpftool: improve split BTF support'Alexei Starovoitov4-4/+30
Andrii Nakryiko says: ==================== Few follow up improvements to bpftool for split BTF support: - emit "name <anon>" for non-named BTFs in `bpftool btf show` command; - when dumping /sys/kernel/btf/<module> use /sys/kernel/btf/vmlinux as the base BTF, unless base BTF is explicitly specified with -B flag. This patch set also adds btf__base_btf() getter to access base BTF of the struct btf. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-03tools/bpftool: Auto-detect split BTFs in common casesAndrii Nakryiko1-4/+21
In case of working with module's split BTF from /sys/kernel/btf/*, auto-substitute /sys/kernel/btf/vmlinux as the base BTF. This makes using bpftool with module BTFs faster and simpler. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-4-andrii@kernel.org
2020-12-03libbpf: Add base BTF accessorAndrii Nakryiko3-0/+7
Add ability to get base BTF. It can be also used to check if BTF is split BTF. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-3-andrii@kernel.org
2020-12-03tools/bpftool: Emit name <anon> for anonymous BTFsAndrii Nakryiko1-0/+2
For consistency of output, emit "name <anon>" for BTFs without the name. This keeps output more consistent and obvious. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-2-andrii@kernel.org
2020-12-03uapi: fix statx attribute value overlap for DAX & MOUNT_ROOTEric Sandeen1-3/+6
STATX_ATTR_MOUNT_ROOT and STATX_ATTR_DAX got merged with the same value, so one of them needs fixing. Move STATX_ATTR_DAX. While we're in here, clarify the value-matching scheme for some of the attributes, and explain why the value for DAX does not match. Fixes: 80340fe3605c ("statx: add mount_root") Fixes: 712b2698e4c0 ("fs/stat: Define DAX statx attribute") Link: https://lore.kernel.org/linux-fsdevel/7027520f-7c79-087e-1d00-743bdefa1a1e@redhat.com/ Link: https://lore.kernel.org/lkml/20201202214629.1563760-1-ira.weiny@intel.com/ Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: <stable@vger.kernel.org> # 5.8 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-03pwm: sl28cpld: fix getting driver data in pwm callbacksUwe Kleine-König1-2/+4
Currently .get_state() and .apply() use dev_get_drvdata() on the struct device related to the pwm chip. This only works after .probe() called platform_set_drvdata() which in this driver happens only after pwmchip_add() and so comes possibly too late. Instead of setting the driver data earlier use the traditional container_of approach as this way the driver data is conceptually and computational nearer. Fixes: 9db33d221efc ("pwm: Add support for sl28cpld PWM controller") Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-03lib/syscall: fix syscall registers retrieval on 32-bit platformsWilly Tarreau1-2/+9
Lilith >_> and Claudio Bozzato of Cisco Talos security team reported that collect_syscall() improperly casts the syscall registers to 64-bit values leaking the uninitialized last 24 bytes on 32-bit platforms, that are visible in /proc/self/syscall. The cause is that info->data.args are u64 while syscall_get_arguments() uses longs, as hinted by the bogus pointer cast in the function. Let's just proceed like the other call places, by retrieving the registers into an array of longs before assigning them to the caller's array. This was successfully tested on x86_64, i386 and ppc32. Reference: CVE-2020-28588, TALOS-2020-1211 Fixes: 631b7abacd02 ("ptrace: Remove maxargs from task_current_syscall()") Cc: Greg KH <greg@kroah.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (ppc32) Signed-off-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-03macvlan: Support for high multicast packet rateThomas Karlsson4-2/+43
Background: Broadcast and multicast packages are enqueued for later processing. This queue was previously hardcoded to 1000. This proved insufficient for handling very high packet rates. This resulted in packet drops for multicast. While at the same time unicast worked fine. The change: This patch make the queue length adjustable to accommodate for environments with very high multicast packet rate. But still keeps the default value of 1000 unless specified. The queue length is specified as a request per macvlan using the IFLA_MACVLAN_BC_QUEUE_LEN parameter. The actual used queue length will then be the maximum of any macvlan connected to the same port. The actual used queue length for the port can be retrieved (read only) by the IFLA_MACVLAN_BC_QUEUE_LEN_USED parameter for verification. This will be followed up by a patch to iproute2 in order to adjust the parameter from userspace. Signed-off-by: Thomas Karlsson <thomas.karlsson@paneda.se> Link: https://lore.kernel.org/r/dd4673b2-7eab-edda-6815-85c67ce87f63@paneda.se Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03rtw88: debug: Fix uninitialized memory in debugfs codeDan Carpenter1-0/+2
This code does not ensure that the whole buffer is initialized and none of the callers check for errors so potentially none of the buffer is initialized. Add a memset to eliminate this bug. Fixes: e3037485c68e ("rtw88: new Realtek 802.11ac driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/X8ilOfVz3pf0T5ec@mwanda
2020-12-02Merge branch 'switch to memcg-based memory accounting'Alexei Starovoitov61-762/+533
Roman Gushchin says: ==================== Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf use cases (e.g. memlock()). Any specific value is either too low and creates false failures or is too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and charge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. 6) rlimits are generally considered as (at least partially) obsolete. They do not provide a comprehensive system for the control of physical resources: memory, cpu, io etc. All resource control developments in the recent years were related to cgroups. In order to overcome these problems let's switch to the memory cgroup-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of the memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. Cgroup-based accounting adds new requirements: 1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled. These options are usually enabled, maybe excluding tiny builds for embedded devices. 2) The system should have a configured cgroup hierarchy, including reasonable memory limits and/or guarantees. Modern systems usually delegate this task to systemd or similar task managers. Without meeting these requirements there are no limits on how much memory bpf can use and a non-root user is able to hurt the system by allocating too much. But because per-user rlimits do not provide a functional system to protect and manage physical resources anyway, anyone who seriously depends on it, should use cgroups. When a bpf map is created, the memory cgroup of the process which creates the map is recorded. Subsequently all memory allocation related to the bpf map are charged to the same cgroup. It includes allocations made from interrupts and by any processes. Bpf program memory is charged to the memory cgroup of a process which loads the program. The patchset consists of the following parts: 1) 4 mm patches are required on the mm side, otherwise vmallocs cannot be mapped to userspace 2) memcg-based accounting for various bpf objects: progs and maps 3) removal of the rlimit-based accounting 4) removal of rlimit adjustments in userspace samples v9: - always charge the saved memory cgroup, by Daniel, Toke and Alexei - added bpf_map_kzalloc() - rebase and minor fixes v8: - extended the cover letter to be more clear on new requirements, by Daniel - an approximate value is provided by map memlock info, by Alexei v7: - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei - switched allocations made from an interrupt context to new helpers, by Daniel - rebase and minor fixes v6: - rebased to the latest version of the remote charging API - fixed signatures, added acks v5: - rebased to the latest version of the remote charging API - implemented kmem accounting from an interrupt context, by Shakeel - rebased to latest changes in mm allowed to map vmallocs to userspace - fixed a build issue in kselftests, by Alexei - fixed a use-after-free bug in bpf_map_free_deferred() - added bpf line info coverage, by Shakeel - split bpf map charging preparations into a separate patch v4: - covered allocations made from an interrupt context, by Daniel - added some clarifications to the cover letter v3: - droped the userspace part for further discussions/refinements, by Andrii and Song v2: - fixed build issue, caused by the remaining rlimit-based accounting for sockhash maps ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-02bpf: samples: Do not touch RLIMIT_MEMLOCKRoman Gushchin27-133/+0
Since bpf is not using rlimit memlock for the memory accounting and control, do not change the limit in sample applications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-35-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf progsRoman Gushchin3-80/+12
Do not use rlimit-based memory accounting for bpf progs. It has been replaced with memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting infra for bpf mapsRoman Gushchin4-100/+17
Remove rlimit-based accounting infrastructure code, which is not used anymore. To provide a backward compatibility, use an approximation of the bpf map memory footprint as a "memlock" value, available to a user via map info. The approximation is based on the maximal number of elements and key and value sizes. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for bpf local storage mapsRoman Gushchin1-10/+0
Do not use rlimit-based memory accounting for bpf local storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for xskmap mapsRoman Gushchin1-10/+2
Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for stackmap mapsRoman Gushchin1-13/+3
Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash mapsRoman Gushchin1-27/+6
Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com