aboutsummaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)AuthorFilesLines
2023-11-01Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds1-11/+14
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use cases for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-10-31Merge tag 'net-next-6.7' of ↵Linus Torvalds19-312/+1766
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ...
2023-10-30Merge tag 'vfs-6.7.ctime' of ↵Linus Torvalds1-3/+2
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode time accessor updates from Christian Brauner: "This finishes the conversion of all inode time fields to accessor functions as discussed on list. Changing timestamps manually as we used to do before is error prone. Using accessors function makes this robust. It does not contain the switch of the time fields to discrete 64 bit integers to replace struct timespec and free up space in struct inode. But after this, the switch can be trivially made and the patch should only affect the vfs if we decide to do it" * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits) fs: rename inode i_atime and i_mtime fields security: convert to new timestamp accessors selinux: convert to new timestamp accessors apparmor: convert to new timestamp accessors sunrpc: convert to new timestamp accessors mm: convert to new timestamp accessors bpf: convert to new timestamp accessors ipc: convert to new timestamp accessors linux: convert to new timestamp accessors zonefs: convert to new timestamp accessors xfs: convert to new timestamp accessors vboxsf: convert to new timestamp accessors ufs: convert to new timestamp accessors udf: convert to new timestamp accessors ubifs: convert to new timestamp accessors tracefs: convert to new timestamp accessors sysv: convert to new timestamp accessors squashfs: convert to new timestamp accessors server: convert to new timestamp accessors client: convert to new timestamp accessors ...
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-3/+1
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/[email protected] Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-26bpf: Add more WARN_ON_ONCE checks for mismatched alloc and freeHou Tao1-0/+4
There are two possible mismatched alloc and free cases in BPF memory allocator: 1) allocate from cache X but free by cache Y with a different unit_size 2) allocate from per-cpu cache but free by kmalloc cache or vice versa So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to spot these mismatched alloc and free early. Signed-off-by: Hou Tao <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-24netkit, bpf: Add bpf programmable net deviceDaniel Borkmann1-5/+25
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the drivers xmit routine and therefore e.g. in case of containers/Pods moving BPF processing closer to the source. One of the goals was that in case of Pod egress traffic, this allows to move BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms, for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides). In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon the device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached. Additionally, the device can be operated in L3 (default) or L2 mode. The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated. Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management and the peer will operate in default drop mode, so that no traffic is leaving between the time when a Pod is brought up by the CNI plugin and programs attached by the agent. Additionally, the programs we attach via tcx on the physical devices are using bpf_redirect_peer() for inbound traffic into netkit device, hence the latter is also supporting the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device directly. Also, BIG TCP is supported on netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one. An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series. Co-developed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://github.com/borkmann/iproute2/tree/pr/netkit Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.) Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-24bpf: Improve JEQ/JNE branch taken logicAndrii Nakryiko1-0/+8
When determining if an if/else branch will always or never be taken, use signed range knowledge in addition to currently used unsigned range knowledge. If either signed or unsigned range suggests that condition is always/never taken, return corresponding branch_taken verdict. Current use of unsigned range for this seems arbitrary and unnecessarily incomplete. It is possible for *signed* operations to be performed on register, which could "invalidate" unsigned range for that register. In such case branch_taken will be artificially useless, even if we can still tell that some constant is outside of register value range based on its signed bounds. veristat-based validation shows zero differences across selftests, Cilium, and Meta-internal BPF object files. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Shung-Hsi Yu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-24bpf: Fold smp_mb__before_atomic() into atomic_set_release()Paul E. McKenney1-2/+1
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set() immediately preceded by smp_mb__before_atomic() so as to order storing of ring-buffer consumer and producer positions prior to the atomic_set() call's clearing of the ->busy flag, as follows: smp_mb__before_atomic(); atomic_set(&rb->busy, 0); Although this works given current architectures and implementations, and given that this only needs to order prior writes against a later write. However, it does so by accident because the smp_mb__before_atomic() is only guaranteed to work with read-modify-write atomic operations, and not at all with things like atomic_set() and atomic_read(). Note especially that smp_mb__before_atomic() will not, repeat *not*, order the prior write to "a" before the subsequent non-read-modify-write atomic read from "b", even on strongly ordered systems such as x86: WRITE_ONCE(a, 1); smp_mb__before_atomic(); r1 = atomic_read(&b); Therefore, replace the smp_mb__before_atomic() and atomic_set() with atomic_set_release() as follows: atomic_set_release(&rb->busy, 0); This is no slower (and sometimes is faster) than the original, and also provides a formal guarantee of ordering that the original lacks. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: David Vernet <[email protected]> Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
2023-10-24bpf: Fix unnecessary -EBUSY from htab_lock_bucketSong Liu1-2/+5
htab_lock_bucket uses the following logic to avoid recursion: 1. preempt_disable(); 2. check percpu counter htab->map_locked[hash] for recursion; 2.1. if map_lock[hash] is already taken, return -BUSY; 3. raw_spin_lock_irqsave(); However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ logic will not able to access the same hash of the hashtab and get -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQ before checking map_locked: 1. preempt_disable(); 2. local_irq_save(); 3. check percpu counter htab->map_locked[hash] for recursion; 3.1. if map_lock[hash] is already taken, return -BUSY; 4. raw_spin_lock(). Similarly, use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket(). Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Song Liu <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2023-10-23bpf: print full verifier states on infinite loop detectionEduard Zingerman1-0/+4
Additional logging in is_state_visited(): if infinite loop is detected print full verifier state for both current and equivalent states. Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf: correct loop detection for iterators convergenceEduard Zingerman1-4/+203
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below: (I) initial (II) initial | | V V .---------> hdr .. | | | | V V | .------... .------.. | | | | | | V V V V | ... ... .-> hdr .. | | | | | | | V V | V V | succ <- cur | succ <- cur | | | | | V | V | ... | ... | | | | '----' '----' For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly. Co-developed-by: Andrii Nakryiko <[email protected]> Co-developed-by: Alexei Starovoitov <[email protected]> Reviewed-by: Andrii Nakryiko <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf: exact states comparison for iterator convergence checksEduard Zingerman1-31/+187
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes sub-state relation using read and precision marks. Read and precision marks are propagated from children states, thus are not guaranteed to be complete inside a loop when branches count > 1. This could be demonstrated using the following unsafe program: 1. r7 = -16 2. r6 = bpf_get_prandom_u32() 3. while (bpf_iter_num_next(&fp[-8])) { 4. if (r6 != 42) { 5. r7 = -32 6. r6 = bpf_get_prandom_u32() 7. continue 8. } 9. r0 = r10 10. r0 += r7 11. r8 = *(u64 *)(r0 + 0) 12. r6 = bpf_get_prandom_u32() 13. } Here verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visitied yet existing checkpoint at 3 does not have read or precision mark for r7. Thus states_equal() would return true and verifier would discard current state, thus unsafe memory access at 11 would not be caught. This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic: - registers are compared using regs_exact() regardless of read or precision marks; - stack slots have to have identical type. Unfortunately, this is too strict even for simple programs like below: i = 0; while(iter_next(&it)) i++; At each iteration step i++ would produce a new distinct state and eventually instruction processing limit would be reached. To avoid such behavior speculatively forget (widen) range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly. This a conservative heuristic that allows to verify wide range of programs, however it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one of such cases: unsigned int seen = 0; ... bpf_for_each(task_vma, vma, task, 0) { if (seen >= 1000) break; ... seen++; } Here clang generates the following code: <LBB0_4>: 24: r8 = r6 ; stash current value of ... body ... 'seen' 29: r1 = r10 30: r1 += -0x8 31: call bpf_iter_task_vma_next 32: r6 += 0x1 ; seen++; 33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL 34: r7 += 0x10 35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000 <LBB0_6>: ... exit ... Note that counter in r6 is copied to r8 and then incremented, conditional jump is done using r8. Because of this precision mark for r6 lags one state behind of precision mark on r8 and widening logic kicks in. Adding barrier_var(seen) after conditional is sufficient to force clang use the same register for both counting and conditional jump. This issue was discussed in the thread [1] which was started by Andrew Werner <[email protected]> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch. [1] https://lore.kernel.org/bpf/[email protected]/ Co-developed-by: Andrii Nakryiko <[email protected]> Co-developed-by: Alexei Starovoitov <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf: extract same_callsites() as utility functionEduard Zingerman1-5/+15
Extract same_callsites() from clean_live_states() as a utility function. This function would be used by the next patch in the set. Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf: move explored_state() closer to the beginning of verifier.cEduard Zingerman1-15/+13
Subsequent patches would make use of explored_state() function. Move it up to avoid adding unnecessary prototype. Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-23bpf, tcx: Get rid of tcx_link_constDaniel Borkmann1-2/+2
Small clean up to get rid of the extra tcx_link_const() and only retain the tcx_link(). Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-20bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()Hou Tao2-10/+16
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free": ------------[ cut here ]------------ WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342 CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:bpf_mem_refill+0x21c/0x2a0 ...... Call Trace: <IRQ> ? bpf_mem_refill+0x21c/0x2a0 irq_work_single+0x27/0x70 irq_work_run_list+0x2a/0x40 irq_work_run+0x18/0x40 __sysvec_irq_work+0x1c/0xc0 sysvec_irq_work+0x73/0x90 </IRQ> <TASK> asm_sysvec_irq_work+0x1b/0x20 RIP: 0010:unit_free+0x50/0x80 ...... bpf_mem_free+0x46/0x60 __bpf_obj_drop_impl+0x40/0x90 bpf_obj_free_fields+0x17d/0x1a0 array_map_free+0x6b/0x170 bpf_map_free_deferred+0x54/0xa0 process_scheduled_works+0xba/0x370 worker_thread+0x16d/0x2e0 kthread+0x105/0x140 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1b/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The reason is simple: __bpf_obj_drop_impl() does not know the freeing field is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-bytes unit_size will always be selected to do the unmatched free and it will trigger the warning in free_bulk() eventually. Because per-cpu kptr doesn't support list or rb-tree now, so fix the problem by only checking whether or not the type of kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.hHou Tao2-4/+0
both syscall.c and helpers.c have the declaration of __bpf_obj_drop_impl(), so just move it to a common header file. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao1-2/+14
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-20bpf: Re-enable unit_size checking for global per-cpu allocatorHou Tao1-10/+12
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size. Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: Let bpf_iter_task_new accept null task ptrChuyi Zhou2-3/+17
When using task_iter to iterate all threads of a specific task, we enforce that the user must pass a valid task pointer to ensure safety. However, when iterating all threads/process in the system, BPF verifier still require a valid ptr instead of "nullable" pointer, even though it's pointless, which is a kind of surprising from usability standpoint. It would be nice if we could let that kfunc accept a explicit null pointer when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer when using BPF_TASK_ITER_THREAD. Given a trival kfunc: __bpf_kfunc void FN(struct TYPE_A *obj); BPF Prog would reject a nullptr for obj. The error info is: "arg#x pointer type xx xx must point to scalar, or struct with scalar" reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and the btf type of ref_t is not scalar or scalar_struct which leads to the rejection of get_kfunc_ptr_arg_type. This patch add "__nullable" annotation: __bpf_kfunc void FN(struct TYPE_A *obj__nullable); Here __nullable indicates obj can be optional, user can pass a explicit nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will detect whether the current arg is optional and register is null, If so, return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next arg in check_kfunc_args(). Signed-off-by: Chuyi Zhou <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: teach the verifier to enforce css_iter and task_iter in RCU CSChuyi Zhou2-13/+41
css_iter and task_iter should be used in rcu section. Specifically, in sleepable progs explicit bpf_rcu_read_lock() is needed before use these iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to use them directly. This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check whether we are in rcu cs before we want to invoke this kfunc. If the rcu protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU. Once user do rcu_unlock during the iteration, state MEM_RCU of regs would be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED. It is worth noting that currently, bpf_rcu_read_unlock does not clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg search STACK_ITER. Signed-off-by: Chuyi Zhou <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: Introduce css open-coded iterator kfuncsChuyi Zhou2-0/+68
This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css in open-coded iterator style. These kfuncs actually wrapps css_next_descendant_{pre, post}. css_iter can be used to: 1) iterating a sepcific cgroup tree with pre/post/up order 2) iterating cgroup_subsystem in BPF Prog, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel. The API design is consistent with cgroup_iter. bpf_iter_css_new accepts parameters defining iteration order and starting css. Here we also reuse BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST, BPF_CGROUP_ITER_ANCESTORS_UP enums. Signed-off-by: Chuyi Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: Introduce task open coded iterator kfuncsChuyi Zhou2-0/+93
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task in open-coded iterator style. BPF programs can use these kfuncs or through bpf_for_each macro to iterate all processes in the system. The API design keep consistent with SEC("iter/task"). bpf_iter_task_new() accepts a specific task and iterating type which allows: 1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS) 2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS) 3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS) Signed-off-by: Chuyi Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: Introduce css_task open-coded iterator kfuncsChuyi Zhou3-0/+84
This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css_task in open-coded iterator style. These kfuncs actually wrapps css_task_iter_{start,next, end}. BPF programs can use these kfuncs through bpf_for_each macro for iteration of all tasks under a css. css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so the bpf side has to be careful in where it allows to use this iter. Currently we only allow it in bpf_lsm and bpf iter-s. Signed-off-by: Chuyi Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-10-19bpf: Add sockptr support for setsockoptBreno Leitao1-2/+3
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in setsockptr BPF hooks, so, it could be used by other callers. The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval, but, a kernel value for optlen. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Breno Leitao <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-10-19bpf: Add sockptr support for getsockoptBreno Leitao1-9/+11
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in getsockptr BPF hooks, so, it could be used by other callers. The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval, but, a kernel value for optlen. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Breno Leitao <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-10-19file: convert to SLAB_TYPESAFE_BY_RCUChristian Brauner1-3/+1
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-10-18bpf: convert to new timestamp accessorsJeff Layton1-3/+2
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-17bpf: Fix missed rcu read lock in bpf_task_under_cgroup()Yafang Shao1-1/+6
When employed within a sleepable program not under RCU protection, the use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log, particularly when CONFIG_PROVE_RCU is enabled: [ 1259.662357] WARNING: suspicious RCU usage [ 1259.662358] 6.5.0+ #33 Not tainted [ 1259.662360] ----------------------------- [ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage! Other info that might help to debug this: [ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1 [ 1259.662368] 1 lock held by trace/72954: [ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0 Stack backtrace: [ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33 [ 1259.662391] Call Trace: [ 1259.662393] <TASK> [ 1259.662395] dump_stack_lvl+0x6e/0x90 [ 1259.662401] dump_stack+0x10/0x20 [ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0 [ 1259.662412] task_css_set.part.0+0x23/0x30 [ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0 [ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93 [ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000 [ 1259.662439] bpf_lsm_bpf+0x5/0x20 [ 1259.662443] ? security_bpf+0x32/0x50 [ 1259.662452] __sys_bpf+0xe6/0xdd0 [ 1259.662463] __x64_sys_bpf+0x1a/0x30 [ 1259.662467] do_syscall_64+0x38/0x90 [ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1259.662479] RIP: 0033:0x7f487baf8e29 [...] [ 1259.662504] </TASK> This issue can be reproduced by executing a straightforward program, as demonstrated below: SEC("lsm.s/bpf") int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size) { struct cgroup *cgrp = NULL; struct task_struct *task; int ret = 0; if (cmd != BPF_LINK_CREATE) return 0; // The cgroup2 should be mounted first cgrp = bpf_cgroup_from_id(1); if (!cgrp) goto out; task = bpf_get_current_task_btf(); if (bpf_task_under_cgroup(task, cgrp)) ret = -1; bpf_cgroup_release(cgrp); out: return ret; } After running the program, if you subsequently execute another BPF program, you will encounter the warning. It's worth noting that task_under_cgroup_hierarchy() is also utilized by bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup() doesn't exhibit this issue because it cannot be used in sleepable BPF programs. Fixes: b5ad4cdc46c7 ("bpf: Add bpf_task_under_cgroup() kfunc") Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Cc: Feng Zhou <[email protected]> Cc: KP Singh <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-17net, bpf: Add a warning if NAPI cb missed xdp_do_flush().Sebastian Andrzej Siewior2-0/+20
A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-16Merge tag 'for-netdev' of ↵Jakub Kicinski8-39/+203
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-16bpf: Ensure proper register state printing for cond jumpsAndrii Nakryiko1-1/+6
Verifier emits relevant register state involved in any given instruction next to it after `;` to the right, if possible. Or, worst case, on the separate line repeating instruction index. E.g., a nice and simple case would be: 2: (d5) if r0 s<= 0x0 goto pc+1 ; R0_w=0 But if there is some intervening extra output (e.g., precision backtracking log) involved, we are supposed to see the state after the precision backtrack log: 4: (75) if r0 s>= 0x0 goto pc+1 mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1 mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0 6: R0_w=0 First off, note that in `6: R0_w=0` instruction index corresponds to the next instruction, not to the conditional jump instruction itself, which is wrong and we'll get to that. But besides that, the above is a happy case that does work today. Yet, if it so happens that precision backtracking had to traverse some of the parent states, this `6: R0_w=0` state output would be missing. This is due to a quirk of print_verifier_state() routine, which performs mark_verifier_state_clean(env) at the end. This marks all registers as "non-scratched", which means that subsequent logic to print *relevant* registers (that is, "scratched ones") fails and doesn't see anything relevant to print and skips the output altogether. print_verifier_state() is used both to print instruction context, but also to print an **entire** verifier state indiscriminately, e.g., during precision backtracking (and in a few other situations, like during entering or exiting subprogram). Which means if we have to print entire parent state before getting to printing instruction context state, instruction context is marked as clean and is omitted. Long story short, this is definitely not intentional. So we fix this behavior in this patch by teaching print_verifier_state() to clear scratch state only if it was used to print instruction state, not the parent/callback state. This is determined by print_all option, so if it's not set, we don't clear scratch state. This fixes missing instruction state for these cases. As for the mismatched instruction index, we fix that by making sure we call print_insn_state() early inside check_cond_jmp_op() before we adjusted insn_idx based on jump branch taken logic. And with that we get desired correct information: 9: (16) if w4 == 0x1 goto pc+9 mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1 mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9 mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5 mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1 9: R4=1 Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-16bpf: Disambiguate SCALAR register state output in verifier logsAndrii Nakryiko1-22/+45
Currently the way that verifier prints SCALAR_VALUE register state (and PTR_TO_PACKET, which can have var_off and ranges info as well) is very ambiguous. In the name of brevity we are trying to eliminate "unnecessary" output of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if possible. Current rules are that if any of those have their default value (which for mins is the minimal value of its respective types: 0, S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or U64_MAX) *OR* if there is another min/max value that as matching value. E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin altogether. This approach has a few problems, being both ambiguous and sort-of incorrect in some cases. Ambiguity is due to missing value could be either default value or value of umin/umax or smin/smax. This is especially confusing when we mix signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll have only `umin=0` leaving anyone reading verifier log to guess whether smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And often times it's important to know, especially when debugging tricky issues. "Sort-of incorrectness" comes from mixing negative and positive values. E.g., if umin is some large positive number, it can be equal to smin which is, interpreted as signed value, is actually some negative value. Currently, that smin will be omitted and only umin will be emitted with a large positive value, giving an impression that smin is also positive. Anyway, ambiguity is the biggest issue making it impossible to have an exact understanding of register state, preventing any sort of automated testing of verifier state based on verifier log. This patch is attempting to rectify the situation by removing ambiguity, while minimizing the verboseness of register state output. The rules are straightforward: - if some of the values are missing, then it definitely has a default value. I.e., `umin=0` means that umin is zero, but smin is actually S64_MIN; - all the various boundaries that happen to have the same value are emitted in one equality separated sequence. E.g., if umin and smin are both 100, we'll emit `smin=umin=100`, making this explicit; - we do not mix negative and positive values together, and even if they happen to have the same bit-level value, they will be emitted separately with proper sign. I.e., if both umax and smax happen to be 0xffffffffffffffff, we'll emit them both separately as `smax=-1,umax=18446744073709551615`; - in the name of a bit more uniformity and consistency, {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to improve readability. The above means that in case of all 4 ranges being, say, [50, 100] range, we'd previously see hugely ambiguous: R1=scalar(umin=50,umax=100) Now, we'll be more explicit: R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100) This is slightly more verbose, but distinct from the case when we don't know anything about signed boundaries and 32-bit boundaries, which under new rules will match the old case: R1=scalar(umin=50,umax=100) Also, in the name of simplicity of implementation and consistency, order for {s,u}32_{min,max} are emitted *before* var_off. Previously they were emitted afterwards, for unclear reasons. This patch also includes a few fixes to selftests that expect exact register state to accommodate slight changes to verifier format. You can see that the changes are pretty minimal in common cases. Note, the special case when SCALAR_VALUE register is a known constant isn't changed, we'll emit constant value once, interpreted as signed value. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-13bpf: Introduce task_vma open-coded iterator kfuncsDave Marchevsky2-0/+94
This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task_vma in open-coded iterator style. BPF programs can use these kfuncs directly or through bpf_for_each macro for natural-looking iteration of all task vmas. The implementation borrows heavily from bpf_find_vma helper's locking - differing only in that it holds the mmap_read lock for all iterations while the helper only executes its provided callback on a maximum of 1 vma. Aside from locking, struct vma_iterator and vma_next do all the heavy lifting. A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the only field in struct bpf_iter_task_vma. This is because the inner data struct contains a struct vma_iterator (not ptr), whose size is likely to change under us. If bpf_iter_task_vma_kern contained vma_iterator directly such a change would require change in opaque bpf_iter_task_vma struct's size. So better to allocate vma_iterator using BPF allocator, and since that alloc must already succeed, might as well allocate all iter fields, thereby freezing struct bpf_iter_task_vma size. Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-13bpf: Don't explicitly emit BTF for struct btf_iter_numDave Marchevsky1-2/+0
Commit 6018e1f407cc ("bpf: implement numbers iterator") added the BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num doesn't exist, so only a forward declaration is emitted in BTF: FWD 'btf_iter_num' fwd_kind=struct That commit was probably hoping to ensure that struct bpf_iter_num is emitted in vmlinux BTF. A previous version of this patch changed the line to emit the correct type, but Yonghong confirmed that it would definitely be emitted regardless in [0], so this patch simply removes the line. This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't causing any issues that I noticed, aside from mild confusion when I looked through the code. [0]: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-20/+25
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: kernel/bpf/verifier.c 829955981c55 ("bpf: Fix verifier log for async callback return values") a923819fb2c5 ("bpf: Treat first argument as return value for bpf_throw") Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-11bpf: Implement cgroup sockaddr hooks for unix socketsDaan De Meyer3-3/+28
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Daan De Meyer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-11bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpfDaan De Meyer1-0/+1
As prep for adding unix socket support to the cgroup sockaddr hooks, let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix sockaddr from bpf. While this is already possible for AF_INET and AF_INET6, we'll need this kfunc when we add unix socket support since modifying the address for those requires modifying both the address and the sockaddr length. Signed-off-by: Daan De Meyer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-11bpf: Propagate modified uaddrlen from cgroup sockaddr programsDaan De Meyer1-2/+15
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents. __cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer. Signed-off-by: Daan De Meyer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-09bpf: Fix verifier log for async callback return valuesDavid Vernet1-3/+3
The verifier, as part of check_return_code(), verifies that async callbacks such as from e.g. timers, will return 0. It does this by correctly checking that R0->var_off is in tnum_const(0), which effectively checks that it's in a range of 0. If this condition fails, however, it prints an error message which says that the value should have been in (0x0; 0x1). This results in possibly confusing output such as the following in which an async callback returns 1: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1) The fix is easy -- we should just pass the tnum_const(0) as the correct range to verbose_invalid_scalar(), which will then print the following: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0) Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: David Vernet <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-09bpf: Add ability to pin bpf timer to calling CPUDavid Vernet1-1/+4
BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-10-06bpf: Refuse unused attributes in bpf_prog_{attach,detach}Lorenz Bauer1-5/+14
The recently added tcx attachment extended the BPF UAPI for attaching and detaching by a couple of fields. Those fields are currently only supported for tcx, other types like cgroups and flow dissector silently ignore the new fields except for the new flags. This is problematic once we extend bpf_mprog to older attachment types, since it's hard to figure out whether the syscall really was successful if the kernel silently ignores non-zero values. Explicitly reject non-zero fields relevant to bpf_mprog for attachment types which don't use the latter yet. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Signed-off-by: Lorenz Bauer <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-06bpf: Handle bpf_mprog_query with NULL entryDaniel Borkmann2-11/+7
Improve consistency for bpf_mprog_query() API and let the latter also handle a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we copy a count of 0 and revision of 1 to user space, so that this can be fed into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self- test as part of this series has been added to assert this case. Suggested-by: Lorenz Bauer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-06bpf: Fix BPF_PROG_QUERY last field checkDaniel Borkmann1-1/+1
While working on the ebpf-go [0] library integration for bpf_mprog and tcx, Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A typical workflow is to first gather the bpf_mprog count without passing program/ link arrays, followed by the second request which contains the actual array pointers. The initial call populates count and revision fields. The second call gets rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to query.revision instead of query.link_attach_flags since the former is really the last member. It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with an on-stack bpf_attr that is memset() each time (and therefore query.revision was reset to zero). [0] https://ebpf-go.dev Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Lorenz Bauer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-10-06bpf: Annotate struct bpf_stack_map with __counted_byKees Cook1-1/+1
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle [1], add __counted_by for struct bpf_stack_map. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Link: https://lore.kernel.org/bpf/[email protected]
2023-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-30/+25
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <[email protected]>
2023-09-30bpf: Use kmalloc_size_roundup() to adjust size_indexHou Tao1-25/+19
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM). Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor <[email protected]> Closes: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Hou Tao <[email protected]> Tested-by: Emil Renner Berthing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-09-29bpf, mprog: Fix maximum program check on mprog attachmentDaniel Borkmann1-0/+3
After Paul's recent improvement to syzkaller to improve coverage for bpf_mprog and tcx, it hit a splat that the program limit was surpassed. What happened is that the maximum number of progs got added, followed by another prog add request which adds with BPF_F_BEFORE flag relative to the last program in the array. The idx >= bpf_mprog_max() check in bpf_mprog_attach() still passes because the index is below the maximum but the maximum will be surpassed. We need to add a check upfront for insertions to catch this situation. Fixes: 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs") Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Co-developed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Tested-by: [email protected] Tested-by: [email protected] Link: https://github.com/google/syzkaller/pull/4207 Link: https://lore.kernel.org/bpf/[email protected]
2023-09-25bpf: Add missed value to kprobe perf link infoJiri Olsa1-6/+8
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-09-21bpf: Disable zero-extension for BPF_MEMSXIlya Leoshkevich1-1/+1
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading to miscompilations like the one below: 24: 89 1a ff fe 00 00 00 00 "r1 = *(s16 *)(r10 - 2);" # zext_dst set 0x3ff7fdb910e: lgh %r2,-2(%r13,%r0) # load halfword 0x3ff7fdb9114: llgfr %r2,%r2 # wrong! 25: 65 10 00 03 00 00 7f ff if r1 s> 32767 goto +3 <l0_1> # check_cond_jmp_op() Disable such zero-extensions. The JITs need to insert sign-extension themselves, if necessary. Suggested-by: Puranjay Mohan <[email protected]> Signed-off-by: Ilya Leoshkevich <[email protected]> Reviewed-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>