aboutsummaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)AuthorFilesLines
2023-01-26bpf: Fix the kernel crash caused by bpf_setsockopt().Kui-Feng Lee1-1/+0
The kernel crash was caused by a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to `bpf_setsockopt()` in order to set the TCP_NODELAY flag as an example. Flags like TCP_NODELAY can prompt the kernel to flush a socket's outgoing queue, and this hook "lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by softirqs. The issue was that in certain circumstances, when `tcp_write_xmit()` was called to flush the queue, it would also allow BH (bottom-half) to run. This could lead to our program attempting to flush the same socket recursively, which caused a `skbuff` to be unlinked twice. `security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs before the sock ownership is checked in `tcp_v4_rcv()`. Consequently, if a bpf program runs on `security_sock_rcv_skb()` while under softirq conditions, it may not possess the lock needed for `bpf_setsockopt()`, thus presenting an issue. The patch fixes this issue by ensuring that a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call `bpf_setsockopt()`. The differences from v1 are - changing commit log to explain holding the lock of the sock, - emphasizing that TCP_NODELAY is not the only flag, and - adding the fixes tag. v1: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Kui-Feng Lee <[email protected]> Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-25bpf: Pass const struct bpf_prog * to .check_memberDavid Vernet1-1/+1
The .check_member field of struct bpf_struct_ops is currently passed the member's btf_type via const struct btf_type *t, and a const struct btf_member *member. This allows the struct_ops implementation to check whether e.g. an ops is supported, but it would be useful to also enforce that the struct_ops prog being loaded for that member has other qualities, like being sleepable (or not). This patch therefore updates the .check_member() callback to also take a const struct bpf_prog *prog argument. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-25bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepableDavid Vernet1-2/+3
BPF struct_ops programs currently cannot be marked as sleepable. This need not be the case -- struct_ops programs can be sleepable, and e.g. invoke kfuncs that export the KF_SLEEPABLE flag. So as to allow future struct_ops programs to invoke such kfuncs, this patch updates the verifier to allow struct_ops programs to be sleepable. A follow-on patch will add support to libbpf for specifying struct_ops.s as a sleepable struct_ops program, and then another patch will add testcases to the dummy_st_ops selftest suite which test sleepable struct_ops behavior. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-25bpf/docs: Document cpumask kfuncs in a new fileDavid Vernet1-0/+208
Now that we've added a series of new cpumask kfuncs, we should document them so users can easily use them. This patch adds a new cpumasks.rst file to document them. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-25bpf: Enable cpumasks to be queried and used as kptrsDavid Vernet2-0/+269
Certain programs may wish to be able to query cpumasks. For example, if a program that is tracing percpu operations wishes to track which tasks end up running on which CPUs, it could be useful to associate that with the tasks' cpumasks. Similarly, programs tracking NUMA allocations, CPU scheduling domains, etc, could potentially benefit from being able to see which CPUs a task could be migrated to. This patch enables these types of use cases by introducing a series of bpf_cpumask_* kfuncs. Amongst these kfuncs, there are two separate "classes" of operations: 1. kfuncs which allow the caller to allocate and mutate their own cpumask kptrs in the form of a struct bpf_cpumask * object. Such kfuncs include e.g. bpf_cpumask_create() to allocate the cpumask, and bpf_cpumask_or() to mutate it. "Regular" cpumasks such as p->cpus_ptr may not be passed to these kfuncs, and the verifier will ensure this is the case by comparing BTF IDs. 2. Read-only operations which operate on const struct cpumask * arguments. For example, bpf_cpumask_test_cpu(), which tests whether a CPU is set in the cpumask. Any trusted struct cpumask * or struct bpf_cpumask * may be passed to these kfuncs. The verifier allows struct bpf_cpumask * even though the kfunc is defined with struct cpumask * because the first element of a struct bpf_cpumask is a cpumask_t, so it is safe to cast. A follow-on patch will add selftests which validate these kfuncs, and another will document them. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-25bpf: Disallow NULLable pointers for trusted kfuncsDavid Vernet1-0/+6
KF_TRUSTED_ARGS kfuncs currently have a subtle and insidious bug in validating pointers to scalars. Say that you have a kfunc like the following, which takes an array as the first argument: bool bpf_cpumask_empty(const struct cpumask *cpumask) { return cpumask_empty(cpumask); } ... BTF_ID_FLAGS(func, bpf_cpumask_empty, KF_TRUSTED_ARGS) ... If a BPF program were to invoke the kfunc with a NULL argument, it would crash the kernel. The reason is that struct cpumask is defined as a bitmap, which is itself defined as an array, and is accessed as a memory address by bitmap operations. So when the verifier analyzes the register, it interprets it as a pointer to a scalar struct, which is an array of size 8. check_mem_reg() then sees that the register is NULL and returns 0, and the kfunc crashes when it passes it down to the cpumask wrappers. To fix this, this patch adds a check for KF_ARG_PTR_TO_MEM which verifies that the register doesn't contain a possibly-NULL pointer if the kfunc is KF_TRUSTED_ARGS. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-24bpf: Allow trusted args to walk struct when checking BTF IDsDavid Vernet2-1/+90
When validating BTF types for KF_TRUSTED_ARGS kfuncs, the verifier currently enforces that the top-level type must match when calling the kfunc. In other words, the verifier does not allow the BPF program to pass a bitwise equivalent struct, despite it being allowed according to the C standard. For example, if you have the following type: struct nf_conn___init { struct nf_conn ct; }; The C standard stipulates that it would be safe to pass a struct nf_conn___init to a kfunc expecting a struct nf_conn. The verifier currently disallows this, however, as semantically kfuncs may want to enforce that structs that have equivalent types according to the C standard, but have different BTF IDs, are not able to be passed to kfuncs expecting one or the other. For example, struct nf_conn___init may not be queried / looked up, as it is allocated but may not yet be fully initialized. On the other hand, being able to pass types that are equivalent according to the C standard will be useful for other types of kfunc / kptrs enabled by BPF. For example, in a follow-on patch, a series of kfuncs will be added which allow programs to do bitwise queries on cpumasks that are either allocated by the program (in which case they'll be a 'struct bpf_cpumask' type that wraps a cpumask_t as its first element), or a cpumask that was allocated by the main kernel (in which case it will just be a straight cpumask_t, as in task->cpus_ptr). Having the two types of cpumasks allows us to distinguish between the two for when a cpumask is read-only vs. mutatable. A struct bpf_cpumask can be mutated by e.g. bpf_cpumask_clear(), whereas a regular cpumask_t cannot be. On the other hand, a struct bpf_cpumask can of course be queried in the exact same manner as a cpumask_t, with e.g. bpf_cpumask_test_cpu(). If we were to enforce that top level types match, then a user that's passing a struct bpf_cpumask to a read-only cpumask_t argument would have to cast with something like bpf_cast_to_kern_ctx() (which itself would need to be updated to expect the alias, and currently it only accommodates a single alias per prog type). Additionally, not specifying KF_TRUSTED_ARGS is not an option, as some kfuncs take one argument as a struct bpf_cpumask *, and another as a struct cpumask * (i.e. cpumask_t). In order to enable this, this patch relaxes the constraint that a KF_TRUSTED_ARGS kfunc must have strict type matching, and instead only enforces strict type matching if a type is observed to be a "no-cast alias" (i.e., that the type names are equivalent, but one is suffixed with ___init). Additionally, in order to try and be conservative and match existing behavior / expectations, this patch also enforces strict type checking for acquire kfuncs. We were already enforcing it for release kfuncs, so this should also improve the consistency of the semantics for kfuncs. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-24bpf: Enable annotating trusted nested pointersDavid Vernet2-3/+90
In kfuncs, a "trusted" pointer is a pointer that the kfunc can assume is safe, and which the verifier will allow to be passed to a KF_TRUSTED_ARGS kfunc. Currently, a KF_TRUSTED_ARGS kfunc disallows any pointer to be passed at a nonzero offset, but sometimes this is in fact safe if the "nested" pointer's lifetime is inherited from its parent. For example, the const cpumask_t *cpus_ptr field in a struct task_struct will remain valid until the task itself is destroyed, and thus would also be safe to pass to a KF_TRUSTED_ARGS kfunc. While it would be conceptually simple to enable this by using BTF tags, gcc unfortunately does not yet support this. In the interim, this patch enables support for this by using a type-naming convention. A new BTF_TYPE_SAFE_NESTED macro is defined in verifier.c which allows a developer to specify the nested fields of a type which are considered trusted if its parent is also trusted. The verifier is also updated to account for this. A patch with selftests will be added in a follow-on change, along with documentation for this feature. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-23bpf: Support consuming XDP HW metadata from fext programsToke Høiland-Jørgensen3-32/+92
Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP programs that consume HW metadata, implement support for propagating the offload information. The extension program doesn't need to set a flag or ifindex, these will just be propagated from the target by the verifier. We need to create a separate offload object for the extension program, though, since it can be reattached to a different program later (which means we can't just inherit the offload information from the target). An additional check is added on attach that the new target is compatible with the offload information in the extension prog. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-23bpf: XDP metadata RX kfuncsStanislav Fomichev3-1/+76
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not supported by the target device, the default implementation is called instead. The verifier, at load time, replaces a call to the generic kfunc with a call to the per-device one. Per-device kfunc pointers are stored in separate struct xdp_metadata_ops. Cc: John Fastabend <[email protected]> Cc: David Ahern <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Anatoly Burakov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Maryam Tahhan <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-23bpf: Introduce device-bound XDP programsStanislav Fomichev3-30/+78
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way to associate a netdev with a BPF program at load time. netdevsim checks are dropped in favor of generic check in dev_xdp_attach. Cc: John Fastabend <[email protected]> Cc: David Ahern <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Anatoly Burakov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Maryam Tahhan <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-23bpf: Reshuffle some parts of bpf/offload.cStanislav Fomichev1-105/+117
To avoid adding forward declarations in the main patch, shuffle some code around. No functional changes. Cc: John Fastabend <[email protected]> Cc: David Ahern <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Anatoly Burakov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Maryam Tahhan <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-23bpf: Move offload initialization into late_initcallStanislav Fomichev1-15/+7
So we don't have to initialize it manually from several paths. Cc: John Fastabend <[email protected]> Cc: David Ahern <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Anatoly Burakov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Maryam Tahhan <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-23bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloadedStanislav Fomichev4-24/+24
BPF offloading infra will be reused to implement bound-but-not-offloaded bpf programs. Rename existing helpers for clarity. No functional changes. Cc: John Fastabend <[email protected]> Cc: David Ahern <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Anatoly Burakov <[email protected]> Cc: Alexander Lobakin <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Maryam Tahhan <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-20bpf: Avoid recomputing spi in process_dynptr_funcKumar Kartikeya Dwivedi1-13/+11
Currently, process_dynptr_func first calls dynptr_get_spi and then is_dynptr_reg_valid_init and is_dynptr_reg_valid_uninit have to call it again to obtain the spi value. Instead of doing this twice, reuse the already obtained value (which is by default 0, and is only set for PTR_TO_STACK, and only used in that case in aforementioned functions). The input value for these two functions will either be -ERANGE or >= 1, and can either be permitted or rejected based on the respective check. Suggested-by: Joanne Koong <[email protected]> Acked-by: Joanne Koong <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Combine dynptr_get_spi and is_spi_bounds_validKumar Kartikeya Dwivedi1-42/+33
Currently, a check on spi resides in dynptr_get_spi, while others checking its validity for being within the allocated stack slots happens in is_spi_bounds_valid. Almost always barring a couple of cases (where being beyond allocated stack slots is not an error as stack slots need to be populated), both are used together to make checks. Hence, subsume the is_spi_bounds_valid check in dynptr_get_spi, and return -ERANGE to specially distinguish the case where spi is valid but not within allocated slots in the stack state. The is_spi_bounds_valid function is still kept around as it is a generic helper that will be useful for other objects on stack similar to dynptr in the future. Suggested-by: Joanne Koong <[email protected]> Acked-by: Joanne Koong <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Allow reinitializing unreferenced dynptr stack slotsKumar Kartikeya Dwivedi1-8/+26
Consider a program like below: void prog(void) { { struct bpf_dynptr ptr; bpf_dynptr_from_mem(...); } ... { struct bpf_dynptr ptr; bpf_dynptr_from_mem(...); } } Here, the C compiler based on lifetime rules in the C standard would be well within in its rights to share stack storage for dynptr 'ptr' as their lifetimes do not overlap in the two distinct scopes. Currently, such an example would be rejected by the verifier, but this is too strict. Instead, we should allow reinitializing over dynptr stack slots and forget information about the old dynptr object. The destroy_if_dynptr_stack_slot function already makes necessary checks to avoid overwriting referenced dynptr slots. This is done to present a better error message instead of forgetting dynptr information on stack and preserving reference state, leading to an inevitable but undecipherable error at the end about an unreleased reference which has to be associated back to its allocating call instruction to make any sense to the user. Acked-by: Joanne Koong <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Invalidate slices on destruction of dynptrs on stackKumar Kartikeya Dwivedi1-12/+62
The previous commit implemented destroy_if_dynptr_stack_slot. It destroys the dynptr which given spi belongs to, but still doesn't invalidate the slices that belong to such a dynptr. While for the case of referenced dynptr, we don't allow their overwrite and return an error early, we still allow it and destroy the dynptr for unreferenced dynptr. To be able to enable precise and scoped invalidation of dynptr slices in this case, we must be able to associate the source dynptr of slices that have been obtained using bpf_dynptr_data. When doing destruction, only slices belonging to the dynptr being destructed should be invalidated, and nothing else. Currently, dynptr slices belonging to different dynptrs are indistinguishible. Hence, allocate a unique id to each dynptr (CONST_PTR_TO_DYNPTR and those on stack). This will be stored as part of reg->id. Whenever using bpf_dynptr_data, transfer this unique dynptr id to the returned PTR_TO_MEM_OR_NULL slice pointer, and store it in a new per-PTR_TO_MEM dynptr_id register state member. Finally, after establishing such a relationship between dynptrs and their slices, implement precise invalidation logic that only invalidates slices belong to the destroyed dynptr in destroy_if_dynptr_stack_slot. Acked-by: Joanne Koong <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Fix partial dynptr stack slot reads/writesKumar Kartikeya Dwivedi1-0/+88
Currently, while reads are disallowed for dynptr stack slots, writes are not. Reads don't work from both direct access and helpers, while writes do work in both cases, but have the effect of overwriting the slot_type. While this is fine, handling for a few edge cases is missing. Firstly, a user can overwrite the stack slots of dynptr partially. Consider the following layout: spi: [d][d][?] 2 1 0 First slot is at spi 2, second at spi 1. Now, do a write of 1 to 8 bytes for spi 1. This will essentially either write STACK_MISC for all slot_types or STACK_MISC and STACK_ZERO (in case of size < BPF_REG_SIZE partial write of zeroes). The end result is that slot is scrubbed. Now, the layout is: spi: [d][m][?] 2 1 0 Suppose if user initializes spi = 1 as dynptr. We get: spi: [d][d][d] 2 1 0 But this time, both spi 2 and spi 1 have first_slot = true. Now, when passing spi 2 to dynptr helper, it will consider it as initialized as it does not check whether second slot has first_slot == false. And spi 1 should already work as normal. This effectively replaced size + offset of first dynptr, hence allowing invalid OOB reads and writes. Make a few changes to protect against this: When writing to PTR_TO_STACK using BPF insns, when we touch spi of a STACK_DYNPTR type, mark both first and second slot (regardless of which slot we touch) as STACK_INVALID. Reads are already prevented. Second, prevent writing to stack memory from helpers if the range may contain any STACK_DYNPTR slots. Reads are already prevented. For helpers, we cannot allow it to destroy dynptrs from the writes as depending on arguments, helper may take uninit_mem and dynptr both at the same time. This would mean that helper may write to uninit_mem before it reads the dynptr, which would be bad. PTR_TO_MEM: [?????dd] Depending on the code inside the helper, it may end up overwriting the dynptr contents first and then read those as the dynptr argument. Verifier would only simulate destruction when it does byte by byte access simulation in check_helper_call for meta.access_size, and fail to catch this case, as it happens after argument checks. The same would need to be done for any other non-trivial objects created on the stack in the future, such as bpf_list_head on stack, or bpf_rb_root on stack. A common misunderstanding in the current code is that MEM_UNINIT means writes, but note that writes may also be performed even without MEM_UNINIT in case of helpers, in that case the code after handling meta && meta->raw_mode will complain when it sees STACK_DYNPTR. So that invalid read case also covers writes to potential STACK_DYNPTR slots. The only loophole was in case of meta->raw_mode which simulated writes through instructions which could overwrite them. A future series sequenced after this will focus on the clean up of helper access checks and bugs around that. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Fix missing var_off check for ARG_PTR_TO_DYNPTRKumar Kartikeya Dwivedi1-18/+66
Currently, the dynptr function is not checking the variable offset part of PTR_TO_STACK that it needs to check. The fixed offset is considered when computing the stack pointer index, but if the variable offset was not a constant (such that it could not be accumulated in reg->off), we will end up a discrepency where runtime pointer does not point to the actual stack slot we mark as STACK_DYNPTR. It is impossible to precisely track dynptr state when variable offset is not constant, hence, just like bpf_timer, kptr, bpf_spin_lock, etc. simply reject the case where reg->var_off is not constant. Then, consider both reg->off and reg->var_off.value when computing the stack pointer index. A new helper dynptr_get_spi is introduced to hide over these details since the dynptr needs to be located in multiple places outside the process_dynptr_func checks, hence once we know it's a PTR_TO_STACK, we need to enforce these checks in all places. Note that it is disallowed for unprivileged users to have a non-constant var_off, so this problem should only be possible to trigger from programs having CAP_PERFMON. However, its effects can vary. Without the fix, it is possible to replace the contents of the dynptr arbitrarily by making verifier mark different stack slots than actual location and then doing writes to the actual stack address of dynptr at runtime. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Acked-by: Joanne Koong <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20bpf: Fix state pruning for STACK_DYNPTR stack slotsKumar Kartikeya Dwivedi1-4/+84
The root of the problem is missing liveness marking for STACK_DYNPTR slots. This leads to all kinds of problems inside stacksafe. The verifier by default inside stacksafe ignores spilled_ptr in stack slots which do not have REG_LIVE_READ marks. Since this is being checked in the 'old' explored state, it must have already done clean_live_states for this old bpf_func_state. Hence, it won't be receiving any more liveness marks from to be explored insns (it has received REG_LIVE_DONE marking from liveness point of view). What this means is that verifier considers that it's safe to not compare the stack slot if was never read by children states. While liveness marks are usually propagated correctly following the parentage chain for spilled registers (SCALAR_VALUE and PTR_* types), the same is not the case for STACK_DYNPTR. clean_live_states hence simply rewrites these stack slots to the type STACK_INVALID since it sees no REG_LIVE_READ marks. The end result is that we will never see STACK_DYNPTR slots in explored state. Even if verifier was conservatively matching !REG_LIVE_READ slots, very next check continuing the stacksafe loop on seeing STACK_INVALID would again prevent further checks. Now as long as verifier stores an explored state which we can compare to when reaching a pruning point, we can abuse this bug to make verifier prune search for obviously unsafe paths using STACK_DYNPTR slots thinking they are never used hence safe. Doing this in unprivileged mode is a bit challenging. add_new_state is only set when seeing BPF_F_TEST_STATE_FREQ (which requires privileges) or when jmps_processed difference is >= 2 and insn_processed difference is >= 8. So coming up with the unprivileged case requires a little more work, but it is still totally possible. The test case being discussed below triggers the heuristic even in unprivileged mode. However, it no longer works since commit 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF"). Let's try to study the test step by step. Consider the following program (C style BPF ASM): 0 r0 = 0; 1 r6 = &ringbuf_map; 3 r1 = r6; 4 r2 = 8; 5 r3 = 0; 6 r4 = r10; 7 r4 -= -16; 8 call bpf_ringbuf_reserve_dynptr; 9 if r0 == 0 goto pc+1; 10 goto pc+1; 11 *(r10 - 16) = 0xeB9F; 12 r1 = r10; 13 r1 -= -16; 14 r2 = 0; 15 call bpf_ringbuf_discard_dynptr; 16 r0 = 0; 17 exit; We know that insn 12 will be a pruning point, hence if we force add_new_state for it, it will first verify the following path as safe in straight line exploration: 0 1 3 4 5 6 7 8 9 -> 10 -> (12) 13 14 15 16 17 Then, when we arrive at insn 12 from the following path: 0 1 3 4 5 6 7 8 9 -> 11 (12) We will find a state that has been verified as safe already at insn 12. Since register state is same at this point, regsafe will pass. Next, in stacksafe, for spi = 0 and spi = 1 (location of our dynptr) is skipped seeing !REG_LIVE_READ. The rest matches, so stacksafe returns true. Next, refsafe is also true as reference state is unchanged in both states. The states are considered equivalent and search is pruned. Hence, we are able to construct a dynptr with arbitrary contents and use the dynptr API to operate on this arbitrary pointer and arbitrary size + offset. To fix this, first define a mark_dynptr_read function that propagates liveness marks whenever a valid initialized dynptr is accessed by dynptr helpers. REG_LIVE_WRITTEN is marked whenever we initialize an uninitialized dynptr. This is done in mark_stack_slots_dynptr. It allows screening off mark_reg_read and not propagating marks upwards from that point. This ensures that we either set REG_LIVE_READ64 on both dynptr slots, or none, so clean_live_states either sets both slots to STACK_INVALID or none of them. This is the invariant the checks inside stacksafe rely on. Next, do a complete comparison of both stack slots whenever they have STACK_DYNPTR. Compare the dynptr type stored in the spilled_ptr, and also whether both form the same first_slot. Only then is the later path safe. Fixes: 97e03f521050 ("bpf: Add verifier support for dynptrs") Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-23/+18
drivers/net/ipa/ipa_interrupt.c drivers/net/ipa/ipa_interrupt.h 9ec9b2a30853 ("net: ipa: disable ipa interrupt during suspend") 8e461e1f092b ("net: ipa: introduce ipa_interrupt_enable()") d50ed3558719 ("net: ipa: enable IPA interrupt handlers separate from registration") https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jakub Kicinski <[email protected]>
2023-01-20bpf: Add missing btf_put to register_btf_id_dtor_kfuncsJiri Olsa1-2/+2
We take the BTF reference before we register dtors and we need to put it back when it's done. We probably won't se a problem with kernel BTF, but module BTF would stay loaded (because of the extra ref) even when its module is removed. Cc: Kumar Kartikeya Dwivedi <[email protected]> Fixes: 5ce937d613a4 ("bpf: Populate pairs of btf_id and destructor kfunc in btf") Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-19bpf: Fix to preserve reg parent/live fields when copying range infoEduard Zingerman1-7/+18
Register range information is copied in several places. The intent is to transfer range/id information from one register/stack spill to another. Currently this is done using direct register assignment, e.g.: static void find_equal_scalars(..., struct bpf_reg_state *known_reg) { ... struct bpf_reg_state *reg; ... *reg = *known_reg; ... } However, such assignments also copy the following bpf_reg_state fields: struct bpf_reg_state { ... struct bpf_reg_state *parent; ... enum bpf_reg_liveness live; ... }; Copying of these fields is accidental and incorrect, as could be demonstrated by the following example: 0: call ktime_get_ns() 1: r6 = r0 2: call ktime_get_ns() 3: r7 = r0 4: if r0 > r6 goto +1 ; r0 & r6 are unbound thus generated ; branch states are identical 5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8] --- checkpoint --- 6: r1 = 42 ; r1 marked as written 7: *(u8 *)(r10 - 8) = r1 ; 8-bit write, fp[-8] parent & live ; overwritten 8: r2 = *(u64 *)(r10 - 8) 9: r0 = 0 10: exit This example is unsafe because 64-bit write to fp[-8] at (5) is conditional, thus not all bytes of fp[-8] are guaranteed to be set when it is read at (8). However, currently the example passes verification. First, the execution path 1-10 is examined by verifier. Suppose that a new checkpoint is created by is_state_visited() at (6). After checkpoint creation: - r1.parent points to checkpoint.r1, - fp[-8].parent points to checkpoint.fp[-8]. At (6) the r1.live is set to REG_LIVE_WRITTEN. At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to REG_LIVE_WRITTEN, because of the following code called in check_stack_write_fixed_off(): static void save_register_state(struct bpf_func_state *state, int spi, struct bpf_reg_state *reg, int size) { ... state->stack[spi].spilled_ptr = *reg; // <--- parent & live copied if (size == BPF_REG_SIZE) state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; ... } Note the intent to mark stack spill as written only if 8 bytes are spilled to a slot, however this intent is spoiled by a 'live' field copy. At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but this does not happen: - fp[-8] in a current state is already marked as REG_LIVE_WRITTEN; - fp[-8].parent points to checkpoint.r1, parentage chain is used by mark_reg_read() to mark checkpoint states. At (10) the verification is finished for path 1-10 and jump 4-6 is examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this spill is pruned from the cached states by clean_live_states(). Hence verifier state obtained via path 1-4,6 is deemed identical to one obtained via path 1-6 and program marked as safe. Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag set to force creation of intermediate verifier states. This commit revisits the locations where bpf_reg_state instances are copied and replaces the direct copies with a call to a function copy_register_state(dst, src) that preserves 'parent' and 'live' fields of the 'dst'. Fixes: 679c782de14b ("bpf/verifier: per-register parent pointers") Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-19fs: port inode_init_owner() to mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2023-01-19fs: port ->permission() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2023-01-19fs: port ->mkdir() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2023-01-19fs: port ->symlink() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2023-01-18bpf: Fix off-by-one error in bpf_mem_cache_idx()Hou Tao1-1/+1
According to the definition of sizes[NUM_CACHES], when the size passed to bpf_mem_cache_size() is 256, it should return 6 instead 7. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Signed-off-by: Hou Tao <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-17bpf: Do not allow to load sleepable BPF_TRACE_RAW_TP programJiri Olsa1-3/+19
Currently we allow to load any tracing program as sleepable, but BPF_TRACE_RAW_TP can't sleep. Making the check explicit for tracing programs attach types, so sleepable BPF_TRACE_RAW_TP will fail to load. Updating the verifier error to mention iter programs as well. Acked-by: Song Liu <[email protected]> Acked-by: Yonghong Song <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-13bpf: Fix pointer-leak due to insufficient speculative store bypass mitigationLuis Gerhorst1-1/+3
To mitigate Spectre v4, 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") inserts lfence instructions after 1) initializing a stack slot and 2) spilling a pointer to the stack. However, this does not cover cases where a stack slot is first initialized with a pointer (subject to sanitization) but then overwritten with a scalar (not subject to sanitization because the slot was already initialized). In this case, the second write may be subject to speculative store bypass (SSB) creating a speculative pointer-as-scalar type confusion. This allows the program to subsequently leak the numerical pointer value using, for example, a branch-based cache side channel. To fix this, also sanitize scalars if they write a stack slot that previously contained a pointer. Assuming that pointer-spills are only generated by LLVM on register-pressure, the performance impact on most real-world BPF programs should be small. The following unprivileged BPF bytecode drafts a minimal exploit and the mitigation: [...] // r6 = 0 or 1 (skalar, unknown user input) // r7 = accessible ptr for side channel // r10 = frame pointer (fp), to be leaked // r9 = r10 # fp alias to encourage ssb *(u64 *)(r9 - 8) = r10 // fp[-8] = ptr, to be leaked // lfence added here because of pointer spill to stack. // // Ommitted: Dummy bpf_ringbuf_output() here to train alias predictor // for no r9-r10 dependency. // *(u64 *)(r10 - 8) = r6 // fp[-8] = scalar, overwrites ptr // 2039f26f3aca: no lfence added because stack slot was not STACK_INVALID, // store may be subject to SSB // // fix: also add an lfence when the slot contained a ptr // r8 = *(u64 *)(r9 - 8) // r8 = architecturally a scalar, speculatively a ptr // // leak ptr using branch-based cache side channel: r8 &= 1 // choose bit to leak if r8 == 0 goto SLOW // no mispredict // architecturally dead code if input r6 is 0, // only executes speculatively iff ptr bit is 1 r8 = *(u64 *)(r7 + 0) # encode bit in cache (0: slow, 1: fast) SLOW: [...] After running this, the program can time the access to *(r7 + 0) to determine whether the chosen pointer bit was 0 or 1. Repeat this 64 times to recover the whole address on amd64. In summary, sanitization can only be skipped if one scalar is overwritten with another scalar. Scalar-confusion due to speculative store bypass can not lead to invalid accesses because the pointer bounds deducted during verification are enforced using branchless logic. See 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") for details. Do not make the mitigation depend on !env->allow_{uninit_stack,ptr_leaks} because speculative leaks are likely unexpected if these were enabled. For example, leaking the address to a protected log file may be acceptable while disabling the mitigation might unintentionally leak the address into the cached-state of a map that is accessible to unprivileged processes. Fixes: 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") Signed-off-by: Luis Gerhorst <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Henriette Hofmeier <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2023-01-12bpf: hash map, avoid deadlock with suitable hash maskTonghao Zhang1-2/+2
The deadlock still may occur while accessed in NMI and non-NMI context. Because in NMI, we still may access the same bucket but with different map_locked index. For example, on the same CPU, .max_entries = 2, we update the hash map, with key = 4, while running bpf prog in NMI nmi_handle(), to update hash map with key = 20, so it will have the same bucket index but have different map_locked index. To fix this issue, using min mask to hash again. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Tonghao Zhang <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Yonghong Song <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Hou Tao <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-10bpf: btf: limit logging of ignored BTF mismatchesConnor O'Brien1-11/+27
Enabling CONFIG_MODULE_ALLOW_BTF_MISMATCH is an indication that BTF mismatches are expected and module loading should proceed anyway. Logging with pr_warn() on every one of these "benign" mismatches creates unnecessary noise when many such modules are loaded. Instead, handle this case with a single log warning that BTF info may be unavailable. Mismatches also result in calls to __btf_verifier_log() via __btf_verifier_log_type() or btf_verifier_log_member(), adding several additional lines of logging per mismatched module. Add checks to these paths to skip logging for module BTF mismatches in the "allow mismatch" case. All existing logging behavior is preserved in the default CONFIG_MODULE_ALLOW_BTF_MISMATCH=n case. Signed-off-by: Connor O'Brien <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-01-09bpf: remove the do_idr_lock parameter from bpf_prog_free_id()Paul Moore1-14/+6
It was determined that the do_idr_lock parameter to bpf_prog_free_id() was not necessary as it should always be true. Suggested-by: Stanislav Fomichev <[email protected]> Signed-off-by: Paul Moore <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-09bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and ↵Paul Moore2-7/+2
PERF_BPF_EVENT_PROG_UNLOAD When changing the ebpf program put() routines to support being called from within IRQ context the program ID was reset to zero prior to calling the perf event and audit UNLOAD record generators, which resulted in problems as the ebpf program ID was bogus (always zero). This patch addresses this problem by removing an unnecessary call to bpf_prog_free_id() in __bpf_prog_offload_destroy() and adjusting __bpf_prog_put() to only call bpf_prog_free_id() after audit and perf have finished their bpf program unload tasks in bpf_prog_put_deferred(). For the record, no one can determine, or remember, why it was necessary to free the program ID, and remove it from the IDR, prior to executing bpf_prog_put_deferred(); regardless, both Stanislav and Alexei agree that the approach in this patch should be safe. It is worth noting that when moving the bpf_prog_free_id() call, the do_idr_lock parameter was forced to true as the ebpf devs determined this was the correct as the do_idr_lock should always be true. The do_idr_lock parameter will be removed in a follow-up patch, but it was kept here to keep the patch small in an effort to ease any stable backports. I also modified the bpf_audit_prog() logic used to associate the AUDIT_BPF record with other associated records, e.g. @ctx != NULL. Instead of keying off the operation, it now keys off the execution context, e.g. '!in_irg && !irqs_disabled()', which is much more appropriate and should help better connect the UNLOAD operations with the associated audit state (other audit records). Cc: [email protected] Fixes: d809e134be7a ("bpf: Prepare bpf_prog_put() to be called from irq context.") Reported-by: Burn Alting <[email protected]> Reported-by: Jiri Olsa <[email protected]> Suggested-by: Stanislav Fomichev <[email protected]> Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Paul Moore <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-01-06bpf: Skip invalid kfunc call in backtrack_insnHao Sun1-0/+6
The verifier skips invalid kfunc call in check_kfunc_call(), which would be captured in fixup_kfunc_call() if such insn is not eliminated by dead code elimination. However, this can lead to the following warning in backtrack_insn(), also see [1]: ------------[ cut here ]------------ verifier backtracking bug WARNING: CPU: 6 PID: 8646 at kernel/bpf/verifier.c:2756 backtrack_insn kernel/bpf/verifier.c:2756 __mark_chain_precision kernel/bpf/verifier.c:3065 mark_chain_precision kernel/bpf/verifier.c:3165 adjust_reg_min_max_vals kernel/bpf/verifier.c:10715 check_alu_op kernel/bpf/verifier.c:10928 do_check kernel/bpf/verifier.c:13821 [inline] do_check_common kernel/bpf/verifier.c:16289 [...] So make backtracking conservative with this by returning ENOTSUPP. [1] https://lore.kernel.org/bpf/CACkBjsaXNceR8ZjkLG=dT3P=4A8SBsg0Z5h5PWLryF5=ghKq=g@mail.gmail.com/ Reported-by: [email protected] Signed-off-by: Hao Sun <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-18/+48
No conflicts. Signed-off-by: Jakub Kicinski <[email protected]>
2023-01-04Merge tag 'for-netdev' of ↵Jakub Kicinski5-114/+136
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-04 We've added 45 non-merge commits during the last 21 day(s) which contain a total of 50 files changed, 1454 insertions(+), 375 deletions(-). The main changes are: 1) Fixes, improvements and refactoring of parts of BPF verifier's state equivalence checks, from Andrii Nakryiko. 2) Fix a few corner cases in libbpf's BTF-to-C converter in particular around padding handling and enums, also from Andrii Nakryiko. 3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata, from Christian Ehrig. 4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks, from Dave Marchevsky. 5) Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers, from Jiri Olsa. 6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps, from Maryam Tahhan. 7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du. 8) Bigger batch of improvements to BPF tracing code samples, from Daniel T. Lee. 9) Add LoongArch support to libbpf's bpf_tracing helper header, from Hengqi Chen. 10) Fix a libbpf compiler warning in perf_event_open_probe on arm32, from Khem Raj. 11) Optimize bpf_local_storage_elem by removing 56 bytes of padding, from Martin KaFai Lau. 12) Use pkg-config to locate libelf for resolve_btfids build, from Shen Jiamin. 13) Various libbpf improvements around API documentation and errno handling, from Xin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) libbpf: Return -ENODATA for missing btf section libbpf: Add LoongArch support to bpf_tracing.h libbpf: Restore errno after pr_warn. libbpf: Added the description of some API functions libbpf: Fix invalid return address register in s390 samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro samples/bpf: Change _kern suffix to .bpf with syscall tracing program samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program samples/bpf: Use kyscall instead of kprobe in syscall tracing program bpf: rename list_head -> graph_root in field info types libbpf: fix errno is overwritten after being closed. bpf: fix regs_exact() logic in regsafe() to remap IDs correctly bpf: perform byte-by-byte comparison only when necessary in regsafe() bpf: reject non-exact register type matches in regsafe() bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule bpf: reorganize struct bpf_reg_state fields bpf: teach refsafe() to take into account ID remapping bpf: Remove unused field initialization in bpf's ctl_table selftests/bpf: Add jit probe_mem corner case tests to s390x denylist ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-12-28bpf: rename list_head -> graph_root in field info typesDave Marchevsky3-22/+24
Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields: include/linux/bpf.h: struct btf_field_list_head -> struct btf_field_graph_root list_head -> graph_root in struct btf_field union kernel/bpf/btf.c: list_head -> graph_root in struct btf_field_info This is a nonfunctional change, functionality to actually use these fields for rbtree will be added in further patches. Signed-off-by: Dave Marchevsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-28bpf: Always use maximal size for copy_array()Kees Cook1-5/+7
Instead of counting on prior allocations to have sized allocations to the next kmalloc bucket size, always perform a krealloc that is at least ksize(dst) in size (which is a no-op), so the size can be correctly tracked by all the various allocation size trackers (KASAN, __alloc_size, etc). Reported-by: Hyunwoo Kim <[email protected]> Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu Fixes: ceb35b666d42 ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage") Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: John Fastabend <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Yonghong Song <[email protected]> Cc: KP Singh <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-28bpf: keep a reference to the mm, in case the task is dead.Kui-Feng Lee1-12/+27
Fix the system crash that happens when a task iterator travel through vma of tasks. In task iterators, we used to access mm by following the pointer on the task_struct; however, the death of a task will clear the pointer, even though we still hold the task_struct. That can cause an unexpected crash for a null pointer when an iterator is visiting a task that dies during the visit. Keeping a reference of mm on the iterator ensures we always have a valid pointer to mm. Co-developed-by: Song Liu <[email protected]> Signed-off-by: Song Liu <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Reported-by: Nathan Slingerland <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-28bpf: Fix panic due to wrong pageattr of im->imageChuang Wang1-0/+4
In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline in bpf_trampoline_update, and then modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag will be configured, and arch_prepare_bpf_trampoline will be re-executed. At this time, because the pageattr of im->image is rox, arch_prepare_bpf_trampoline will read and write im->image, which causes a fault. as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the pageattr of im->image will be reset to nx+rw. Cc: [email protected] Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: fix regs_exact() logic in regsafe() to remap IDs correctlyAndrii Nakryiko1-7/+16
Comparing IDs exactly between two separate states is not just suboptimal, but also incorrect in some cases. So update regs_exact() check to do byte-by-byte memcmp() only up to id/ref_obj_id. For id and ref_obj_id perform proper check_ids() checks, taking into account idmap. This change makes more states equivalent improving insns and states stats across a bunch of selftest BPF programs: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- cgrp_kfunc_success.bpf.linked1.o test_cgrp_get_release 141 137 -4 (-2.84%) 13 13 +0 (+0.00%) cgrp_kfunc_success.bpf.linked1.o test_cgrp_xchg_release 142 139 -3 (-2.11%) 14 13 -1 (-7.14%) connect6_prog.bpf.linked1.o connect_v6_prog 139 102 -37 (-26.62%) 9 6 -3 (-33.33%) ima.bpf.linked1.o bprm_creds_for_exec 68 61 -7 (-10.29%) 6 5 -1 (-16.67%) linked_list.bpf.linked1.o global_list_in_list 569 499 -70 (-12.30%) 60 52 -8 (-13.33%) linked_list.bpf.linked1.o global_list_push_pop 167 150 -17 (-10.18%) 18 16 -2 (-11.11%) linked_list.bpf.linked1.o global_list_push_pop_multiple 881 815 -66 (-7.49%) 74 63 -11 (-14.86%) linked_list.bpf.linked1.o inner_map_list_in_list 579 534 -45 (-7.77%) 61 55 -6 (-9.84%) linked_list.bpf.linked1.o inner_map_list_push_pop 190 181 -9 (-4.74%) 19 18 -1 (-5.26%) linked_list.bpf.linked1.o inner_map_list_push_pop_multiple 916 850 -66 (-7.21%) 75 64 -11 (-14.67%) linked_list.bpf.linked1.o map_list_in_list 588 525 -63 (-10.71%) 62 55 -7 (-11.29%) linked_list.bpf.linked1.o map_list_push_pop 183 174 -9 (-4.92%) 18 17 -1 (-5.56%) linked_list.bpf.linked1.o map_list_push_pop_multiple 909 843 -66 (-7.26%) 75 64 -11 (-14.67%) map_kptr.bpf.linked1.o test_map_kptr 264 256 -8 (-3.03%) 26 26 +0 (+0.00%) map_kptr.bpf.linked1.o test_map_kptr_ref 95 91 -4 (-4.21%) 9 8 -1 (-11.11%) task_kfunc_success.bpf.linked1.o test_task_xchg_release 139 136 -3 (-2.16%) 14 13 -1 (-7.14%) test_bpf_nf.bpf.linked1.o nf_skb_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_bpf_nf.bpf.linked1.o nf_xdp_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_cls_redirect.bpf.linked1.o cls_redirect 78925 78390 -535 (-0.68%) 4782 4704 -78 (-1.63%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 64901 63897 -1004 (-1.55%) 4612 4470 -142 (-3.08%) test_sk_lookup.bpf.linked1.o access_ctx_sk 181 95 -86 (-47.51%) 19 10 -9 (-47.37%) test_sk_lookup.bpf.linked1.o ctx_narrow_access 447 437 -10 (-2.24%) 38 37 -1 (-2.63%) test_sk_lookup_kern.bpf.linked1.o sk_lookup_success 148 133 -15 (-10.14%) 14 12 -2 (-14.29%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_clsact 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_xdp 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_verify_pkcs7_sig.bpf.linked1.o bpf 87 76 -11 (-12.64%) 7 6 -1 (-14.29%) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: perform byte-by-byte comparison only when necessary in regsafe()Andrii Nakryiko1-12/+9
Extract byte-by-byte comparison of bpf_reg_state in regsafe() into a helper function, which makes it more convenient to use it "on demand" only for registers that benefit from such checks, instead of doing it all the time, even if result of such comparison is ignored. Also, remove WARN_ON_ONCE(1)+return false dead code. There is no risk of missing some case as compiler will warn about non-void function not returning value in some branches (and that under assumption that default case is removed in the future). Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: reject non-exact register type matches in regsafe()Andrii Nakryiko1-24/+21
Generalize the (somewhat implicit) rule of regsafe(), which states that if register types in old and current states do not match *exactly*, they can't be safely considered equivalent. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: generalize MAYBE_NULL vs non-MAYBE_NULL ruleAndrii Nakryiko1-16/+15
Make generic check to prevent XXX_OR_NULL and XXX register types to be intermixed. While technically in some situations it could be safe, it's impossible to enforce due to the loss of an ID when converting XXX_OR_NULL to its non-NULL variant. So prevent this in general, not just for PTR_TO_MAP_KEY and PTR_TO_MAP_VALUE. PTR_TO_MAP_KEY_OR_NULL and PTR_TO_MAP_VALUE_OR_NULL checks, which were previously special-cased, are simplified to generic check that takes into account range_within() and tnum_in(). This is correct as BPF verifier doesn't allow arithmetic on XXX_OR_NULL register types, so var_off and ranges should stay zero. But even if in the future this restriction is lifted, it's even more important to enforce that var_off and ranges are compatible, otherwise it's possible to construct case where this can be exploited to bypass verifier's memory range safety checks. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: reorganize struct bpf_reg_state fieldsAndrii Nakryiko1-9/+8
Move id and ref_obj_id fields after scalar data section (var_off and ranges). This is necessary to simplify next patch which will change regsafe()'s logic to be safer, as it makes the contents that has to be an exact match (type-specific parts, off, type, and var_off+ranges) a single sequential block of memory, while id and ref_obj_id should always be remapped and thus can't be memcp()'ed. There are few places that assume that var_off is after id/ref_obj_id to clear out id/ref_obj_id with the single memset(0). These are changed to explicitly zero-out id/ref_obj_id fields. Other places are adjusted to preserve exact byte-by-byte comparison behavior. No functional changes. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-27bpf: teach refsafe() to take into account ID remappingAndrii Nakryiko1-4/+12
states_equal() check performs ID mapping between old and new states to establish a 1-to-1 correspondence between IDs, even if their absolute numberic values across two equivalent states differ. This is important both for correctness and to avoid unnecessary work when two states are equivalent. With recent changes we partially fixed this logic by maintaining ID map across all function frames. This patch also makes refsafe() check take into account (and maintain) ID map, making states_equal() behavior more optimal and correct. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-12-24Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-1/+10
Daniel Borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 5 day(s) which contain a total of 11 files changed, 231 insertions(+), 3 deletions(-). The main changes are: 1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case from Martin Lau. 2) Fix BPF verifier's nullness propagation when registers are of type PTR_TO_BTF_ID, from Hao Sun. 3) Fix bpftool build for JIT disassembler under statically built libllvm, from Anton Protopopov. 4) Fix warnings reported by resolve_btfids when building vmlinux with CONFIG_SECURITY_NETWORK disabled, from Hou Tao. 5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-12-22bpf: fix nullness propagation for reg to reg comparisonsHao Sun1-1/+8
After befae75856ab, the verifier would propagate null information after JEQ/JNE, e.g., if two pointers, one is maybe_null and the other is not, the former would be marked as non-null in eq path. However, as comment "PTR_TO_BTF_ID points to a kernel struct that does not need to be null checked by the BPF program ... The verifier must keep this in mind and can make no assumptions about null or non-null when doing branch ...". If one pointer is maybe_null and the other is PTR_TO_BTF, the former is incorrectly marked non-null. The following BPF prog can trigger a null-ptr-deref, also see this report for more details[1]: 0: (18) r1 = map_fd ; R1_w=map_ptr(ks=4, vs=4) 2: (79) r6 = *(u64 *)(r1 +8) ; R6_w=bpf_map->inner_map_data ; R6 is PTR_TO_BTF_ID ; equals to null at runtime 3: (bf) r2 = r10 4: (07) r2 += -4 5: (62) *(u32 *)(r2 +0) = 0 6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null 7: (1d) if r6 == r0 goto pc+1 8: (95) exit ; from 7 to 9: R0=map_value R6=ptr_bpf_map 9: (61) r0 = *(u32 *)(r0 +0) ; null-ptr-deref 10: (95) exit So, make the verifier propagate nullness information for reg to reg comparisons only if neither reg is PTR_TO_BTF_ID. [1] https://lore.kernel.org/bpf/CACkBjsaFJwjC5oiw-1KXvcazywodwXo4zGYsRHwbr2gSG9WcSw@mail.gmail.com/T/#u Fixes: befae75856ab ("bpf: propagate nullness information for reg to reg comparisons") Signed-off-by: Hao Sun <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>