Age | Commit message (Collapse) | Author | Files | Lines |
|
Expose a new netlink family to userspace to control the PM, setting:
- list of local addresses to be signalled.
- list of local addresses used to created subflows.
- maximum number of add_addr option to react
When the msk is fully established, the PM netlink attempts to
announce the 'signal' list via the ADD_ADDR option. Since we
currently lack the ADD_ADDR echo (and related event) only the
first addr is sent.
After exhausting the 'announce' list, the PM tries to create
subflow for each addr in 'local' list, waiting for each
connection to be completed before attempting the next one.
Idea is to add an additional PM hook for ADD_ADDR echo, to allow
the PM netlink announcing multiple addresses, in sequence.
Co-developed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s"
or "nstat" will show them automatically.
The MPTCP MIB counters are allocated in a distinct pcpu area in order to
avoid bloating/wasting TCP pcpu memory.
Counters are allocated once the first MPTCP socket is created in a
network namespace and free'd on exit.
If no sockets have been allocated, all-zero mptcp counters are shown.
The MIB counter list is taken from the multipath-tcp.org kernel, but
only a few counters have been picked up so far. The counter list can
be increased at any time later on.
v2 -> v3:
- remove 'inline' in foo.c files (David S. Miller)
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
add ulp-specific diagnostic functions, so that subflow information can be
dumped to userspace programs like 'ss'.
v2 -> v3:
- uapi: use bit macros appropriate for userspace
Co-developed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Davide Caratti <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
On timeout event, schedule a work queue to do the retransmission.
Retransmission code closely resembles the sendmsg() implementation and
re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
sake - and peeking the relevant dfrag from the rtx head.
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This will simplify mptcp-level retransmission implementation
in the next patch. If dfrag is provided by the caller, skip
kernel space memory allocation and use data and metadata
provided by the dfrag itself.
Because a peer could ack data at TCP level but refrain from
sending mptcp-level ACKs, we could grow the mptcp socket
backlog indefinitely.
We should thus block mptcp_sendmsg until the peer has acked some of the
sent data.
In order to be able to do so, increment the mptcp socket wmem_queued
counter on memory allocation and decrement it when releasing the memory
on mptcp-level ack reception.
Because TCP performns sndbuf auto-tuning up to tcp_wmem_max[2], make
this the mptcp sk_sndbuf limit.
In the future we could add experiment with autotuning as TCP does in
tcp_sndbuf_expand().
v2 -> v3:
- remove 'inline' in foo.c files (David S. Miller)
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
After adding wmem accounting for the mptcp socket we could get
into a situation where the mptcp socket can't transmit more data,
and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced
because it currently will only remove entire dfrags.
Allow advancing the dfrag head sequence and reduce wmem,
even though this isn't correct (as we can't release the page).
Because we will soon block on mptcp sk in case wmem is too large,
call sk_stream_write_space() in case we reduced the backlog so
userspace task blocked in sendmsg or poll will be woken up.
This isn't an issue if the send buffer is large, but it is when
SO_SNDBUF is used to reduce it to a lower value.
Note we can still get a deadlock for low SO_SNDBUF values in
case both sides of the connection write to the socket: both could
be blocked due to wmem being too small -- and current mptcp stack
will only increment mptcp ack_seq on recv.
This doesn't happen with the selftest as it uses poll() and
will always call recv if there is data to read.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Charge the data on the rtx queue to the master MPTCP socket, too.
Such memory in uncharged when the data is acked/dequeued.
Also account mptcp sockets inuse via a protocol specific pcpu
counter.
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The timer will be used to schedule retransmission. It's
frequency is based on the current subflow RTO estimation and
is reset on every una_seq update
The timer is clearer for good by __mptcp_clear_xmit()
Also clean MPTCP rtx queue before each transmission.
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Keep the send page fragment on an MPTCP level retransmission queue.
The queue entries are allocated inside the page frag allocator,
acquiring an additional reference to the page for each list entry.
Also switch to a custom page frag refill function, to ensure that
the current page fragment can always host an MPTCP rtx queue entry.
The MPTCP rtx queue is flushed at disconnect() and close() time
Note that now we need to call __mptcp_init_sock() regardless of mptcp
enable status, as the destructor will try to walk the rtx_queue.
v2 -> v3:
- remove 'inline' in foo.c files (David S. Miller)
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
So that we keep per unacked sequence number consistent; since
we update per msk data, use an atomic64 cmpxchg() to protect
against concurrent updates from multiple subflows.
Initialize the snd_una at connect()/accept() time.
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Fill in more path manager functionality by adding a worker function and
modifying the related stub functions to schedule the worker.
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Peter Krystad <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Subflow creation may be initiated by the path manager when
the primary connection is fully established and a remote
address has been received via ADD_ADDR.
Create an in-kernel sock and use kernel_connect() to
initiate connection.
Passive sockets can't acquire the mptcp socket lock at
subflow creation time, so an additional list protected by
a new spinlock is used to track the MPJ subflows.
Such list is spliced into conn_list tail every time the msk
socket lock is acquired, so that it will not interfere
with data flow on the original connection.
Data flow and connection failover not addressed by this commit.
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Co-developed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: Peter Krystad <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.
The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Peter Krystad <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add enough of a path manager interface to allow sending of ADD_ADDR
when an incoming MPTCP connection is created. Capable of sending only
a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
sock will need to be expanded to handle multiple interfaces and IPv6.
Partial processing of the incoming ADD_ADDR is included so the path
manager notification of that event happens at the proper time, which
involves validating the incoming address information.
This is a skeleton interface definition for events generated by
MPTCP.
Co-developed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Co-developed-by: Florian Westphal <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Co-developed-by: Mat Martineau <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: Peter Krystad <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.
Co-developed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Co-developed-by: Paolo Abeni <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Peter Krystad <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
A recent commit e8937681797c ("devlink: prepare to support region
operations") used the region_cr_space_str and region_fw_health_str
variables as initializers for the devlink_region_ops structures.
This can result in compiler errors:
drivers/net/ethernet/mellanox//mlx4/crdump.c:45:10: error: initializer
element is not constant
.name = region_cr_space_str,
^
drivers/net/ethernet/mellanox//mlx4/crdump.c:45:10: note: (near
initialization for ‘region_cr_space_ops.name’)
drivers/net/ethernet/mellanox//mlx4/crdump.c:50:10: error: initializer
element is not constant
.name = region_fw_health_str,
The variables were made to be "const char * const", indicating that both
the pointer and data were constant. This was enough to resolve this on
recent GCC (gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1) for this author).
Unfortunately this is not enough for older compilers to realize that the
variable can be treated as a constant expression.
Fix this by introducing macros for the string and use those instead of
the variable name in the region ops structures.
Reported-by: tanhuazhong <[email protected]>
Fixes: e8937681797c ("devlink: prepare to support region operations")
Signed-off-by: Jacob Keller <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The devlink-region.rst and ice-region.rst documentation files wrapped
some lines within shell code blocks due to being longer than 80 lines.
It was pointed out during review that wrapping these lines shouldn't be
done. Fix these two rST files and remove the line wrapping on these
shell command examples.
Reported-by: Jiri Pirko <[email protected]>
Signed-off-by: Jacob Keller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Convert the mt7530 switch driver to use the finalised link
parameters in mac_link_up() rather than the parameters in mac_config().
Signed-off-by: René van Dorst <[email protected]>
Tested-by: Sean Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It looks like the P/Q/R/S series supports some more counters,
generically named "Ethernet statistics counter", which we were not
printing. Add them.
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Under certain circumstances, depending on the order of addresses on the
interfaces, it could be that sctp_v[46]_get_dst() would return a dst
with a mismatched struct flowi.
For example, if when walking through the bind addresses and the first
one is not a match, it saves the dst as a fallback (added in
410f03831c07), but not the flowi. Then if the next one is also not a
match, the previous dst will be returned but with the flowi information
for the 2nd address, which is wrong.
The fix is to use a locally stored flowi that can be used for such
attempts, and copy it to the parameter only in case it is a possible
match, together with the corresponding dst entry.
The patch updates IPv6 code mostly just to be in sync. Even though the issue
is also present there, it fallback is not expected to work with IPv6.
Fixes: 410f03831c07 ("sctp: add routing output fallback")
Reported-by: Jin Meng <[email protected]>
Signed-off-by: Marcelo Ricardo Leitner <[email protected]>
Tested-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Enable the L3 driver's IPv4 address notifier to watch for events on qeth
devices that have been moved into a net namespace. We need to program
those IPs into the HW just as usual, otherwise inbound traffic won't
flow.
Fixes: 6133fb1aa137 ("[NETNS]: Disable inetaddr notifiers in namespaces other than initial.")
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
We should iterate over the datamsgs to move
all chunks(skbs) to newsk.
The following case cause the bug:
for the trouble SKB, it was in outq->transmitted list
sctp_outq_sack
sctp_check_transmitted
SKB was moved to outq->sacked list
then throw away the sack queue
SKB was deleted from outq->sacked
(but it was held by datamsg at sctp_datamsg_to_asoc
So, sctp_wfree was not called here)
then migrate happened
sctp_for_each_tx_datachunk(
sctp_clear_owner_w);
sctp_assoc_migrate();
sctp_for_each_tx_datachunk(
sctp_set_owner_w);
SKB was not in the outq, and was not changed to newsk
finally
__sctp_outq_teardown
sctp_chunk_put (for another skb)
sctp_datamsg_put
__kfree_skb(msg->frag_list)
sctp_wfree (for SKB)
SKB->sk was still oldsk (skb->sk != asoc->base.sk).
Reported-and-tested-by: [email protected]
Signed-off-by: Qiujun Huang <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the
offset of the GSO in skb cb. This patch fixes the typo.
Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Cambda Zhu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
fib_triestat_seq_show() calls hlist_for_each_entry_rcu(tb, head,
tb_hlist) without rcu_read_lock() will trigger a warning,
net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by proc01/115277:
#0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670
Call Trace:
dump_stack+0xf4/0x164 (unreliable)
lockdep_rcu_suspicious+0x140/0x164
fib_triestat_seq_show+0x750/0x880
seq_read+0x1a0/0x670
proc_reg_read+0x10c/0x1b0
__vfs_read+0x3c/0x70
vfs_read+0xac/0x170
ksys_read+0x7c/0x140
system_call+0x5c/0x68
Fix it by adding a pair of rcu_read_lock/unlock() and use
cond_resched_rcu() to avoid the situation where walking of a large
number of items may prevent scheduling for a long time.
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Calling queue_delayed_work concurrently with
destroy_workqueue might race to an unexpected outcome -
scheduled task after wq is destroyed or other resources
(like ptt_pool) are freed (yields NULL pointer dereference).
cancel_delayed_work prevents the race by cancelling
the timer triggered for scheduling a new task.
Fixes: 59ccf86fe ("qed: Add driver infrastucture for handling mfw requests")
Signed-off-by: Denis Bolotin <[email protected]>
Signed-off-by: Michal Kalderon <[email protected]>
Signed-off-by: Yuval Basson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
page pool API can be useful for non-DMA cases like
xen-netfront driver so let's allow to pass zero flags to
page pool flags.
v2: check DMA direction only if PP_FLAG_DMA_MAP is set
Signed-off-by: Denis Kirjanov <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
For historical reasons, there are several timestamping selftest targets
in selftests/networking/timestamping. Move them to the standard
directory for networking tests: selftests/net.
Signed-off-by: Jian Yang <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Until now a PHY-fixup in mach-imx set our rgmii timing correctly. For
the PHY KSZ9131 there is no PHY-fixup in mach-imx. To support this PHY
too, use rgmii-id.
For the now used KSZ9031 nothing will change, as rgmii-id is only
implemented and supported by the KSZ9131.
Signed-off-by: Philippe Schenker <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The KSZ9131 provides DLL controlled delays on RXC and TXC lines. This
patch makes use of those delays. The information which delays should
be enabled or disabled comes from the interface names, documented in
ethernet-controller.yaml:
rgmii: Disable RXC and TXC delays
rgmii-id: Enable RXC and TXC delays
rgmii-txid: Enable only TXC delay, disable RXC delay
rgmii-rxid: Enable onlx RXC delay, disable TXC delay
Signed-off-by: Philippe Schenker <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch adds new netlink attribute to allow a user to (optionally)
specify the desired offload mode immediately upon MACSec link creation.
Separate iproute patch will be required to support this from user space.
Signed-off-by: Mark Starovoytov <[email protected]>
Signed-off-by: Igor Russkikh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The original patch didn't copy the ieee80211_is_data() condition
because on most drivers the management frames don't go through
this path. However, they do on iwlwifi/mvm, so we do need to keep
the condition here.
Cc: [email protected]
Fixes: ce2e1ca70307 ("mac80211: Check port authorization in the ieee80211_tx_dequeue() case")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Minor comment conflict in mac80211.
Signed-off-by: David S. Miller <[email protected]>
|
|
Store the conntrack counters to the conntrack entry in the
HW flowtable offload.
Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Add nf_ct_acct_add function to update the conntrack counter
with packets and bytes.
Signed-off-by: wenxu <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The bitmap set does not support for expressions, skip it from the
estimation step.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
If the global set expression definition mismatches the dynset
expression, then bail out.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Otherwise, nft_lookup might dereference an uninitialized pointer to the
element extension.
Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
When CONFIG_NF_CONNTRACK_MARK is not set, any CTA_MARK or CTA_MARK_MASK
in netlink message are not supported. We should return an error when one
of them is set, not both
Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL")
Signed-off-by: Romain Bellan <[email protected]>
Signed-off-by: Florent Fourcot <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
KP Singh says:
====================
** Motivation
Google does analysis of rich runtime security data to detect and thwart
threats in real-time. Currently, this is done in custom kernel modules
but we would like to replace this with something that's upstream and
useful to others.
The current kernel infrastructure for providing telemetry (Audit, Perf
etc.) is disjoint from access enforcement (i.e. LSMs). Augmenting the
information provided by audit requires kernel changes to audit, its
policy language and user-space components. Furthermore, building a MAC
policy based on the newly added telemetry data requires changes to
various LSMs and their respective policy languages.
This patchset allows BPF programs to be attached to LSM hooks This
facilitates a unified and dynamic (not requiring re-compilation of the
kernel) audit and MAC policy.
** Why an LSM?
Linux Security Modules target security behaviours rather than the
kernel's API. For example, it's easy to miss out a newly added system
call for executing processes (eg. execve, execveat etc.) but the LSM
framework ensures that all process executions trigger the relevant hooks
irrespective of how the process was executed.
Allowing users to implement LSM hooks at runtime also benefits the LSM
eco-system by enabling a quick feedback loop from the security community
about the kind of behaviours that the LSM Framework should be targeting.
** How does it work?
The patchset introduces a new eBPF (https://docs.cilium.io/en/v1.6/bpf/)
program type BPF_PROG_TYPE_LSM which can only be attached to LSM hooks.
Loading and attachment of BPF programs requires CAP_SYS_ADMIN.
The new LSM registers nop functions (bpf_lsm_<hook_name>) as LSM hook
callbacks. Their purpose is to provide a definite point where BPF
programs can be attached as BPF_TRAMP_MODIFY_RETURN trampoline programs
for hooks that return an int, and BPF_TRAMP_FEXIT trampoline programs
for void LSM hooks.
Audit logs can be written using a format chosen by the eBPF program to
the perf events buffer or to global eBPF variables or maps and can be
further processed in user-space.
** BTF Based Design
The current design uses BTF:
* https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
* https://lwn.net/Articles/803258
which allows verifiable read-only structure accesses by field names
rather than fixed offsets. This allows accessing the hook parameters
using a dynamically created context which provides a certain degree of
ABI stability:
// Only declare the structure and fields intended to be used
// in the program
struct vm_area_struct {
unsigned long vm_start;
} __attribute__((preserve_access_index));
// Declare the eBPF program mprotect_audit which attaches to
// to the file_mprotect LSM hook and accepts three arguments.
SEC("lsm/file_mprotect")
int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
unsigned long reqprot, unsigned long prot, int ret)
{
unsigned long vm_start = vma->vm_start;
return 0;
}
By relocating field offsets, BTF makes a large portion of kernel data
structures readily accessible across kernel versions without requiring a
large corpus of BPF helper functions and requiring recompilation with
every kernel version. The BTF type information is also used by the BPF
verifier to validate memory accesses within the BPF program and also
prevents arbitrary writes to the kernel memory.
The limitations of BTF compatibility are described in BPF Co-Re
(http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf, i.e. field
renames, #defines and changes to the signature of LSM hooks). This
design imposes that the MAC policy (eBPF programs) be updated when the
inspected kernel structures change outside of BTF compatibility
guarantees. In practice, this is only required when a structure field
used by a current policy is removed (or renamed) or when the used LSM
hooks change. We expect the maintenance cost of these changes to be
acceptable as compared to the design presented in the RFC.
(https://lore.kernel.org/bpf/[email protected]/).
** Usage Examples
A simple example and some documentation is included in the patchset.
In order to better illustrate the capabilities of the framework some
more advanced prototype (not-ready for review) code has also been
published separately:
* Logging execution events (including environment variables and
arguments)
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
* Detecting deletion of running executables:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c
* Detection of writes to /proc/<pid>/mem:
https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
We have updated Google's internal telemetry infrastructure and have
started deploying this LSM on our Linux Workstations. This gives us more
confidence in the real-world applications of such a system.
** Changelog:
- v8 -> v9:
https://lore.kernel.org/bpf/[email protected]/
* Fixed a selftest crash when CONFIG_LSM doesn't have "bpf".
* Added James' Ack.
* Rebase.
- v7 -> v8:
https://lore.kernel.org/bpf/[email protected]/
* Removed CAP_MAC_ADMIN check from bpf_lsm_verify_prog. LSMs can add it
in their own bpf_prog hook. This can be revisited as a separate patch.
* Added Andrii and James' Ack/Review tags.
* Fixed an indentation issue and missing newlines in selftest error
a cases.
* Updated a comment as suggested by Alexei.
* Updated the documentation to use the newer libbpf API and some other
fixes.
* Rebase
- v6 -> v7:
https://lore.kernel.org/bpf/[email protected]/
* Removed __weak from the LSM attachment nops per Kees' suggestion.
Will send a separate patch (if needed) to update the noinline
definition in include/linux/compiler_attributes.h.
* waitpid to wait specifically for the forked child in selftests.
* Comment format fixes in security/... as suggested by Casey.
* Added Acks from Kees and Andrii and Casey's Reviewed-by: tags to
the respective patches.
* Rebase
- v5 -> v6:
https://lore.kernel.org/bpf/[email protected]/
* Updated LSM_HOOK macro to define a default value and cleaned up the
BPF LSM hook declarations.
* Added Yonghong's Acks and Kees' Reviewed-by tags.
* Simplification of the selftest code.
* Rebase and fixes suggested by Andrii and Yonghong and some other minor
fixes noticed in internal review.
- v4 -> v5:
https://lore.kernel.org/bpf/[email protected]/
* Removed static keys and special casing of BPF calls from the LSM
framework.
* Initialized the BPF callbacks (nops) as proper LSM hooks.
* Updated to using the newly introduced BPF_TRAMP_MODIFY_RETURN
trampolines in https://lkml.org/lkml/2020/3/4/877
* Addressed Andrii's feedback and rebased.
- v3 -> v4:
* Moved away from allocating a separate security_hook_heads and adding a
new special case for arch_prepare_bpf_trampoline to using BPF fexit
trampolines called from the right place in the LSM hook and toggled by
static keys based on the discussion in:
https://lore.kernel.org/bpf/CAG48ez25mW+_oCxgCtbiGMX07g_ph79UOJa07h=o_6B6+Q-u5g@mail.gmail.com/
* Since the code does not deal with security_hook_heads anymore, it goes
from "being a BPF LSM" to "BPF program attachment to LSM hooks".
* Added a new test case which ensures that the BPF programs' return value
is reflected by the LSM hook.
- v2 -> v3 does not change the overall design and has some minor fixes:
* LSM_ORDER_LAST is introduced to represent the behaviour of the BPF LSM
* Fixed the inadvertent clobbering of the LSM Hook error codes
* Added GPL license requirement to the commit log
* The lsm_hook_idx is now the more conventional 0-based index
* Some changes were split into a separate patch ("Load btf_vmlinux only
once per object")
https://lore.kernel.org/bpf/[email protected]/
* Addressed Andrii's feedback on the BTF implementation
* Documentation update for using generated vmlinux.h to simplify
programs
* Rebase
- Changes since v1:
https://lore.kernel.org/bpf/[email protected]
* Eliminate the requirement to maintain LSM hooks separately in
security/bpf/hooks.h Use BPF trampolines to dynamically allocate
security hooks
* Drop the use of securityfs as bpftool provides the required
introspection capabilities. Update the tests to use the bpf_skeleton
and global variables
* Use O_CLOEXEC anonymous fds to represent BPF attachment in line with
the other BPF programs with the possibility to use bpf program pinning
in the future to provide "permanent attachment".
* Drop the logic based on prog names for handling re-attachment.
* Drop bpf_lsm_event_output from this series and send it as a separate
patch.
====================
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Document how eBPF programs (BPF_PROG_TYPE_LSM) can be loaded and
attached (BPF_LSM_MAC) to the LSM hooks.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Reviewed-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
* Load/attach a BPF program that hooks to file_mprotect (int)
and bprm_committed_creds (void).
* Perform an action that triggers the hook.
* Verify if the audit event was received using the shared global
variables for the process executed.
* Verify if the mprotect returns a -EPERM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Since BPF_PROG_TYPE_LSM uses the same attaching mechanism as
BPF_PROG_TYPE_TRACING, the common logic is refactored into a static
function bpf_program__attach_btf_id.
A new API call bpf_program__attach_lsm is still added to avoid userspace
conflicts if this ever changes in the future.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
* The hooks are initialized using the definitions in
include/linux/lsm_hook_defs.h.
* The LSM can be enabled / disabled with CONFIG_BPF_LSM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
JITed BPF programs are dynamically attached to the LSM hooks
using BPF trampolines. The trampoline prologue generates code to handle
conversion of the signature of the hook to the appropriate BPF context.
The allocated trampoline programs are attached to the nop functions
initialized as LSM hooks.
BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
and need CAP_SYS_ADMIN (required for loading eBPF programs).
Upon attachment:
* A BPF fexit trampoline is used for LSM hooks with a void return type.
* A BPF fmod_ret trampoline is used for LSM hooks which return an
int. The attached programs can override the return value of the
bpf LSM hook to indicate a MAC Policy decision.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
generated for each LSM hook. These functions are initialized as LSM
hooks in a subsequent patch.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The information about the different types of LSM hooks is scattered
in two locations i.e. union security_list_options and
struct security_hook_heads. Rather than duplicating this information
even further for BPF_PROG_TYPE_LSM, define all the hooks with the
LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all
the data structures required by the LSM framework.
The LSM hooks are defined as:
LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...)
with <default_value> acccessible in security.c as:
LSM_RET_DEFAULT(<hook_name>)
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Casey Schaufler <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.
Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Florent Revest <[email protected]>
Reviewed-by: Thomas Garnier <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: James Morris <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This adds a test to exercise the new bpf_map__set_initial_value() function.
The test simply overrides the global data section with all zeroes, and
checks that the new value makes it into the kernel map on load.
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
For internal maps (most notably the maps backing global variables), libbpf
uses an internal mmaped area to store the data after opening the object.
This data is subsequently copied into the kernel map when the object is
loaded.
This adds a function to set a new value for that data, which can be used to
before it is loaded into the kernel. This is especially relevant for RODATA
maps, since those are frozen on load.
Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.
Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|