Age | Commit message (Collapse) | Author | Files | Lines |
|
As a preparation for removing dependency on rtnl lock from rules update
path, all users of shared objects must take reference while working with
them.
Extend action ops with put_dev() API to be used on net device returned by
get_dev().
Modify mirred action (only action that implements get_dev callback):
- Take reference to net device in get_dev.
- Implement put_dev API that releases reference to net device.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect vlan action private data from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)
Remove rtnl assertion that is no longer necessary.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf lock to protect tunnel key action struct private data from
concurrent modification in init and dump. Use rcu swap operation to
reassign params pointer under protection of tcf lock. (old params value is
not used by init, so there is no need of standalone rcu dereference step)
Remove rtnl lock assertion that is no longer required.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
spinlock to protect private skbmod data from concurrent modification during
dump.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect private simple action data from concurrent
modification during dump. (simple init already uses tcf spinlock when
changing action state)
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect private sample action data from concurrent
modification during dump and init.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Rearrange pedit init code to only access pedit action data while holding
tcf spinlock. Change keys allocation type to atomic to allow it to execute
while holding tcf spinlock. Take tcf spinlock in dump function when
accessing pedit action data.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect ipt action private data from concurrent
modification during dump. Ipt init already takes tcf spinlock when
modifying ipt state.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock and rcu to protect params pointer from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)
Ife action has meta-actions that are compiled as standalone modules. Rtnl
mutex must be released while loading a kernel module. In order to support
execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
loading functions. When requesting meta action module, conditionally
release rtnl lock depending on 'rtnl_held' argument.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect gact action private state from concurrent
modification during dump and init. Remove rtnl assertion that is no longer
necessary.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf lock to protect csum action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)
Remove rtnl assertion that is no longer necessary.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This path replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This enables to avoid memory allocation failures on the
systems under a memory stress.
Signed-off-by: Oleg Babin <[email protected]>
Signed-off-by: Konstantin Khorenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch introduces wrappers for accessing in/out streams indirectly.
This will enable to replace physically contiguous memory arrays
of streams with flexible arrays (or maybe any other appropriate
mechanism) which do memory allocation on a per-page basis.
Signed-off-by: Oleg Babin <[email protected]>
Signed-off-by: Konstantin Khorenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
this mechanism.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
manually every time.
While there, constify 'stats'.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
on session sockets. We just need to copy the stats and set ->session_id.
As a side effect of sharing session and tunnel code, ->using_ipsec is
properly set even when the request was made using a session socket.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
tunnel. This one is a bit special because the caller may use the tunnel
socket to retrieve statistics of one of its sessions. If the session_id
is set, the corresponding session's statistics are returned, instead of
those of the tunnel. This is handled by the new
pppol2tp_tunnel_copy_stats() helper function.
Set ->tunnel_id and ->using_ipsec out of the conditional, so
that it can be used by the 'else' branch in the following patch.
We cannot do that for ->session_id, because tunnel sockets have to
report the value that was originally passed in 'stats.session_id',
while session sockets have to report their own session_id.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
* Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could
not have called us.
* Drop test on 'SOCK_DEAD' state: if this flag was set, the socket
would be in the process of being released and no ioctl could be
running anymore.
* Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get
the session structure. If it is non-NULL, then the socket is
connected. Testing for PPPOX_* is redundant.
* Retrieve session using ->sk_user_data directly, instead of going
through pppol2tp_sock_to_session(). This avoids grabbing a useless
reference on the socket.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
l2tp_session_get() is used for two different purposes. If 'tunnel' is
NULL, the session is searched globally in the supplied network
namespace. Otherwise it is searched exclusively in the tunnel context.
Callers always know the context in which they need to search the
session. But some of them do provide both a namespace and a tunnel,
making the semantic of the call unclear.
This patch defines l2tp_tunnel_get_session() for lookups done in a
tunnel and restricts l2tp_session_get() to namespace searches.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Use helper function to figure out if a tunnel is using ipsec.
Also, avoid accessing ->sk_policy directly since it's RCU protected.
Signed-off-by: Guillaume Nault <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Previously commit 9aee40006190 ("tcp: ack immediately when a cwr
packet arrives") calls tcp_enter_quickack_mode to force sending
two immediate ACKs upon receiving a packet w/ CWR flag. The side
effect is it'll also reset the delayed ACK timer and interactive
session tracking. This patch removes that side effect by using the
new ACK_NOW flag to force an immmediate ACK.
Packetdrill to demonstrate:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.1 < [ect0] . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 < [ect0] . 1:1001(1000) ack 1 win 257
+0 > [ect01] . 1:1(0) ack 1001
+0 write(4, ..., 1) = 1
+0 > [ect01] P. 1:2(1) ack 1001
+0 < [ect0] . 1001:2001(1000) ack 2 win 257
+0 write(4, ..., 1) = 1
+0 > [ect01] P. 2:3(1) ack 2001
+0 < [ect0] . 2001:3001(1000) ack 3 win 257
+0 < [ect0] . 3001:4001(1000) ack 3 win 257
// Ack delayed ...
+.01 < [ce] P. 4001:4501(500) ack 3 win 257
+0 > [ect01] . 3:3(0) ack 4001
+0 > [ect01] E. 3:3(0) ack 4501
+.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501 win 100
+.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
// No delayed ACK on CWR flag
+0 > [ect01] . 4:4(0) ack 5501
+.31 < [ect0] . 5501:6501(1000) ack 4 win 257
+0 > [ect01] . 4:4(0) ack 6501
Fixes: 9aee40006190 ("tcp: ack immediately when a cwr packet arrives")
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
RFC 5681 sec 4.2:
To provide feedback to senders recovering from losses, the receiver
SHOULD send an immediate ACK when it receives a data segment that
fills in all or part of a gap in the sequence space.
When a gap is partially filled, __tcp_ack_snd_check already checks
the out-of-order queue and correctly send an immediate ACK. However
when a gap is fully filled, the previous implementation only resets
pingpong mode which does not guarantee an immediate ACK because the
quick ACK counter may be zero. This patch addresses this issue by
marking the one-time immediate ACK flag instead.
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The recent fix of acking immediately in DCTCP on CE status change
has an undesirable side-effect: it also resets TCP ack timer and
disables pingpong mode (interactive session). But the CE status
change has nothing to do with them. This patch addresses that by
using the new one-time immediate ACK flag instead of calling
tcp_enter_quickack_mode().
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add a new flag to indicate a one-time immediate ACK. This flag is
occasionaly set under specific TCP protocol states in addition to
the more common quickack mechanism for interactive application.
In several cases in the TCP code we want to force an immediate ACK
but do not want to call tcp_enter_quickack_mode() because we do
not want to forget the icsk_ack.pingpong or icsk_ack.ato state.
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The static int 'zero' is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
with commit a158bdd3247b ("rxrpc: Fix call timeouts").
Cleans up clang warning:
warning: 'zero' defined but not used [-Wunused-const-variable=]
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.
It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.
One level of "__reuseport_attach_prog()" call is removed. The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.
The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map. When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map. Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[]. The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]). The map_lookup_elem()
will be lockless.
This patch also adds an ID to sock_reuseport. A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map. It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk, this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).
The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow(). The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this
patch is a no-op.
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()". The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").
The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case. However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.
When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie. For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
was updated.
2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed
and then sk2 was added. Here, sk2 does not have ts_recent_stamp set.
There are other ordering that will trigger the similar situation
below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
"tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
all syncookies sent by sk1 will be handled (and rejected)
by sk2 while sk1 is still alive.
The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).
The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map. Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.
A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[] OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group. However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.
This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.
"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group. Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.
A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed. Before and
after the change is ~9Mpps in IPv4.
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
With SMC-D z/OS sends a test link signal every 10 seconds. Linux is
supposed to answer, otherwise the SMC-D connection breaks.
Signed-off-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:
====================
pull request: bluetooth-next 2018-08-10
Here's one more (most likely last) bluetooth-next pull request for the
4.19 kernel.
- Added support for MediaTek serial Bluetooth devices
- Initial skeleton for controller-side address resolution support
- Fix BT_HCIUART_RTL related Kconfig dependencies
- A few other minor fixes/cleanups
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains netfilter updates for your net-next tree:
1) Expose NFT_OSF_MAXGENRELEN maximum OS name length from the new OS
passive fingerprint matching extension, from Fernando Fernandez.
2) Add extension to support for fine grain conntrack timeout policies
from nf_tables. As preparation works, this patchset moves
nf_ct_untimeout() to nf_conntrack_timeout and it also decouples the
timeout policy from the ctnl_timeout object, most work done by
Harsha Sharma.
3) Enable connection tracking when conntrack helper is in place.
4) Missing enumeration in uapi header when splitting original xt_osf
to nfnetlink_osf, also from Fernando.
5) Fix a sparse warning due to incorrect typing in the nf_osf_find(),
from Wei Yongjun.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Add the definitions for LE address resolution enable HCI commands.
When the LE address resolution enable gets changed via HCI commands
make sure that flag gets updated.
Signed-off-by: Ankit Navik <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
|
|
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.
This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.
A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
avoid per-frame copy cost.
Signed-off-by: Toshiaki Makita <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.
v6:
- Fix sparse warning around EXPORT_SYMBOL.
Signed-off-by: Toshiaki Makita <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
This is needed for veth XDP which does skb_copy_expand()-like operation.
v2:
- Drop skb_copy_header part because it has already been exported now.
Signed-off-by: Toshiaki Makita <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.
The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.
For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.
Suggested-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Björn Töpel <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.
This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().
Acked-by: Jason Wang <[email protected]>
============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by swapper/0/1:
#0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
#1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
#2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
v2: move cpus_read_lock() out of __netif_set_xps_queue()
Cc: "Nambiar, Amritha" <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Currently the refcnt is never decremented in case the value is not 1.
Fix it by adding decrement in case the refcnt is not 1.
Reported-by: Vlad Buslov <[email protected]>
Fixes: f71e0ca4db18 ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Fixes the following sparse warning:
net/decnet/dn_route.c:407:30: warning: Using plain integer as NULL pointer
net/decnet/dn_route.c:1923:22: warning: Using plain integer as NULL pointer
Signed-off-by: YueHaibing <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
IPv6 GRO over GRE tap is not working while GRO is not set
over the native interface.
gro_list_prepare function updates the same_flow variable
of existing sessions to 1 if their mac headers match the one
of the incoming packet.
same_flow is used to filter out non-matching sessions and keep
potential ones for aggregation.
The number of bytes to compare should be the number of bytes
in the mac headers. In gro_list_prepare this number is set to
be skb->dev->hard_header_len. For GRE interfaces this hard_header_len
should be as it is set in the initialization process (when GRE is
created), it should not be overridden. But currently it is being overridden
by the value that is actually supposed to represent the needed_headroom.
Therefore, the number of bytes compared in order to decide whether the
the mac headers are the same is greater than the length of the headers.
As it's documented in netdevice.h, hard_header_len is the maximum
hardware header length, and needed_headroom is the extra headroom
the hardware may need.
hard_header_len is basically all the bytes received by the physical
till layer 3 header of the packet received by the interface.
For example, if the interface is a GRE tap then the needed_headroom
should be the total length of the following headers:
IP header of the physical, GRE header, mac header of GRE.
It is often used to calculate the MTU of the created interface.
This patch removes the override of the hard_header_len, and
assigns the calculated value to needed_headroom.
This way, the comparison in gro_list_prepare is really of
the mac headers, and if the packets have the same mac headers
the same_flow will be set to 1.
Performance testing: 45% higher bandwidth.
Measuring bandwidth of single-stream IPv4 TCP traffic over IPv6
GRE tap while GRO is not set on the native.
NIC: ConnectX-4LX
Before (GRO not working) : 7.2 Gbits/sec
After (GRO working): 10.5 Gbits/sec
Signed-off-by: Maria Pasechnik <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The ret is not modified after initalization, So just remove the variable
and return 0.
Signed-off-by: zhong jiang <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).
I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.
There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.
svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_args.pages, have the caller pass in the page list.
The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Simplify the error handling at the tail of recv_read_chunk() by
re-arranging rq_pages[] housekeeping and documenting it properly.
NB: In this path, svc_rdma_recvfrom returns zero. Therefore no
subsequent reply processing is done on the svc_rqstp, and thus the
rq_respages field does not need to be updated.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
svc_xprt_release() invokes svc_free_res_pages(), which releases
pages between rq_respages and rq_next_page.
Historically, the RPC/RDMA transport has set these two pointers to
be different by one, which means:
- one page gets released when svc_recv returns 0. This normally
happens whenever one or more RDMA Reads need to be dispatched to
complete construction of an RPC Call.
- one page gets released after every call to svc_send.
In both cases, this released page is immediately refilled by
svc_alloc_arg. There does not seem to be a reason for releasing this
page.
To avoid this unnecessary memory allocator traffic, set rq_next_page
more carefully.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|
|
Variables 'checksumlen','blocksize' and 'data' are being assigned,
but are never used, hence they are redundant and can be removed.
Fix the following warning:
net/sunrpc/auth_gss/gss_krb5_wrap.c:443:7: warning: variable ‘blocksize’ set but not used [-Wunused-but-set-variable]
net/sunrpc/auth_gss/gss_krb5_crypto.c:376:15: warning: variable ‘checksumlen’ set but not used [-Wunused-but-set-variable]
net/sunrpc/xprtrdma/svc_rdma.c:97:9: warning: variable ‘data’ set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <[email protected]>
Reviewed-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
|