|
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
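For illustration only, a minimal sketch of the approach (simplified names,
not the exact rpcrdma_inline_fixup() hunk): the tail is handled like the
head, by pointing iov_base into the receive buffer while iov_len is left
untouched so xdr_buf_trim() can still remove the GSS checksum.

/* Sketch, not the exact xprtrdma code: aim the tail at its bytes in
 * the receive buffer; do not shrink tail->iov_len. */
static void fixup_tail_sketch(struct rpc_rqst *rqst, char *srcp,
			      unsigned int tail_bytes)
{
	struct kvec *tail = &rqst->rq_rcv_buf.tail[0];

	if (tail_bytes)
		tail->iov_base = srcp;	/* bytes already in the recv buffer */
	/* tail->iov_len stays as set up by the upper layer */
}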
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.
However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.
The rl_segments array can therefore now be used just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.
This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.
This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.
ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.
This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.
As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.
Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.
Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.
This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.
FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
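A rough sketch of the on-demand pattern (the pool structure, field names
and the worker below are illustrative, not xprtrdma's actual code): keep a
small free list and, when a caller finds it empty, kick a worker that
allocates a few more MRs.

/* Hypothetical sketch of on-demand MR provisioning. */
struct mr_pool {
	spinlock_t		lock;
	struct list_head	free_mrs;
	struct work_struct	refresh_worker;	/* allocates a small batch */
};

static struct rpcrdma_mw *mr_pool_get(struct mr_pool *pool)
{
	struct rpcrdma_mw *mw = NULL;

	spin_lock(&pool->lock);
	if (!list_empty(&pool->free_mrs)) {
		mw = list_first_entry(&pool->free_mrs,
				      struct rpcrdma_mw, mw_list);
		list_del(&mw->mw_list);
	} else {
		/* ran dry: replenish asynchronously */
		schedule_work(&pool->refresh_worker);
	}
	spin_unlock(&pool->lock);
	return mw;
}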
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.
Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.
Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.
Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: Move device capability detection into memreg-specific
source files.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.
This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.
ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.
ALLPHYSICAL exposes the client to server bugs, including:
o base/bounds errors causing data outside the i/o buffer to be
accessed
o RDMA access after reply causing data corruption and/or integrity
failures
ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.
ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Based on code audit.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
I found that commit ead3f26e359e ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.
The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.
Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01e9
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.
However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.
This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (ie, until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).
The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.
Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: Moving these helpers in a separate patch makes later
patches more readable.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.
One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.
Since commit 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but
it didn't bother to remove the FMRs from that list once the call was
complete.
I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().
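A minimal sketch of that defensive logic (simplified from the shape of
fmr_op_unmap_sync(); the temporary list is illustrative): ib_unmap_fmr()
leaves the FMRs chained on the caller's list, so unlink them explicitly
once it returns.

/* Sketch: ib_unmap_fmr() does not unlink the FMRs it processed. */
static int unmap_fmr_batch(struct list_head *unmap_list)
{
	struct ib_fmr *fmr, *next;
	int rc;

	rc = ib_unmap_fmr(unmap_list);
	list_for_each_entry_safe(fmr, next, unmap_list, list)
		list_del_init(&fmr->list);	/* clean up after the call */
	return rc;
}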
Fixes: 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
If a socket filter truncates a UDP packet below the length of the UDP
header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
the kernel is configured that way) can easily be triggered by an
unprivileged user; it was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8
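For context, a hedged sketch of the kind of guard that avoids the crash
(not necessarily the exact upstream hunk): once the socket filter has run,
drop any skb that no longer holds a full UDP header before
skb_pull_rcsum() can be reached.

/* Sketch only: a filter may have trimmed the skb below the UDP header,
 * which would otherwise trip the BUG_ON in skb_pull_rcsum(). */
static bool udp_skb_truncated_by_filter(const struct sk_buff *skb)
{
	return unlikely(skb->len < sizeof(struct udphdr));
}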
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi <[email protected]>
Signed-off-by: Michal Kubecek <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Simon Wunderlich says:
====================
Here are a couple batman-adv bugfix patches, all by Sven Eckelmann:
- Fix possible NULL pointer dereference for vlan_insert_tag (two patches)
- Fix reference handling in some features, which may lead to reference
leaks or invalid memory access (four patches)
- Fix speedy join: DHCP packets handled by the gateway feature should
be sent with 4-address unicast instead of 3-address unicast to make
speedy join work. This fixes/speeds up DHCP assignment for clients
which join a mesh for the first time. (one patch)
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch corrects an off-by-one error in the DecodeQ931 function in
the nf_conntrack_h323 module. This error could result in reading off
the end of a Q.931 frame.
Signed-off-by: Toby DiPasquale <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:
====================
IPVS Updates for v4.8
please consider these enhancements to the IPVS. This alters the behaviour
of the "least connection" schedulers such that pre-established connections
are included in the active connection count. This avoids overloading
servers when a large number of new connections arrive in a short space of
time - e.g. when clients reconnect after a node or network failure.
====================
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
We can pass the netns pointer as a parameter to the functions that need
access to it. In base chains, I didn't find any remaining client of this
field, so let's remove it too.
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
If we want to use the ct packets expr and add a rule like the following:
# nft add rule filter input ct packets gt 1 counter
we will find that no packets hit it, because nf_conntrack_acct is
disabled by default. It will not work until we enable accounting manually
via "echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct".
This is not user friendly, so, like xt_connbytes does, enable
nf_conntrack_acct automatically when the user wants to use the ct
byte/packet expr.
Signed-off-by: Liping Zhang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
physdev_mt() checks skb->nf_bridge first, which is allocated in
br_nf_pre_routing(). So if we want to use --physdev-out and
--physdev-is-out, we need to match in the FORWARD or POSTROUTING chain.
physdev_mt_check() only checked --physdev-out and missed --physdev-is-out.
Fix it and update the debug message to make it clearer.
Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: Marcelo R Leitner <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
It used a fixed-size bucket list plus a single lock to protect add/del.
Unlike the main conntrack table, we only need to add and remove keys.
Convert it to rhashtable to get table autosizing and per-bucket locking.
The maximum number of entries is -- as before -- tied to the number of
conntracks, so we do not need another upper limit.
The change does not handle rhashtable_remove_fast() errors; the only
possible "error" is -ENOENT, and that is something that can happen
legitimately, e.g. because the nat module was inserted at a later time
and no src manip took place yet.
Tested with http-client-benchmark + httpterm with DNAT and SNAT rules
in place.
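A hedged sketch of the conversion pattern (the entry struct and parameter
values are illustrative, not the exact nf_nat code): describe the key and
rhash_head offsets once in rhashtable_params and let rhashtable provide
per-bucket locking and automatic resizing.

/* Illustrative entry: only the key/head layout matters here. */
struct bysource_entry {
	struct rhash_head	node;
	u32			key;	/* e.g. hash of the source tuple */
};

static const struct rhashtable_params bysource_params = {
	.head_offset		= offsetof(struct bysource_entry, node),
	.key_offset		= offsetof(struct bysource_entry, key),
	.key_len		= sizeof(u32),
	.automatic_shrinking	= true,
};

static int bysource_add(struct rhashtable *ht, struct bysource_entry *e)
{
	return rhashtable_insert_fast(ht, &e->node, bysource_params);
}

static void bysource_del(struct rhashtable *ht, struct bysource_entry *e)
{
	/* -ENOENT is the only possible error and is benign, as noted above */
	rhashtable_remove_fast(ht, &e->node, bysource_params);
}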
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:
====================
Second Round of IPVS Fixes for v4.7
The fix from Quentin Armitage allows the backup sync daemon to
be bound to a link-local mcast IPv6 address as is already the case
for IPv4.
====================
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
The nat extension structure is 32bytes in size on x86_64:
struct nf_conn_nat {
	struct hlist_node		bysource;	/*  0   16 */
	struct nf_conn *		ct;		/* 16    8 */
	union nf_conntrack_nat_help	help;		/* 24    4 */
	int				masq_index;	/* 28    4 */
	/* size: 32, cachelines: 1, members: 4 */
	/* last cacheline: 32 bytes */
};
The hlist is needed to quickly check for possible tuple collisions
when installing a new nat binding. Storing this in the extension
area has two drawbacks:
1. We need ct backpointer to get the conntrack struct from the extension.
2. When reallocation of extension area occurs we need to fixup the bysource
hash head via hlist_replace_rcu.
We can avoid both by placing the hlist_node in nf_conn and placing
nf_conn in the bysource hash rather than the extension.
We can also remove the ->move support; no other extension needs it.
Moving the entire nat extension into nf_conn would be possible as well,
but then we would have to add yet another callback for deletion from the
bysource hash table rather than just using the nat extension ->destroy
hook for this.
nf_conn size doesn't increase due to alignment; a followup patch replaces
the hlist_node with a single pointer.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
We don't need to acquire the bucket lock during early drop; we can
use lockless traversal just like ____nf_conntrack_find.
The timer deletion serves as the synchronization point: if another cpu
attempts to evict the same entry, only one will succeed with the timer
deletion.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
From: Liping Zhang <[email protected]>
Similar to ctnl_untimeout, when a hash resize has happened, we should try
to do the unhelp walk from the 0# bucket again.
Signed-off-by: Liping Zhang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Imagine such a situation: nf_conntrack_htable_size is now 4096, we are
doing ctnl_untimeout and are iterating on the 3000# bucket.
Meanwhile, another user tries to reduce the hash size to 2048, so all
nf_conn entries are moved to the new hashtable. When this hash resize
operation has finished, we still try to iterate over the conntracks
beginning from the 3000# bucket, find nothing to do and just return.
We may miss unlinking some timeout objects, and later we will end up with
invalid references to timeout objects that are already gone.
So when we find that hash resize happened, try to unlink timeout objects
from the 0# bucket again.
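A minimal sketch of that retry (untimeout_bucket() is a hypothetical
helper, not kernel API): remember the hash size the walk started with and
restart from bucket 0 if it changed underneath us.

static void untimeout_all_sketch(struct net *net, void *timeout)
{
	unsigned int i, hashsz;

again:
	hashsz = nf_conntrack_htable_size;
	for (i = 0; i < hashsz; i++)
		untimeout_bucket(net, i, timeout);	/* hypothetical */

	/* a concurrent resize moved every conntrack to a new table;
	 * walk again from the 0# bucket so nothing is missed */
	if (hashsz != nf_conntrack_htable_size)
		goto again;
}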
Signed-off-by: Liping Zhang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
Call Trace:
[<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
[<ffffffff81261a1c>] seq_read+0x2cc/0x390
[<ffffffff812a8d62>] proc_reg_read+0x42/0x70
[<ffffffff8123bee7>] __vfs_read+0x37/0x130
[<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
[<ffffffff8123cf75>] vfs_read+0x95/0x140
[<ffffffff8123e475>] SyS_read+0x55/0xc0
[<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
while : ; do
echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
done
2. open more shells and input the following cmds:
while : ; do
cat /proc/net/nf_conntrack
done
3. just wait a moment, the oops will happen soon.
The solution in this patch is based on Florian's commit 5e3c61f98175
("netfilter: conntrack: fix lookup race during hash resize"), and adds a
wrapper function, nf_conntrack_get_ht(), to get the hash and hsize, as
suggested by Florian Westphal.
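A hedged sketch of how such a wrapper can hand back a consistent pair
(treating nf_conntrack_generation as the seqcount that guards resizes is
an assumption here): sample the pointer and the size under one seqcount
read section so a reader never mixes a new table with an old size.

void nf_conntrack_get_ht(struct hlist_nulls_head **hash, unsigned int *hsize)
{
	struct hlist_nulls_head *hptr;
	unsigned int hsz, seq;

	do {
		seq = read_seqcount_begin(&nf_conntrack_generation);
		hsz = nf_conntrack_htable_size;
		hptr = nf_conntrack_hash;
	} while (read_seqcount_retry(&nf_conntrack_generation, seq));

	*hash = hptr;
	*hsize = hsz;
}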
Signed-off-by: Liping Zhang <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
When the target needs more time to process the received PDU, it sends a
Response Timeout Extension (RTOX) PDU.
When the initiator receives an RTOX PDU, it must reply with an RTOX PDU
and extend the current rwt value using the formula:
rwt_int = rwt * rtox
This patch takes care of the rtox value passed by the target in the RTOX
PDU and extends the timeout for the next response accordingly.
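A small illustration of the extension (the helper name is made up; only
the arithmetic is the point): the next response timeout simply becomes
rwt * rtox.

/* Illustrative only: rwt_int = rwt * rtox */
static unsigned long digital_rwt_extend_sketch(unsigned long rwt_ms, u8 rtox)
{
	return rwt_ms * rtox;
}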
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
When sending an ATR_REQ, the initiator must wait for the ATR_RES at
least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than
'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This
gives a timeout value between 1237 ms and 1337 ms. This patch defines
DIGITAL_ATR_RES_RWT as 1337, used as the timeout value of the ATR_REQ
command.
For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)'
and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by
the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value
of the TO field in the ATR_RES response and is in the range between 0
and 14. This patch declares a mapping table for wt values and gives RWT
max values between 100 ms and 5049 ms.
This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in
target mode, to 8.
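For illustration, the base RWT from the formula above at f(c) = 13.56 MHz
(this computes the formula only; the driver uses the precomputed mapping
table, whose values range from 100 ms to 5049 ms as described above).

/* RWT = (256 * 16 / f(c)) * 2^wt, wt in [0, 14] */
static unsigned int digital_rwt_us_sketch(unsigned int wt)
{
	/* 256 * 16 / 13.56 MHz is roughly 302 us */
	const unsigned int rwt_base_us = (256ULL * 16 * 1000000) / 13560000;

	if (wt > 14)
		wt = 14;
	return rwt_base_us << wt;	/* up to ~4.95 s at wt = 14 */
}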
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
This patch frees the RTOX resp sk_buff in initiator mode. It also makes
use of the free_resp exit point for ATN supervisor PDUs in both
initiator and target mode.
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
With this patch, ACK PDU sk_buffs are now freed and the code has been
refactored for better error handling.
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
When the target receives a NACK PDU, it re-sends the last sent PDU.
ACK PDUs are received by the target as a reply from the initiator to
chained I-PDUs. There are 3 cases to handle:
- If the target has previously received 1 or more ATN PDUs and the PNI
in the ACK PDU is equal to the target PNI - 1, then it means that the
initiator did not receive the last issued PDU from the target. In
this case it re-sends this PDU (see the sketch below).
- If the target has received 1 or more ATN PDUs but the ACK PNI is not
the target PNI - 1, then this means that this ACK is the reply to the
previous chained I-PDU sent by the target. The target did not receive
it on the first attempt and it is being re-sent by the initiator. The
process continues as usual.
- No ATN PDU received before this ACK PDU. This is the reply of a
chained I-PDU. The target keeps on processing its chained I-PDU.
The code has been refactored to avoid too many indentation levels.
Also, ACK and NACK PDUs were not freed. This is now fixed.
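A hedged sketch of the PNI check from the first case above (the struct and
field names are illustrative, assuming a 2-bit PNI counted modulo 4):

/* Not the real nfc/digital structures; only the comparison matters. */
struct ack_ctx {
	u8 atn_count;	/* ATN PDUs seen before this ACK */
	u8 curr_pni;	/* target's current packet number, 0..3 */
};

static bool ack_requires_resend(const struct ack_ctx *ctx, u8 ack_pni)
{
	/* ATN(s) seen and the ACK carries PNI == target PNI - 1:
	 * the initiator never got our last PDU, so re-send it */
	return ctx->atn_count && ack_pni == ((ctx->curr_pni - 1) & 0x3);
}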
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
When the initiator sends a DEP_REQ I-PDU, the target device may not
reply in a timely manner. In this case the initiator device must send an
attention PDU (ATN) and if the recipient replies with an ATN PDU in
return, then the last I-PDU must be sent again by the initiator.
This patch fixes how the target handles an I-PDU received after an ATN
PDU has been received.
There are 2 possible cases:
- The target has received the initial DEP_REQ and sends back the DEP_RES
but the initiator did not receive it. In this case, after the
initiator has sent an ATN PDU and the target has replied to it (with an
ATN as well), the initiator sends the saved skb of the initial DEP_REQ
again and the target replies with the saved skb of the initial
DEP_RES.
- Or the target did not even receive the initial DEP_REQ. In this case,
after the ATN PDUs exchange, the initiator sends the saved skb and the
target simply passes it up, just as usual.
This behavior is controlled using the atn_count and the PNI field of the
digital device structure.
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
When allocating chained I-PDUs, there is no need to call skb_reserve()
since it's already done by digital_alloc_skb() and contains enough room
for the driver head and tail data.
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
This patch fixes the way an I-PDU is saved in case it needs to be sent
again. It is now copied using pskb_copy() and not simply referenced
using skb_get() since it could be modified by the driver.
digital_in_send_saved_skb() and digital_tg_send_saved_skb() still get a
reference on the saved skb which is re-sent but release it if the send
operation fails. That way the caller doesn't have to take care about skb
ref in case of error.
RTOX supervisor PDU must not be saved as this can override a previously
saved I-PDU that should be re-sent later on.
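A small sketch of the save/re-send pattern described above (saved_skb and
send_skb() stand in for the real digital device field and transmit path):
keep a private, modifiable copy via pskb_copy(), take a reference for each
re-send, and drop that reference if the send fails.

static struct sk_buff *saved_skb;		/* illustrative holder */
static int send_skb(struct sk_buff *skb);	/* hypothetical transmit */

static int save_pdu_sketch(struct sk_buff *skb)
{
	kfree_skb(saved_skb);
	saved_skb = pskb_copy(skb, GFP_KERNEL);	/* private copy, not skb_get() */
	return saved_skb ? 0 : -ENOMEM;
}

static int resend_saved_pdu_sketch(void)
{
	int rc;

	skb_get(saved_skb);		/* hold a ref across the send */
	rc = send_skb(saved_skb);
	if (rc)
		kfree_skb(saved_skb);	/* release our ref on failure */
	return rc;
}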
Signed-off-by: Thierry Escande <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
|
|
In the prep work I did before enabling BH while handling the socket
backlog, I missed two points in DCCP:
1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BHs were
blocked. That is no longer always true.
2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
IP_INC_STATS()
A similar fix was done for TCP, in commit 47dcc20a39d0
("ipv4: tcp: ip_send_unicast_reply() is not BH safe")
Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible")
Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
All inet6_netconf_notify_devconf() callers are in process context,
so we can use GFP_KERNEL allocations if we take care not to hold
a rwlock where it is not needed in ip6mr (we hold RTNL there).
Fixes: d67b8c616b48 ("netconf: advertise mc_forwarding status")
Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding status")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Nicolas Dichtel <[email protected]>
Acked-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
inet_forward_change() runs with RTNL held.
We are allowed to sleep if required.
If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
we no longer have to use GFP_ATOMIC allocations in
inet_netconf_notify_devconf(), meaning we are less likely to miss
notifications under memory pressure, and won't touch precious memory
reserves either, nor risk dropping incoming packets.
inet_netconf_get_devconf() can also use GFP_KERNEL allocation.
Fixes: edc9e748934c ("rtnl/ipv4: use netconf msg to advertise forwarding status")
Fixes: 9e5511106f99 ("rtnl/ipv4: add support of RTM_GETNETCONF")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Nicolas Dichtel <[email protected]>
Acked-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
An important piece of information for the napi_poll tracepoint is the
work done (packets processed) by the napi_poll() call. Add
both the work done and the budget, as they are related.
Handle the trace_napi_poll() parameter change in dropwatch/drop_monitor
and in the python perf script netdev-times.py in a backward-compatible
way, as python fortunately supports optional parameter handling.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Allow MPLS routes on IPIP and SIT devices now that they
support forwarding MPLS packets.
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Dinan Gunawardena <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Extend the IPIP driver to support MPLS over IPv4. The implementation is an
extension of the existing support for IPv4 over IPv4 and is based on the
multiple inner-protocol support of the SIT driver.
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Dinan Gunawardena <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Extend the SIT driver to support MPLS over IPv4. This implementation
extends existing support for IPv6 over IPv4 and IPv4 over IPv4.
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Dinan Gunawardena <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Extend tunnel support to MPLS over IPv4. The implementation extends the
existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS
over IPv4.
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Dinan Gunawardena <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
As was suggested, this patch adds support for the different versions of
MLD and IGMP query types. Since the user-visible structure is still in
net-next, we can augment it instead of adding netlink attributes.
The distinction between the different IGMP/MLD query types is done as
suggested in Section 7.1, RFC 3376 [1] and Section 8.1, RFC 3810 [2] based
on query payload size and code for IGMP. Since all IGMP packets go through
multicast_rcv() and it uses ip_mc_check_igmp/ipv6_mc_check_mld we can be
sure that at least the ip/ipv6 header can be directly used.
[1] https://tools.ietf.org/html/rfc3376#section-7
[2] https://tools.ietf.org/html/rfc3810#section-8.1
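A hedged sketch of the payload-size rule from the cited sections (the
helpers below are illustrative, not the bridge code's exact functions): an
8-octet IGMP query is v1/v2 (v2 when the Max Resp Code is non-zero), 12
octets or more is v3; a 24-octet MLD query is MLDv1, 28 octets or more is
MLDv2.

enum { QUERY_V1, QUERY_V2, QUERY_V3, QUERY_UNKNOWN };

/* RFC 3376, Section 7.1: classify an IGMP query by payload length. */
static int igmp_query_version_sketch(const struct igmphdr *ih,
				     unsigned int igmp_len)
{
	if (igmp_len == sizeof(struct igmphdr))		/* 8 octets */
		return ih->code ? QUERY_V2 : QUERY_V1;
	if (igmp_len >= sizeof(struct igmpv3_query))	/* >= 12 octets */
		return QUERY_V3;
	return QUERY_UNKNOWN;
}

/* RFC 3810, Section 8.1: classify an MLD query by payload length. */
static int mld_query_version_sketch(unsigned int mld_len)
{
	if (mld_len == sizeof(struct mld_msg))		/* 24 octets */
		return QUERY_V1;
	if (mld_len >= sizeof(struct mld2_query))	/* >= 28 octets */
		return QUERY_V2;
	return QUERY_UNKNOWN;
}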
Suggested-by: Linus Lüssing <[email protected]>
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Acked-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|