aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2015-11-02net/core: generic support for disabling netdev features down stackJarod Wilson1-0/+50
There are some netdev features, which when disabled on an upper device, such as a bonding master or a bridge, must be disabled and cannot be re-enabled on underlying devices. This is a rework of an earlier more heavy-handed appraoch, which simply disables and prevents re-enabling of netdev features listed in a new define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. Any upper device that disables a flag in that feature mask, the disabling will propagate down the stack, and any lower device that has any upper device with one of those flags disabled should not be able to enable said flag. Initially, only LRO is included for proof of concept, and because this code effectively does the same thing as dev_disable_lro(), though it will also activate from the ethtool path, which was one of the goals here. [root@dell-per730-01 ~]# ethtool -k bond0 |grep large large-receive-offload: on [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large large-receive-offload: on [root@dell-per730-01 ~]# ethtool -K bond0 lro off [root@dell-per730-01 ~]# ethtool -k bond0 |grep large large-receive-offload: off [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large large-receive-offload: off dmesg dump: [ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2. [ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X IRQs: sp 74 fp[0] 76 ... fp[7] 83 [ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1. [ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X IRQs: sp 62 fp[0] 64 ... fp[7] 71 This has been successfully tested with bnx2x, qlcnic and netxen network cards as slaves in a bond interface. Turning LRO on or off on the master also turns it on or off on each of the slaves, new slaves are added with LRO in the same state as the master, and LRO can't be toggled on the slaves. Also, this should largely remove the need for dev_disable_lro(), and most, if not all, of its call sites can be replaced by simply making sure NETIF_F_LRO isn't included in the relevant device's feature flags. Note that this patch is driven by bug reports from users saying it was confusing that bonds and slaves had different settings for the same features, and while it won't be 100% in sync if a lower device doesn't support a feature like LRO, I think this is a good step in the right direction. CC: "David S. Miller" <[email protected]> CC: Eric Dumazet <[email protected]> CC: Jay Vosburgh <[email protected]> CC: Veaceslav Falico <[email protected]> CC: Andy Gospodarek <[email protected]> CC: Jiri Pirko <[email protected]> CC: Nikolay Aleksandrov <[email protected]> CC: Michal Kubecek <[email protected]> CC: Alexander Duyck <[email protected]> CC: [email protected] Signed-off-by: Jarod Wilson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02sit: fix sit0 percpu double allocationsEric Dumazet1-22/+4
sit0 device allocates its percpu storage twice : - One time in ipip6_tunnel_init() - One time in ipip6_fb_tunnel_init() Thus we leak 48 bytes per possible cpu per network namespace dismantle. ipip6_fb_tunnel_init() can be much simpler and does not return an error, and should be called after register_netdev() Note that ipip6_tunnel_clone_6rd() also needs to be called after register_netdev() (calling ipip6_tunnel_init()) Fixes: ebe084aafb7e ("sit: Use ipip6_tunnel_init as the ndo_init function.") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Cc: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02net: fix percpu memory leaksEric Dumazet5-18/+35
This patch fixes following problems : 1) percpu_counter_init() can return an error, therefore init_frag_mem_limit() must propagate this error so that inet_frags_init_net() can do the same up to its callers. 2) If ip[46]_frags_ns_ctl_register() fail, we must unwind properly and free the percpu_counter. Without this fix, we leave freed object in percpu_counters global list (if CONFIG_HOTPLUG_CPU) leading to crashes. This bug was detected by KASAN and syzkaller tool (http://github.com/google/syzkaller) Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Cc: Hannes Frederic Sowa <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Acked-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02libceph: clear msg->con in ceph_msg_release() onlyIlya Dryomov2-28/+20
The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: add nocephx_sign_messages optionIlya Dryomov3-1/+20
Support for message signing was merged into 3.19, along with nocephx_require_signatures option. But, all that option does is allow the kernel client to talk to clusters that don't support MSG_AUTH feature bit. That's pretty useless, given that it's been supported since bobtail. Meanwhile, if one disables message signing on the server side with "cephx sign messages = false", it becomes impossible to use the kernel client since it expects messages to be signed if MSG_AUTH was negotiated. Add nocephx_sign_messages option to support this use case. Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: stop duplicating client fields in messengerIlya Dryomov2-22/+10
supported_features and required_features serve no purpose at all, while nocrc and tcp_nodelay belong to ceph_options::flags. Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: drop authorizer check from cephx msg signing routinesIlya Dryomov1-4/+1
I don't see a way for auth->authorizer to be NULL in ceph_x_sign_message() or ceph_x_check_message_signature(). Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: msg signing callouts don't need con argumentIlya Dryomov2-8/+10
We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: evaluate osd_req_op_data() arguments only onceIoana Ciornei1-5/+7
This patch changes the osd_req_op_data() macro to not evaluate arguments more than once in order to follow the kernel coding style. Signed-off-by: Ioana Ciornei <[email protected]> Reviewed-by: Alex Elder <[email protected]> [[email protected]: changelog, formatting] Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: introduce ceph_x_authorizer_cleanup()Ilya Dryomov2-12/+20
Commit ae385eaf24dc ("libceph: store session key in cephx authorizer") introduced ceph_x_authorizer::session_key, but didn't update all the exit/error paths. Introduce ceph_x_authorizer_cleanup() to encapsulate ceph_x_authorizer cleanup and switch to it. This fixes ceph_x_destroy(), which currently always leaks key and ceph_x_build_authorizer() error paths. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Yan, Zheng <[email protected]>
2015-11-02libceph: use local variable cursor instead of &msg->cursorShraddha Barke1-6/+5
Use local variable cursor in place of &msg->cursor in read_partial_msg_data() and write_partial_msg_data(). Signed-off-by: Shraddha Barke <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02libceph: remove con argument in handle_reply()Shraddha Barke1-3/+2
Since handle_reply() does not use its con argument, remove it. Signed-off-by: Shraddha Barke <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2015-11-02Merge tag 'nfs-rdma-4.4-2' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust14-317/+923
NFS: NFSoRDMA Client Side Changes In addition to a variety of bugfixes, these patches are mostly geared at enabling both swap and backchannel support to the NFS over RDMA client. Signed-off-by: Anna Schumake <[email protected]>
2015-11-02ipv6: fix crash on ICMPv6 redirects with prohibited/blackholed sourceMatthias Schiffer1-2/+1
There are other error values besides ip6_null_entry that can be returned by ip6_route_redirect(): fib6_rule_action() can also result in ip6_blk_hole_entry and ip6_prohibit_entry if such ip rules are installed. Only checking for ip6_null_entry in rt6_do_redirect() causes ip6_ins_rt() to be called with rt->rt6i_table == NULL in these cases, making the kernel crash. Signed-off-by: Matthias Schiffer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02NFS: Enable client side NFSv4.1 backchannel to use other transportsChuck Lever4-0/+35
Forechannel transports get their own "bc_up" method to create an endpoint for the backchannel service. Signed-off-by: Chuck Lever <[email protected]> [Anna Schumaker: Add forward declaration of struct net to xprt.h] Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02net: make skb_set_owner_w() more robustEric Dumazet2-3/+23
skb_set_owner_w() is called from various places that assume skb->sk always point to a full blown socket (as it changes sk->sk_wmem_alloc) We'd like to attach skb to request sockets, and in the future to timewait sockets as well. For these kind of pseudo sockets, we need to take a traditional refcount and use sock_edemux() as the destructor. It is now time to un-inline skb_set_owner_w(), being too big. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <[email protected]> Bisected-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02bridge: vlan: Use rcu_dereference instead of rtnl_dereferenceIdo Schimmel1-1/+1
br_should_learn() is protected by RCU and not by RTNL, so use correct flavor of nbp_vlan_group(). Fixes: 907b1e6e83ed ("bridge: vlan: use proper rcu for the vlgrp member") Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Acked-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() ↵Ani Sinha1-3/+3
in preemptible context. Fixes the following kernel BUG : BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758 caller is __this_cpu_preempt_check+0x13/0x15 CPU: 0 PID: 2758 Comm: bash Tainted: P O 3.18.19 #2 ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000 0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800 ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8 Call Trace: [<ffffffff81482b2a>] dump_stack+0x52/0x80 [<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1 [<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15 [<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c [<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e [<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42 [<ffffffff810e6974>] ? pollwake+0x4d/0x51 [<ffffffff81058ac0>] ? default_wake_function+0x0/0xf [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42 [<ffffffff810613d9>] ? __wake_up_common+0x45/0x77 [<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32 [<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53 [<ffffffff8139a519>] ? sock_def_readable+0x71/0x75 [<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55 [<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41 [<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86 [<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d [<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b [<ffffffff810d5738>] ? new_sync_read+0x82/0xaa [<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99 [<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32 [<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f [<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf [<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3 [<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e Signed-off-by: Ani Sinha <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02bridge: vlan: Use correct flag name in commentIdo Schimmel1-3/+3
The flag used to indicate if a VLAN should be used for filtering - as opposed to context only - on the bridge itself (e.g. br0) is called 'brentry' and not 'brvlan'. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02bridge: vlan: Prevent possible use-after-freeIdo Schimmel1-0/+2
When adding a port to a bridge we initialize VLAN filtering on it. We do not bail out in case an error occurred in nbp_vlan_init, as it can be used as a non VLAN filtering bridge. However, if VLAN filtering is required and an error occurred in nbp_vlan_init, we should set vlgrp to NULL, so that VLAN filtering functions (e.g. br_vlan_find, br_get_pvid) will know the struct is invalid and will not try to access it. Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02tcp/dccp: fix ireq->pktopts raceEric Dumazet2-17/+17
IPv6 request sockets store a pointer to skb containing the SYN packet to be able to transfer it to full blown socket when 3WHS is done (ireq->pktopts -> np->pktoptions) As explained in commit 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions"), we must transfer the skb only if we won the hashdance race, if multiple cpus receive the 'ack' packet completing 3WHS at the same time. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02RDS: convert bind hash table to re-sizable hashtable[email protected]3-86/+56
To further improve the RDS connection scalabilty on massive systems where number of sockets grows into tens of thousands of sockets, there is a need of larger bind hashtable. Pre-allocated 8K or 16K table is not very flexible in terms of memory utilisation. The rhashtable infrastructure gives us the flexibility to grow the hashtbable based on use and also comes up with inbuilt efficient bucket(chain) handling. Reviewed-by: David Miller <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02net: rds: changing the return type from int to voidSaurabh Sengar1-4/+2
as result of function rds_iw_flush_mr_pool is nowhere checked, changing its return type from int to void. also removing the unused variable rc as there is nothing to return Signed-off-by: Saurabh Sengar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02ipv4: use l4 hash for locally generated multipath flowsPaolo Abeni1-1/+2
This patch changes how the multipath hash is computed for locally generated flows: now the hash comprises l4 information. This allows better utilization of the available paths when the existing flows have the same source IP and the same destination IP: with l3 hash, even when multiple connections are in place simultaneously, a single path will be used, while with l4 hash we can use all the available paths. v2 changes: - use get_hash_from_flowi4() instead of implementing just another l4 hash function Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-02SUNRPC: Remove the TCP-only restriction in bc_svc_process()Chuck Lever1-5/+0
Allow the use of other transport classes when handling a backward direction RPC call. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02svcrdma: Add backward direction service for RPC/RDMA transportChuck Lever2-0/+64
On NFSv4.1 mount points, the Linux NFS client uses this transport endpoint to receive backward direction calls and route replies back to the NFSv4.1 server. Signed-off-by: Chuck Lever <[email protected]> Acked-by: "J. Bruce Fields" <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Handle incoming backward direction RPC callsChuck Lever3-0/+161
Introduce a code path in the rpcrdma_reply_handler() to catch incoming backward direction RPC calls and route them to the ULP's backchannel server. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Add support for sending backward direction RPC repliesChuck Lever3-0/+51
Backward direction RPC replies are sent via the client transport's send_request method, the same way forward direction RPC calls are sent. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Pre-allocate Work Requests for backchannelChuck Lever3-2/+26
Pre-allocate extra send and receive Work Requests needed to handle backchannel receives and sends. The transport doesn't know how many extra WRs to pre-allocate until the xprt_setup_backchannel() call, but that's long after the WRs are allocated during forechannel setup. So, use a fixed value for now. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffersChuck Lever5-12/+309
xprtrdma's backward direction send and receive buffers are the same size as the forechannel's inline threshold, and must be pre- registered. The consumer has no control over which receive buffer the adapter chooses to catch an incoming backwards-direction call. Any receive buffer can be used for either a forward reply or a backward call. Thus both types of RPC message must all be the same size. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02SUNRPC: Abstract backchannel operationsChuck Lever2-2/+27
xprt_{setup,destroy}_backchannel() won't be adequate for RPC/RMDA bi-direction. In particular, receive buffers have to be pre- registered and posted in order to receive incoming backchannel requests. Add a virtual function call to allow the insertion of appropriate backchannel setup and destruction methods for each transport. In addition, freeing a backchannel request is a little different for RPC/RDMA. Introduce an rpc_xprt_op to handle the difference. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Saving IRQs no longer needed for rb_lockChuck Lever1-14/+10
Now that RPC replies are processed in a workqueue, there's no need to disable IRQs when managing send and receive buffers. This saves noticeable overhead per RPC. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Remove reply taskletChuck Lever1-43/+0
Clean up: The reply tasklet is no longer used. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Use workqueue to process RPC/RDMA repliesChuck Lever4-18/+65
The reply tasklet is fast, but it's single threaded. After reply traffic saturates a single CPU, there's no more reply processing capacity. Replace the tasklet with a workqueue to spread reply handling across all CPUs. This also moves RPC/RDMA reply handling out of the soft IRQ context and into a context that allows sleeps. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Replace send and receive arraysChuck Lever2-91/+73
The rb_send_bufs and rb_recv_bufs arrays are used to implement a pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep structs. Replace those arrays with free lists. To allow more than 512 RPCs in-flight at once, each of these arrays would be larger than a page (assuming 8-byte addresses and 4KB pages). Allowing up to 64K in-flight RPCs (as TCP now does), each buffer array would have to be 128 pages. That's an order-6 allocation. (Not that we're going there.) A list is easier to expand dynamically. Instead of allocating a larger array of pointers and copying the existing pointers to the new array, simply append more buffers to each list. This also makes it simpler to manage receive buffers that might catch backwards-direction calls, or to post receive buffers in bulk to amortize the overhead of ib_post_recv. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Refactor reply handler error handlingChuck Lever3-40/+53
Clean up: The error cases in rpcrdma_reply_handler() almost never execute. Ensure the compiler places them out of the hot path. No behavior change expected. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Prevent loss of completion signalsChuck Lever2-41/+38
Commit 8301a2c047cc ("xprtrdma: Limit work done by completion handler") was supposed to prevent xprtrdma's upcall handlers from starving other softIRQ work by letting them return to the provider before all CQEs have been polled. The logic assumes the provider will call the upcall handler again immediately if the CQ is re-armed while there are still queued CQEs. This assumption is invalid. The IBTA spec says that after a CQ is armed, the hardware must interrupt only when a new CQE is inserted. xprtrdma can't rely on the provider calling again, even though some providers do. Therefore, leaving CQEs on queue makes sense only when there is another mechanism that ensures all remaining CQEs are consumed in a timely fashion. xprtrdma does not have such a mechanism. If a CQE remains queued, the transport can wait forever to send the next RPC. Finally, move the wcs array back onto the stack to ensure that the poll array is always local to the CPU where the completion upcall is running. Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...") Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Re-arm after missed eventsChuck Lever1-56/+10
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive value if WCs were added to a CQ after the last completion upcall but before the CQ has been re-armed. Commit 7f23f6f6e388 ("xprtrmda: Reduce lock contention in completion handlers") assumed that when ib_req_notify_cq() returned a positive RC, the CQ had also been successfully re-armed, making it safe to return control to the provider without losing any completion signals. That is an invalid assumption. Change both completion handlers to continue polling while ib_req_notify_cq() returns a positive value. Fixes: 7f23f6f6e388 ("xprtrmda: Reduce lock contention in ...") Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Enable swap-on-NFS/RDMAChuck Lever1-1/+1
After adding a swapfile on an NFS/RDMA mount and removing the normal swap partition, I was able to push the NFS client well into swap without any issue. I forgot to swapoff the NFS file before rebooting. This pinned the NFS mount and the IB core and provider, causing shutdown to hang. I think this is expected and safe behavior. Probably shutdown scripts should "swapoff -a" before unmounting any filesystems. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: don't log warnings for flushed completionsSteve Wise1-2/+5
Unsignaled send WRs can get flushed as part of normal unmount, so don't log them as warnings. Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-01Use 64-bit timekeepingTina Ruchandani1-5/+6
This patch changes the use of struct timespec in dccp_probe to use struct timespec64 instead. timespec uses a 32-bit seconds field which will overflow in the year 2038 and beyond. timespec64 uses a 64-bit seconds field. Note that the correctness of the code isn't changed, since the original code only uses the timestamps to compute a small elapsed interval. This patch is part of a larger attempt to remove instances of 32-bit timekeeping structures (timespec, timeval, time_t) from the kernel so it is easier to identify where the real 2038 issues are. Signed-off-by: Tina Ruchandani <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv4: update RTNH_F_LINKDOWN flag on UP eventJulian Anastasov1-0/+7
When nexthop is part of multipath route we should clear the LINKDOWN flag when link goes UP or when first address is added. This is needed because we always set LINKDOWN flag when DEAD flag was set but now on UP the nexthop is not dead anymore. Examples when LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered: - link goes down (LINKDOWN is set), then link goes UP and device shows carrier OK but LINKDOWN remains set - last address is deleted (LINKDOWN is set), then address is added and device shows carrier OK but LINKDOWN remains set Steps to reproduce: modprobe dummy ifconfig dummy0 192.168.168.1 up here add a multipath route where one nexthop is for dummy0: ip route add 1.2.3.4 nexthop dummy0 nexthop SOME_OTHER_DEVICE ifconfig dummy0 down ifconfig dummy0 up now ip route shows nexthop that is not dead. Now set the sysctl var: echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown now ip route will show a dead nexthop because the forgotten RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD. Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops") Signed-off-by: Julian Anastasov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv4: fix to not remove local route on link downJulian Anastasov2-9/+15
When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event we should not delete the local routes if the local address is still present. The confusion comes from the fact that both fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN constant. Fix it by returning back the variable 'force'. Steps to reproduce: modprobe dummy ifconfig dummy0 192.168.168.1 up ifconfig dummy0 down ip route list table local | grep dummy | grep host local 192.168.168.1 dev dummy0 proto kernel scope host src 192.168.168.1 Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops") Signed-off-by: Julian Anastasov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01net: dsa: use switchdev obj for VLAN add/del opsVivien Didelot1-20/+9
Simplify DSA by pushing the switchdev objects for VLAN add and delete operations down to its drivers. Currently only mv88e6xxx is affected. Signed-off-by: Vivien Didelot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01VSOCK: define VSOCK_SS_LISTEN once onlyStefan Hajnoczi2-22/+19
The SS_LISTEN socket state is defined by both af_vsock.c and vmci_transport.c. This is risky since the value could be changed in one file and the other would be out of sync. Rename from SS_LISTEN to VSOCK_SS_LISTEN since the constant is not part of enum socket_state (SS_CONNECTED, ...). This way it is clear that the constant is vsock-specific. The big text reflow in af_vsock.c was necessary to keep to the maximum line length. Text is unchanged except for s/SS_LISTEN/VSOCK_SS_LISTEN/. Signed-off-by: Stefan Hajnoczi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01tipc: linearize arriving NAME_DISTR and LINK_PROTO buffersJon Paul Maloy1-0/+5
Testing of the new UDP bearer has revealed that reception of NAME_DISTRIBUTOR, LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE message buffers is not prepared for the case that those may be non-linear. We now linearize all such buffers before they are delivered up to the generic reception layer. In order for the commit to apply cleanly to 'net' and 'stable', we do the change in the function tipc_udp_recv() for now. Later, we will post a commit to 'net-next' moving the linearization to generic code, in tipc_named_rcv() and tipc_link_proto_rcv(). Fixes: commit d0f91938bede ("tipc: add ip/udp media type") Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragmentHannes Frederic Sowa1-4/+4
CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one of those warn about them once and handle them gracefully by recalculating the checksum. Fixes: commit 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets") See-also: commit 72e843bb09d45 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL") Cc: Eric Dumazet <[email protected]> Cc: Vlad Yasevich <[email protected]> Cc: Benjamin Coddington <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv6: no CHECKSUM_PARTIAL on MSG_MORE corked socketsHannes Frederic Sowa1-37/+33
We cannot reliable calculate packet size on MSG_MORE corked sockets and thus cannot decide if they are going to be fragmented later on, so better not use CHECKSUM_PARTIAL in the first place. The IPv6 code also intended to protect and not use CHECKSUM_PARTIAL in the existence of IPv6 extension headers, but the condition was wrong. Fix it up, too. Also the condition to check whether the packet fits into one fragment was wrong and has been corrected. Fixes: commit 32dce968dd987 ("ipv6: Allow for partial checksums on non-ufo packets") See-also: commit 72e843bb09d45 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL") Cc: Eric Dumazet <[email protected]> Cc: Vlad Yasevich <[email protected]> Cc: Benjamin Coddington <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv4: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragmentHannes Frederic Sowa1-3/+5
CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one of those warn about them once and handle them gracefully by recalculating the checksum. Cc: Eric Dumazet <[email protected]> Cc: Vlad Yasevich <[email protected]> Cc: Benjamin Coddington <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-11-01ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked socketsHannes Frederic Sowa1-0/+1
We cannot reliable calculate packet size on MSG_MORE corked sockets and thus cannot decide if they are going to be fragmented later on, so better not use CHECKSUM_PARTIAL in the first place. Cc: Eric Dumazet <[email protected]> Cc: Vlad Yasevich <[email protected]> Cc: Benjamin Coddington <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>