aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-10-10ipv6: fix incorrect bitwise operator used on rt6i_flagsColin Ian King1-2/+2
The use of the | operator always leads to true which looks rather suspect to me. Fix this by using & instead to just check the RTF_CACHE entry bit. Detected by CoverityScan, CID#1457734, #1457747 ("Wrong operator used") Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10ipv6: fix dereference of rt6_ex before null check errorColin Ian King1-1/+3
Currently rt6_ex is being dereferenced before it is null checked hence there is a possible null dereference bug. Fix this by only dereferencing rt6_ex after it has been null checked. Detected by CoverityScan, CID#1457749 ("Dereference before null check") Fixes: 81eb8447daae ("ipv6: take care of rt6_stats") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Add a driver for Renesas uPD60620 and uPD60620A PHYsBernd Edlinger3-0/+115
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09vhost_net: do not stall on zerocopy depletionWillem de Bruijn1-9/+3
Vhost-net has a hard limit on the number of zerocopy skbs in flight. When reached, transmission stalls. Stalls cause latency, as well as head-of-line blocking of other flows that do not use zerocopy. Instead of stalling, revert to copy-based transmission. Tested by sending two udp flows from guest to host, one with payload of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The large flow is redirected to a netem instance with 1MBps rate limit and deep 1000 entry queue. modprobe ifb ip link set dev ifb0 up tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit tc qdisc add dev tap0 ingress tc filter add dev tap0 parent ffff: protocol ip \ u32 match ip dport 8000 0xffff \ action mirred egress redirect dev ifb0 Before the delay, both flows process around 80K pps. With the delay, before this patch, both process around 400. After this patch, the large flow is still rate limited, while the small reverts to its original rate. See also discussion in the first link, below. Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to send at 100% zerocopy. The limit in vhost_exceeds_maxpend must be carefully chosen. With vq->num >> 1, the flows remain correlated. This value happens to correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller fractions and ensure correctness also for much smaller values of vq->num, by testing the min() of both explicitly. See also the discussion in the second link below. Changes v1 -> v2 - replaced min with typed min_t - avoid unnecessary whitespace change Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09openvswitch: Add erspan tunnel support.William Tu2-1/+51
Add erspan netlink interface for OVS. Signed-off-by: William Tu <u9012063@gmail.com> Cc: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09once: switch to new jump label APIEric Biggers2-7/+7
Switch the DO_ONCE() macro from the deprecated jump label API to the new one. The new one is more readable, and for DO_ONCE() it also makes the generated code more icache-friendly: now the one-time initialization code is placed out-of-line at the jump target, rather than at the inline fallthrough case. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller184-1201/+1393
2017-10-09ipv6: use rcu_dereference_bh() in ipv6_route_seq_next()Wei Wang1-1/+1
This patch replaces rcu_deference() with rcu_dereference_bh() in ipv6_route_seq_next() to avoid the following warning: [ 19.431685] WARNING: suspicious RCU usage [ 19.433451] 4.14.0-rc3-00914-g66f5d6c #118 Not tainted [ 19.435509] ----------------------------- [ 19.437267] net/ipv6/ip6_fib.c:2259 suspicious rcu_dereference_check() usage! [ 19.440790] [ 19.440790] other info that might help us debug this: [ 19.440790] [ 19.444734] [ 19.444734] rcu_scheduler_active = 2, debug_locks = 1 [ 19.447757] 2 locks held by odhcpd/3720: [ 19.449480] #0: (&p->lock){+.+.}, at: [<ffffffffb1231f7d>] seq_read+0x3c/0x333 [ 19.452720] #1: (rcu_read_lock_bh){....}, at: [<ffffffffb1d2b984>] ipv6_route_seq_start+0x5/0xfd [ 19.456323] [ 19.456323] stack backtrace: [ 19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted 4.14.0-rc3-00914-g66f5d6c #118 [ 19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 19.465414] Call Trace: [ 19.466788] dump_stack+0x86/0xc0 [ 19.468358] lockdep_rcu_suspicious+0xea/0xf3 [ 19.470183] ipv6_route_seq_next+0x71/0x164 [ 19.471963] seq_read+0x244/0x333 [ 19.473522] proc_reg_read+0x48/0x67 [ 19.475152] ? proc_reg_write+0x67/0x67 [ 19.476862] __vfs_read+0x26/0x10b [ 19.478463] ? __might_fault+0x37/0x84 [ 19.480148] vfs_read+0xba/0x146 [ 19.481690] SyS_read+0x51/0x8e [ 19.483197] do_int80_syscall_32+0x66/0x15a [ 19.484969] entry_INT80_compat+0x32/0x50 [ 19.486707] RIP: 0023:0xf7f0be8e [ 19.488244] RSP: 002b:00000000ffa75d04 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ 19.491431] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 0000000008056068 [ 19.493886] RDX: 0000000000001000 RSI: 0000000008056008 RDI: 0000000000001000 [ 19.496331] RBP: 00000000000001ff R08: 0000000000000000 R09: 0000000000000000 [ 19.498768] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 19.501217] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Xiaolong Ye <xiaolong.ye@intel.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge branch 'ppc-bundle' (bundle from Michael Ellerman)Linus Torvalds2-2/+35
Merge powerpc transactional memory fixes from Michael Ellerman: "I figured I'd still send you the commits using a bundle to make sure it works in case I need to do it again in future" This fixes transactional memory state restore for powerpc. * bundle'd patches from Michael Ellerman: powerpc/tm: Fix illegal TM state in signal handler powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
2017-10-09Merge branch '40GbE' of ↵David S. Miller12-81/+107
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-10-09 This series contains updates to i40e and i40evf only. Jake fixes missed flag conversion from u64 to u32. Fixes a deafult ITR value issue where the driver defaults to an ITR value of half the expected value (in terms of minimum microseconds between interrupts). So fix this by changing the default values to be calculated using the ITR_REG_TO_USEC() macro which indicates that we are converting from the register units into microseconds. Updates the drivers to bump the tail in increments of 8 and double the number of descriptors we will bundle into one tail bump when receiving. With the recent kernel support for enabling XPS and QoS at the same time, we no longer need to worry about the number of traffic classes when enabling XPS. Lihong converts the use of hash_for_each() to hash_for_each_safe() to safely remove a hash entry. Adds a check for the return value for find_first_bit() in the case that it returns the size passed to search. Alan fixes a bug in which filters are erroneously removed if they are removed and then added again. So make sure that when adding a filter, if we find it already existed in our list, make sure it is not marked to be removed. Jayaprakash adds the retrying of PHY reads when the I2C is busy for a maximum period of 500ms. Rami fixes code comment typo. Stefano Brivio simplifies the code by removing the use of a local return code variable and simply return the results of the read function. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge branch '10GbE' of ↵David S. Miller7-89/+259
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 10GbE Intel Wired LAN Driver Updates 2017-10-09 This series contains updates to ixgbe only. Emil fixes an issue where the semaphore bits could be stuck after a reset or a crash, by adding the clearing of software resource bits in the software/firmware synchronization register. Added error checks when we attempt to identify and initialize the PHY to prevent a crash. Fixed a few issues in the logic of ixgbe_clean_test_rings() which was exposed by a previous commit that was causing a crash in ethtool diagnostics. Bhumika Goyal fixes a couple of instances which were overlooked when we made ixgbe_mac_operations constant. Shannon Nelson fixes an issue to restore normal operations after the last MACVLAN offload is removed, otherwise we get stuck in a single queue operations. The infamous Jesper Dangaard Brouer adds a counter which counts the number of times the recycle fails and the real page allocator is invoked. Alex updates the adaptive ITR algorithm to better support the needs of the network. This attempt to make it so that our ITR algorithm will try to prevent either starving a socket buffer for memory in the case of transmit, or overrunning an receive socket buffer on receive. We should function better with new features like XDP which can handle small packets at high rates without needing to lock us into NAPI polling mode. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds54-164/+211
Pull networking fixes from David Miller: 1) Fix object leak on IPSEC offload failure, from Steffen Klassert. 2) Fix range checks in ipset address range addition operations, from Jozsef Kadlecsik. 3) Fix pernet ops unregistration order in ipset, from Florian Westphal. 4) Add missing netlink attribute policy for nl80211 packet pattern attrs, from Peng Xu. 5) Fix PPP device destruction race, from Guillaume Nault. 6) Write marks get lost when BPF verifier processes R1=R2 register assignments, causing incorrect liveness information and less state pruning. Fix from Alexei Starovoitov. 7) Fix blockhole routes so that they are marked dead and therefore not cached in sockets, otherwise IPSEC stops working. From Steffen Klassert. 8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits) cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan net: thunderx: mark expected switch fall-throughs in nicvf_main() udp: fix bcast packet reception netlink: do not set cb_running if dump's start() errs ipv4: Fix traffic triggered IPsec connections. ipv6: Fix traffic triggered IPsec connections. ixgbe: incorrect XDP ring accounting in ethtool tx_frame param net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag Revert commit 1a8b6d76dc5b ("net:add one common config...") ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register ixgbe: Return error when getting PHY address if PHY access is not supported netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook tipc: Unclone message at secondary destination lookup tipc: correct initialization of skb list gso: fix payload length when gso_size is zero mlxsw: spectrum_router: Avoid expensive lookup during route removal bpf: fix liveness marking doc: Fix typo "8023.ad" in bonding documentation ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real ...
2017-10-09cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwanAleksander Morgado1-0/+13
The u-blox TOBY-L2 is a LTE Cat 4 module with HSPA+ and 2G fallback. This module allows switching to different USB profiles with the 'AT+UUSBCONF' command, and provides a ECM network interface when the 'AT+UUSBCONF=2' profile is selected. The u-blox SARA-U2 is a HSPA module with 2G fallback. The default USB configuration includes a ECM network interface. Both these modules are controlled via AT commands through one of the TTYs exposed. Connecting these modules may be done just by activating the desired PDP context with 'AT+CGACT=1,<cid>' and then running DHCP on the ECM interface. Signed-off-by: Aleksander Morgado <aleksander@aleksander.es> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09i40e: Avoid some useless variables and initializers in NVM functionsStefano Brivio1-13/+7
Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM update") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: fix a typoRami Rosen1-1/+1
This patch fixes a typo in i40e_vsi_alloc_arrays() documentation. The first parameter name should be "vsi" instead of "type". Signed-off-by: Rami Rosen <rami.rosen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: use a local variable instead of calculating multiple timesLihong Yang1-13/+7
The computed result of I40E_MAX_VSI_QP * I40E_VIRTCHNL_SUPPORTED_QTYPES is used more than three times in function i40e_config_irq_link_list. Simply declare a local variable to store it to improve readability. Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: Retry AQC GetPhyAbilities to overcome I2CRead hangsJayaprakash Shanmugam3-13/+35
- When the I2C is busy, the PHY reads are delayed. The firmware will return EGAIN in these cases with an expectation that the SW will trigger the reads again - This patch retries the operation for a maximum period of 500ms Signed-off-by: Jayaprakash Shanmugam <jayaprakash.shanmugam@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: add check for return from find_first_bit callLihong Yang1-0/+4
The find_first_bit function will return the size passed to search if the first set bit is not found. This patch adds the check in case that happens as the return value would be used as the index in an array and that would have caused the out-of-bounds access. Detected by CoverityScan, CID 1295969 Out-of-bounds access Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: allow XPS with QoS enabledJacob Keller1-11/+6
Recently, the kernel gained support for enabling XPS and QoS at the same time. Thus, we no longer need to worry about the number of traffic classes when enabling XPS. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e/i40evf: bundle more descriptors when allocating buffersJacob Keller2-2/+2
Double the number of descriptors we'll bundle into one tail bump when receiving. Empirical testing has shown that we reduce CPU utilization and don't appear to reduce throughput or packet rate. 32 seems to be the sweet spot, as it's half the default polling budget, so we'd essentially reduce from 4 tail writes when polling down to 2. Increasing this up to 64 appears to have negative impacts as it may become possible that we don't bump the tail each time we get polled, which could cause a long delay between returning descriptors to the hardware. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e/i40evf: bump tail only in multiples of 8Jacob Keller2-0/+18
Hardware only fetches descriptors on cachelines of 8, essentially ignoring the lower 3 bits of the tail register. Thus, it is pointless to bump tail by an unaligned access as the hardware will ignore some of the new descriptors we allocated. Thus, it's ideal if we can ensure tail writes are always aligned to 8. At first, it seems like we'd already do this, since we allocate descriptors in batches which are a multiple of 8. Since we'd always increment by a multiple of 8, it seems like the value should always be aligned. However, this ignores allocation failures. If we fail to allocate a buffer, our tail register will become unaligned. Once it has become unaligned it will essentially be stuck unaligned until a buffer allocation happens to fail at the exact amount necessary to re-align it. We can do better, by simply rounding down the number of buffers we're about to allocate (cleaned_count) such that "next_to_clean + cleaned_count" is rounded to the nearest multiple of 8. We do this by calculating how far off that value is and subtracting it from the cleaned_count. This essentially defers allocation of buffers if they're going to be ignored by hardware anyways, and re-aligns our next_to_use and tail values after a failure to allocate a descriptor. This calculation ensures that we always align the tail writes in a way the hardware expects and don't unnecessarily allocate buffers which won't be fetched immediately. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: reduce lrxqthresh from 2 to 1Jacob Keller2-2/+2
The lrxq thresh value tells hardware to immediately interrupt when there are fewer than N*64 packets left in the ring. Counter intuitively, empirical testing has shown that decreasing this value from 2 to 1, and thus changing from an immediate interrupt at fewer than 128 descriptors down to 64 descriptors causes a small increase in the maximum total packets per second we can receive. This increase occurs even when we're polling with interrupts masked, as the hardware must still handle interrupts internally even if we've disabled them in software. Also reduce the value for any VFs we allocate. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e/i40evf: always set the CLEARPBA flag when re-enabling interruptsJacob Keller5-18/+10
In the past we changed driver behavior to not clear the PBA when re-enabling interrupts. This change was motivated by the flawed belief that clearing the PBA would cause a lost interrupt if a receive interrupt occurred while interrupts were disabled. According to empirical testing this isn't the case. Additionally, the data sheet specifically says that we should set the CLEARPBA bit when re-enabling interrupts in a polling setup. This reverts commit 40d72a509862 ("i40e/i40evf: don't lose interrupts") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e/i40evf: fix incorrect default ITR values on driver loadJacob Keller4-8/+12
The ITR register expects to be programmed in units of 2 microseconds. Because of this, all of the drivers I40E_ITR_* constants are in terms of this 2 microsecond register. Unfortunately, the rx_itr_default value is expected to be programmed in microseconds. Effectively the driver defaults to an ITR value of half the expected value (in terms of minimum microseconds between interrupts). Fix this by changing the default values to be calculated using ITR_REG_TO_USEC macro which indicates that we're converting from the register units into microseconds. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40evf: fix mac filter removal timing issueAlan Brady1-0/+2
Due to the asynchronous nature in which mac filters are added and deleted, there exists a bug in which filters are erroneously removed if removed then added again quickly. The events are as such: - filter marked for removal - same filter is re-added before watchdog that cleans up filters - we skip re-adding the filter because we have it already in the list - watchdog filter cleanup kicks off and filter is removed So when we were re-adding the same filter, it didn't actually get added because it already existed in the list, but was marked for removal and had yet to actually be removed. This patch fixes the issue by making sure that when adding a filter, if we find it already existing in our list, make sure it is not marked to be removed. Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: use the safe hash table iterator when deleting mac filtersLihong Yang1-1/+2
This patch replaces hash_for_each function with hash_for_each_safe when calling __i40e_del_filter. The hash_for_each_safe function is the right one to use when iterating over a hash table to safely remove a hash entry. Otherwise, incorrect values may be read from freed memory. Detected by CoverityScan, CID 1402048 Read from pointer after free Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09i40e: fix flags declarationJacob Keller1-1/+1
Since we don't yet have more than 32 flags, we'll use a u32 for both the hw_features and flag field. Should we gain more flags in the future, we may need to convert to a u64 or separate flags out into two fields. This was overlooked in the previous commit 2781de2134c4 ("i40e/i40evf: organize and re-number feature flags"), where the feature flag was not converted form u64 to u32. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds6-8/+8
Pull NFS client bugfixes from Trond Myklebust: "Hightlights include: stable fixes: - nfs/filelayout: fix oops when freeing filelayout segment - NFS: Fix uninitialized rpc_wait_queue bugfixes: - NFSv4/pnfs: Fix an infinite layoutget loop - nfs: RPC_MAX_AUTH_SIZE is in bytes" * tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4/pnfs: Fix an infinite layoutget loop nfs/filelayout: fix oops when freeing filelayout segment sunrpc: remove redundant initialization of sock NFS: Fix uninitialized rpc_wait_queue NFS: Cleanup error handling in nfs_idmap_request_key() nfs: RPC_MAX_AUTH_SIZE is in bytes
2017-10-09Merge branch 'ipv6-addrlabel-avoid-dirtying-ip6addrlbl_entry'David S. Miller1-52/+17
Eric Dumazet says: ==================== ipv6: addrlabel: avoid dirtying ip6addrlbl_entry The refcount on ip6addrlbl_entry is only used to make sure ip6addrlbl_entry does not disappear while ip6addrlbl_get() is allocating an skb. We can instead allocate skb first, then use RCU, so that we no longer need to refcount these structures. ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09ipv6: addrlabel: remove refcountingEric Dumazet1-29/+4
After previous patch ("ipv6: addrlabel: rework ip6addrlbl_get()") we can remove the refcount from struct ip6addrlbl_entry, since it is no longer elevated in p6addrlbl_get() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09ipv6: addrlabel: rework ip6addrlbl_get()Eric Dumazet1-23/+13
If we allocate skb before the lookup, we can use RCU without the need of ip6addrlbl_hold() This means that the following patch can get rid of refcounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09net: thunderx: mark expected switch fall-throughs in nicvf_main()Gustavo A. R. Silva1-0/+2
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Robert Richter <rric@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: netdev@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller26-64/+107
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim Fedorenko. 2) Fix splat with mark restoration in xt_socket with non-full-sock, patch from Subash Abhinov Kasiviswanathan. 3) ipset bogusly bails out when adding IPv4 range containing more than 2^31 addresses, from Jozsef Kadlecsik. 4) Incorrect pernet unregistration order in ipset, from Florian Westphal. 5) Races between dump and swap in ipset results in BUG_ON splats, from Ross Lagerwall. 6) Fix chain renames in nf_tables, from JingPiao Chen. 7) Fix race in pernet codepath with ebtables table registration, from Artem Savkov. 8) Memory leak in error path in set name allocation in nf_tables, patch from Arvind Yadav. 9) Don't dump chain counters if they are not available, this fixes a crash when listing the ruleset. 10) Fix out of bound memory read in strlcpy() in x_tables compat code, from Eric Dumazet. 11) Make sure we only process TCP packets in SYNPROXY hooks, patch from Lin Zhang. 12) Cannot load rules incrementally anymore after xt_bpf with pinned objects, added in revision 1. From Shmulik Ladkani. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge branch '10GbE' of ↵David S. Miller6-54/+13
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2017-10-09 This series contains updates to ixgbe and arch/Kconfig. Mark fixes a case where PHY register access is not supported and we were returning a PHY address, when we should have been returning -EOPNOTSUPP. Sabrina Dubroca fixes the use of a logical "and" when it should have been the bitwise "and" operator. Ding Tianhong reverts the commit that added the Kconfig bool option ARCH_WANT_RELAX_ORDER, since there is now a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING that has been added to indicate that Relaxed Ordering Attributes should not be used for Transaction Layer Packets. Then follows up with making the needed changes to ixgbe to use the new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag. John Fastabend fixes an issue in the ring accounting when the transmit ring parameters are changed via ethtool when an XDP program is attached. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge branch 'mlx4-static-checker-warnings'David S. Miller5-7/+7
Tariq Toukan says: ==================== Fix mlx4 static checker warnings This patchset contains fixes for static checker warnings in the mlx4 Core and Eth drivers. Patch 1 fixes an actual bug discovered by the checker. Patches 2 and 3 fix the warnings without functional changes. Series generated against net-next commit: c49c777f9c87 qed: Delete redundant check on dcb_app priority ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09net/mlx4_en: Use __force to fix a sparse warning in TX datapathTariq Toukan1-1/+1
In TX data-path, we intentionally do not byte-swap, as documented in code and in the cited commit log. This fixes sparse warning: en_tx.c:720:23: warning: incorrect type in argument 1 (different base types) en_tx.c:720:23: expected unsigned int [unsigned] [usertype] <noident> en_tx.c:720:23: got restricted __be32 [usertype] doorbell_qpn Fixes: 492f5add4be8 ("net/mlx4_en: Doorbell is byteswapped in Little Endian archs") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09net/mlx4_core: Fix cast warning in fw.cTariq Toukan1-3/+3
Fix the following SPARSE warning, in MLX4_GET() macro: drivers/net/ethernet/mellanox/mlx4/fw.c:233:9: warning: cast to restricted __be64 Fixes: 17d5ceb6e43e ("net/mlx4_core: Fix unaligned accesses") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09net/mlx4: Fix endianness issue in qp context paramsTariq Toukan3-3/+3
Should take care of the endianness before assigning to params2 field. Fixes: 53f33ae295a5 ("net/mlx4_core: Port aggregation upper layer interface") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09thunderbolt: Initialize Thunderbolt bus earlierMika Westerberg2-1/+4
The 0day kbuild robot reports following crash: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: tb_property_find+0xe/0x41 *pde = 00000000 Oops: 0000 [#1] CPU: 0 PID: 1 Comm: swapper Not tainted 4.14.0-rc1-00741-ge69b6c0 #412 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 task: 89c80000 task.stack: 89c7c000 EIP: tb_property_find+0xe/0x41 EFLAGS: 00210246 CPU: 0 EAX: 00000000 EBX: 7a368f47 ECX: 00000044 EDX: 7a368f47 ESI: 8851d340 EDI: 7a368f47 EBP: 89c7df0c ESP: 89c7defc DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 CR0: 80050033 CR2: 00000004 CR3: 027a2000 CR4: 00000690 Call Trace: tb_register_property_dir+0x49/0xb9 ? cdc_mbim_driver_init+0x1b/0x1b tbnet_init+0x77/0x9f ? cdc_mbim_driver_init+0x1b/0x1b do_one_initcall+0x7e/0x145 ? parse_args+0x10c/0x1b3 ? kernel_init_freeable+0xbe/0x159 kernel_init_freeable+0xd1/0x159 ? rest_init+0x110/0x110 kernel_init+0xd/0xd0 ret_from_fork+0x19/0x30 The reason is that both Thunderbolt bus and thunderbolt-net are build into the kernel image, and the latter is linked first because drivers/net comes before drivers/thunderbolt. Since both use module_init() thunderbolt-net ends up calling Thunderbolt bus functions too early triggering the above crash. Fix this by moving Thunderbolt bus initialization to happen earlier to make sure all the data structures are ready when Thunderbolt service drivers are initialized. To be on the safe side also add a check for properly initialized xdomain_property_dir to tb_register_property_dir(). Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09ipv6: avoid zeroing per cpu data againEric Dumazet1-11/+1
per cpu allocations are already zeroed, no need to clear them again. Fixes: d52d3997f843f ("ipv6: Create percpu rt6_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09udp: fix bcast packet receptionPaolo Abeni1-9/+5
The commit bc044e8db796 ("udp: perform source validation for mcast early demux") does not take into account that broadcast packets lands in the same code path and they need different checks for the source address - notably, zero source address are valid for bcast and invalid for mcast. As a result, 2nd and later broadcast packets with 0 source address landing to the same socket are dropped. This breaks dhcp servers. Since we don't have stringent performance requirements for ingress broadcast traffic, fix it by disabling UDP early demux such traffic. Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09netlink: do not set cb_running if dump's start() errsJason A. Donenfeld1-6/+7
It turns out that multiple places can call netlink_dump(), which means it's still possible to dereference partially initialized values in dump() that were the result of a faulty returned start(). This fixes the issue by calling start() _before_ setting cb_running to true, so that there's no chance at all of hitting the dump() function through any indirect paths. It also moves the call to start() to be when the mutex is held. This has the nice side effect of serializing invocations to start(), which is likely desirable anyway. It also prevents any possible other races that might come out of this logic. In testing this with several different pieces of tricky code to trigger these issues, this commit fixes all avenues that I'm aware of. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09Merge branch 'qed-Add-iWARP-support-for-unaligned-MPA-packets'David S. Miller5-20/+822
Michal Kalderon says: ==================== qed: Add iWARP support for unaligned MPA packets This patch series adds support for handling unaligned MPA packets. (FPDUs split over more than one tcp packet). When FW detects a packet is unaligned it fowards the packet to the driver via a light l2 dedicated connection. The driver then stores this packet until the remainder of the packet is received. Once the driver reconstructs the full FPDU, it sends it down to fw via the ll2 connection. Driver also breaks down any packed PDUs into separate packets for FW. Patches 1-6 are all slight modifications to ll2 to support additional requirements for the unaligned MPA ll2 client. Patch 7 opens the additional ll2 connection for iWARP. Patches 8-12 contain the algorithm for aligning packets. ==================== Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add iWARP support for fpdu spanned over more than two tcp packetsMichal Kalderon2-0/+194
We continue to maintain a maximum of three buffers per fpdu, to ensure that there are enough buffers for additional unaligned mpa packets. To support this, if a fpdu is split over more than two tcp packets, we use an intermediate buffer to copy the data to the previous buffer, then we can release the data. We need an intermediate buffer as the initial buffer partial packet could be located at the end of the packet, not leaving room for additional data. This is a corner case, and will usually not be the case. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add support for MPA header being split over two tcp packetsMichal Kalderon2-1/+41
There is a special case where an MPA header is split over to tcp packets, in this case we need to wait for the next packet to get the fpdu length. We use the incomplete_bytes to mark this fpdu as a "special" one which requires updating the length with the next packet Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add support for freeing two ll2 buffers for corner casesMichal Kalderon2-0/+26
When posting a packet on the ll2 tx, we can provide a cookie that will be returned upon tx completion. This cookie is the ll2 iwarp buffer which is then reposted to the rx ring. Part of the unaligned mpa flow is determining when a buffer can be reposted. Each buffer needs to be sent only once as a cookie for on the tx ring. In packed fpdu case, only the last packet will be sent with the buffer, meaning we need to handle the case that a cookie can be NULL on tx complete. In addition, when a fpdu splits over two buffers, but there are no more fpdus on the second buffer, two buffers need to be provided as a cookie. To avoid changing the ll2 interface to provide two cookies, we introduce a piggy buf pointer, relevant for iWARP only, that holds a pointer to a second buffer that needs to be released during tx completion. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add unaligned and packed packet processingMichal Kalderon2-0/+270
The fpdu data structure is preallocated per connection. Each connection stores the current status of the connection: either nothing pending, or there is a partial fpdu that is waiting for the rest of the fpdu (incomplete bytes != 0). The same structure is also used for splitting a packet when there are packed fpdus. The structure is initialized with all data required for sending the fpdu back to the FW. A fpdu will always be spanned across a maximum of 3 tx bds. One for the header, one for the partial fdpu received and one for the remainder (unaligned) packet. In case of packed fpdu's, two fragments are used, one for the header and one for the data. Corner cases are not handled in the patch for clarity, and will be added as a separate patch. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add mpa buffer descriptors for storing and processing mpa fpdusMichal Kalderon2-0/+127
The mpa buff is a descriptor for iwarp ll2 buffers that contains additional information required for aligining fpdu's. In some cases, an additional packet will arrive which will complete the alignment of a fpdu, but we won't be able to post the fpdu due to insufficient place on the tx ring. In this case we can't loose the data and require storing it for later. Processing is therefore done in two places, during rx completion, where we initialize a mpa buffer descriptor and add it to the pending list, and during tx-completion, since we free up an entry in the tx chain we can process any pending mpa packets. The mpa buff descriptors are pre-allocated since we have to ensure that we won't reach a state where we can't store an incoming unaligned packet. All packets received on the ll2 MUST be processed by the driver at some stage. Since they are preallocated, we hold a free list. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add ll2 connection for processing unaligned MPA packetsMichal Kalderon2-0/+66
This patch adds only the establishment and termination of the ll2 connection that handles unaligned MPA packets. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09qed: Add LL2 slowpath handlingMichal Kalderon2-2/+43
For iWARP unaligned MPA flow, a slowpath event of flushing an MPA connection that entered an unaligned state is required. The flush ramrod is received on the ll2 queue, and a pre-registered callback function is called to handle the flush event. Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>