2018-03-31  inet: frags: get rid of ipfrag_skb_cb/FRAG_CB  (Eric Dumazet, 2 files, -21/+15)
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately this integer is currently in a different cache line than skb->next, meaning that we use two cache lines per skb when finding the insertion point. By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields in a single cache line and save precious memory bandwidth. Note that after the fast path added by Changli Gao in commit d6bebca92c66 ("fragment: add fast path for in-order fragments") this change won't help the fast path, since we still need to access prev->len (2nd cache line), but it will show great benefits when the slow path is entered, since we perform a linear scan of a potentially long list. Also note that this potentially long list is an attack vector; we might also consider using an rb-tree there eventually.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
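A hedged sketch of the aliasing idea in plain C: a union lets the scratch offset share storage, and therefore a cache line, with a pointer that is unused while the buffer sits in the reassembly queue. Field names are illustrative; this is not the actual sk_buff layout.

	struct buf_sketch {
		struct buf_sketch *next;	/* queue linkage, first cache line */
		union {
			void *dev;		/* owning device while the buffer is in flight */
			int ip_defrag_offset;	/* reused while queued for reassembly */
		};
		unsigned int len;
	};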
2018-03-31  inet: frags: reorganize struct netns_frags  (Eric Dumazet, 1 file, -4/+5)
Put the read-mostly fields in a separate cache line at the beginning of struct netns_frags, to reduce false sharing noticed in inet_frag_kill() Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
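A hedged sketch of the layout idea: read-mostly configuration comes first, and the frequently written state is pushed onto its own cache line with ____cacheline_aligned_in_smp. Field names are illustrative, not the exact netns_frags definition.

	#include <linux/atomic.h>
	#include <linux/cache.h>

	struct netns_frags_sketch {
		/* Read-mostly: limits and timeouts, shared cleanly between cpus. */
		long	high_thresh;
		long	low_thresh;
		int	timeout;

		/* The frequently dirtied counter starts a new cache line, so
		 * writers do not invalidate the line holding the limits above.
		 */
		atomic_long_t	mem ____cacheline_aligned_in_smp;
	};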
2018-03-31  rhashtable: reorganize struct rhashtable layout  (Eric Dumazet, 1 file, -4/+4)
While under frags DDOS I noticed unfortunate false sharing between @nelems and @params.automatic_shrinking. Move @nelems to the end of struct rhashtable so that the first cache line is shared between all cpus, because it is almost never dirtied.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  ipv6: frags: rewrite ip6_expire_frag_queue()  (Eric Dumazet, 1 file, -8/+16)
Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: do not clone skb in ip_expire()  (Eric Dumazet, 1 file, -10/+6)
An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()"). While fixing the bug at that time, it also added a very high cost for DDOS frags, as the ICMP rate limit is applied after this expensive operation (skb_clone() + consume_skb(), implying memory allocations, copy, and freeing). We can use skb_get(head) here; all we want is to make sure the skb won't be freed by another cpu.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
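A hedged sketch of the pattern: pin the head skb with skb_get() instead of cloning it, drop the queue lock, let the (rate-limited) ICMP path run, then release the extra reference. This is simplified from the real ip_expire(); the queue and lock types are illustrative.

	#include <linux/skbuff.h>
	#include <linux/spinlock.h>
	#include <net/icmp.h>

	struct frag_queue_sketch {
		spinlock_t lock;
		/* ... fragments, timer, refcount ... */
	};

	static void frag_expire_sketch(struct frag_queue_sketch *fq, struct sk_buff *head)
	{
		skb_get(head);			/* reference only: no allocation, no copy */
		spin_unlock(&fq->lock);

		icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);

		kfree_skb(head);		/* drop the reference taken above */
		spin_lock(&fq->lock);
	}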
2018-03-31  inet: frags: break the 2GB limit for frags storage  (Eric Dumazet, 8 files, -32/+32)
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch also avoids an overflow error if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true:

if (... || frag_mem_limit(nf) > nf->high_thresh)

Tested:

$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
<frag DDOS>
$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880
$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds    3317150    0.0
IpReasmFails    3317112    0.0

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
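A hedged sketch of the accounting switch, modeled loosely on the frag_mem_limit()/add_frag_mem_limit() helpers: with atomic_long_t both the counter and the threshold are a native long, so thresholds above 2GB work on 64bit arches while 32bit arches pay nothing extra. Names are illustrative.

	#include <linux/atomic.h>

	struct frag_mem_sketch {
		atomic_long_t	mem;		/* bytes currently used for fragments */
		long		high_thresh;	/* may now exceed 2GB on 64bit arches */
	};

	static long frag_mem_limit_sketch(struct frag_mem_sketch *nf)
	{
		return atomic_long_read(&nf->mem);
	}

	static void add_frag_mem_limit_sketch(struct frag_mem_sketch *nf, long val)
	{
		atomic_long_add(val, &nf->mem);
	}

	static bool frag_mem_over_limit_sketch(struct frag_mem_sketch *nf)
	{
		return frag_mem_limit_sketch(nf) > nf->high_thresh;
	}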
2018-03-31  inet: frags: remove inet_frag_maybe_warn_overflow()  (Eric Dumazet, 6 files, -25/+8)
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: get rid of inet_frag_evicting()  (Eric Dumazet, 3 files, -42/+32)
This allows a refactor of ip_expire(), removing one indentation level. Note: in the future, we should try hard to avoid the skb_clone(), since this is a serious performance cost. Under DDOS, the ICMP message won't be sent because of rate limits. The fact that ip6_expire_frag_queue() does not use skb_clone() is disturbing too. Presumably IPv6 should have the same issue as the one we fixed in commit ec4fbd64751d ("inet: frag: release spinlock before calling icmp_send()")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: remove some helpers  (Eric Dumazet, 6 files, -23/+6)
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem(). Also, since we use rhashtable we can bring back the number of fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was removed in commit 434d305405ab ("inet: frag: don't account number of fragment queues").
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: use rhashtables for reassembly units  (Eric Dumazet, 9 files, -573/+265)
Some applications still rely on IP fragmentation, and to be fair the Linux reassembly unit does not hold up under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage-collect items when the host is under memory pressure, and to do a hash rebuild, changing the seed used in the hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if the host is under fire. Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and to allocate one of them per netns to speed up netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from the prior implementation and save a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout)

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Florian Westphal <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Alexander Aring <[email protected]>
Cc: Stefan Schmidt <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
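A hedged sketch of what a per-netns rhashtable of reassembly queues can look like: the queue embeds an rhash_head and a fixed-size key, and lookups run lock-free under RCU. The key layout, parameters and struct fields are illustrative, not the exact ones used by the patch.

	#include <linux/rhashtable.h>
	#include <linux/stddef.h>
	#include <linux/types.h>

	struct frag_queue_sketch {
		struct rhash_head	node;
		struct frag_v4_key {
			__be32 saddr;
			__be32 daddr;
			__be16 id;
			u8     protocol;
		} key;
		/* ... fragment list, timer, lock, refcount ... */
	};

	static const struct rhashtable_params frag_rht_params = {
		.head_offset		= offsetof(struct frag_queue_sketch, node),
		.key_offset		= offsetof(struct frag_queue_sketch, key),
		.key_len		= sizeof(struct frag_v4_key),
		.automatic_shrinking	= true,
	};

	/* Lookups run under RCU; one table is allocated per netns. */
	static struct frag_queue_sketch *frag_find_sketch(struct rhashtable *ht,
							  const struct frag_v4_key *key)
	{
		return rhashtable_lookup_fast(ht, key, frag_rht_params);
	}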
2018-03-31  rhashtable: add schedule points  (Eric Dumazet, 1 file, -0/+2)
Rehashing and destroying a large hash table takes a lot of time, and happens in process context. It is safe to add cond_resched() in rhashtable_rehash_table() and rhashtable_free_and_destroy().
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Herbert Xu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
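A hedged sketch of the schedule-point pattern, not the actual rhashtable code: in a long process-context loop, cond_resched() lets other tasks run between iterations and keeps the softlockup watchdog quiet.

	#include <linux/sched.h>

	struct bucket_sketch { void *slots; };

	static void free_bucket_sketch(struct bucket_sketch *b) { /* ... */ }

	static void free_all_buckets_sketch(struct bucket_sketch *buckets, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			free_bucket_sketch(&buckets[i]);
			cond_resched();		/* rehash/destroy can take a long time */
		}
	}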
2018-03-31  inet: frags: refactor ipfrag_init()  (Eric Dumazet, 1 file, -2/+2)
We need to call inet_frags_init() before register_pernet_subsys(), as a prereq for the following patch ("inet: frags: use rhashtables for reassembly units").
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
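A hedged sketch of the ordering this refactor establishes: the shared inet_frags state is set up before the pernet subsystem is registered, so per-netns init can rely on it. Function names follow the commit text; the error handling and return types are illustrative.

	#include <linux/init.h>
	#include <net/inet_frag.h>
	#include <net/net_namespace.h>

	static struct inet_frags ip4_frags_sketch;
	static struct pernet_operations ip4_frags_ops_sketch;

	static int __init ipfrag_init_sketch(void)
	{
		int ret;

		ret = inet_frags_init(&ip4_frags_sketch);	/* shared state first */
		if (ret)
			return ret;

		ret = register_pernet_subsys(&ip4_frags_ops_sketch);	/* per-netns init may now use it */
		if (ret)
			inet_frags_fini(&ip4_frags_sketch);
		return ret;
	}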
2018-03-31  inet: frags: refactor lowpan_net_frag_init()  (Eric Dumazet, 1 file, -9/+11)
We want to call lowpan_net_frag_init() earlier. Similar to commit "inet: frags: refactor ipv6_frag_init()" This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: refactor ipv6_frag_init()  (Eric Dumazet, 1 file, -11/+14)
We want to call inet_frags_init() earlier. This is a prereq to "inet: frags: use rhashtables for reassembly units" Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  inet: frags: add a pointer to struct netns_frags  (Eric Dumazet, 7 files, -41/+48)
In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex. These functions no longer have a struct inet_frags parameter:

inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
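A hedged sketch of why the back-pointer shrinks the API: once each queue can reach its inet_frags descriptor through its netns_frags, callers stop passing the descriptor explicitly. Types and names are illustrative.

	#include <linux/refcount.h>

	struct inet_frags_sketch;			/* the protocol descriptor */

	struct netns_frags_sketch {
		struct inet_frags_sketch *f;		/* back-pointer added by this patch */
		/* ... thresholds, counters ... */
	};

	struct inet_frag_queue_sketch {
		struct netns_frags_sketch *net;
		refcount_t refcnt;
		/* ... */
	};

	static void inet_frag_destroy_sketch(struct inet_frag_queue_sketch *q)
	{
		/* Free fragments and queue memory; q->net->f supplies the
		 * protocol-specific destructor, so no extra argument is needed.
		 */
	}

	/* Before: inet_frag_put(q, f).  After: only the queue is needed. */
	static void inet_frag_put_sketch(struct inet_frag_queue_sketch *q)
	{
		if (refcount_dec_and_test(&q->refcnt))
			inet_frag_destroy_sketch(q);
	}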
2018-03-31  inet: frags: change inet_frags_init_net() return value  (Eric Dumazet, 5 files, -12/+37)
We will soon initialize one rhashtable per struct netns_frags in inet_frags_init_net(). This patch changes the return value to eventually propagate an error. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  ipv6: frag: remove unused field  (Eric Dumazet, 1 file, -1/+0)
csum field in struct frag_queue is not used, remove it. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  Merge branch 'bnxt_en-next'  (David S. Miller, 9 files, -234/+706)
Michael Chan says: ==================== bnxt_en: Update for net-next. Misc. updates including updated firmware interface, some additional port statistics, a new IRQ assignment scheme for the RDMA driver, support for VF trust, and other changes and improvements for SRIOV. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Add ULP calls to stop and restart IRQs.  (Michael Chan, 3 files, -17/+90)
When the driver needs to re-initialize the IRQ vectors, we make the new ulp_irq_stop() call to tell the RDMA driver to disable and free the IRQ vectors. After the IRQ vectors have been re-initialized, we make the ulp_irq_restart() call to tell the RDMA driver that IRQs can be restarted.
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
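A hedged sketch of the shape such stop/restart hooks typically take; the names and signatures here are hypothetical, not the actual bnxt_ulp_ops.

	#include <linux/pci.h>

	struct ulp_irq_ops_sketch {
		void (*irq_stop)(void *handle);		/* RDMA driver releases its vectors */
		void (*irq_restart)(void *handle, const struct msix_entry *ent);
	};

	static void reinit_irqs_sketch(const struct ulp_irq_ops_sketch *ops, void *handle,
				       const struct msix_entry *ent)
	{
		ops->irq_stop(handle);
		/* ... the netdev driver re-allocates and re-maps its MSIX vectors ... */
		ops->irq_restart(handle, ent);		/* hand the new vector table back */
	}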
2018-03-31  bnxt_en: Reserve completion rings and MSIX for bnxt_re RDMA driver.  (Michael Chan, 3 files, -16/+65)
Add additional logic to reserve completion rings for the bnxt_re driver when it requests MSIX vectors. The function bnxt_cp_rings_in_use() will return the total number of completion rings used by both drivers that need to be reserved. If the network interface is up, we will close and open the NIC to reserve the new set of completion rings and re-initialize the vectors.
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Refactor bnxt_need_reserve_rings().  (Michael Chan, 1 file, -32/+25)
Refactor bnxt_need_reserve_rings() slightly so that __bnxt_reserve_rings() can call it and remove some duplicated code. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Add IRQ remapping logic.  (Michael Chan, 1 file, -17/+42)
Add remapping logic so that bnxt_en can use any arbitrary MSIX vectors. This will allow the driver to reserve one range of MSIX vectors to be used by both bnxt_en and bnxt_re. bnxt_en can now skip over the MSIX vectors used by bnxt_re. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Change IRQ assignment for RDMA driver.  (Michael Chan, 3 files, -3/+61)
In the current code, the range of MSIX vectors allocated for the RDMA driver is disjoint from the network driver. This creates a problem for the new firmware ring reservation scheme. The new scheme requires the reserved completion rings/MSIX vectors to be in a contiguous range. Change the logic to allocate RDMA MSIX vectors to be contiguous with the vectors used by bnxt_en on new firmware using the new scheme. The new function bnxt_get_num_msix() calculates the exact number of vectors needed by both drivers. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Improve ring allocation logic.  (Michael Chan, 2 files, -15/+21)
Currently, the driver code makes some assumptions about the group index and the map index of rings. This makes the code more difficult to understand and less flexible. Improve it by adding the grp_idx and map_idx fields explicitly to the bnxt_ring_struct as a union. The grp_idx is initialized for each tx ring and rx agg ring during init time. We do the same for the map_idx for each cmpl ring. The grp_idx ties the tx ring to the ring group. The map_idx is the doorbell index of the ring. With this new infrastructure, we can change the ring index mapping scheme easily in the future.
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
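A hedged sketch of the union described above; illustrative only, not the actual bnxt_ring_struct.

	#include <linux/types.h>

	struct ring_struct_sketch {
		/* ... ring memory, producer/consumer indices ... */
		union {
			u16 grp_idx;	/* tx / rx-agg rings: index into the ring group array */
			u16 map_idx;	/* completion rings: doorbell index of the ring */
		};
	};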
2018-03-31  bnxt_en: Improve valid bit checking in firmware response message.  (Michael Chan, 2 files, -5/+18)
When firmware sends a DMA response to the driver, the last byte of the message will be set to 1 to indicate that the whole response is valid. The driver waits for the message to be valid before reading the message. The firmware spec allows these response messages to increase in length by adding new fields to the end of these messages. The older spec's valid location may become a new field in a newer spec. To guarantee compatibility, the driver should zero the valid byte before interpreting the entire message so that any new fields not implemented by the older spec will be read as zero. For messages that are forwarded to VFs, we need to set the length and re-instate the valid bit so the VF will see the valid response. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
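A hedged sketch of the polling and zeroing described above, heavily simplified: the real driver adds memory barriers, separate short and long timeouts, and works on its own response structures.

	#include <linux/delay.h>
	#include <linux/errno.h>
	#include <linux/types.h>

	/* resp points at the DMA response buffer, len is the response length
	 * reported by firmware; by convention the last byte is the valid flag.
	 */
	static int wait_resp_valid_sketch(u8 *resp, unsigned int len, unsigned int tmo_us)
	{
		volatile u8 *valid = resp + len - 1;
		unsigned int waited = 0;

		while (!*valid) {		/* firmware sets this once the DMA completed */
			if (waited++ >= tmo_us)
				return -ETIMEDOUT;
			udelay(1);
		}
		/* Zero the valid byte before interpreting the message, so a field
		 * that a newer spec defines at this offset is later read as zero.
		 */
		*valid = 0;
		return 0;
	}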
2018-03-31  bnxt_en: Improve resource accounting for SRIOV.  (Michael Chan, 1 file, -10/+8)
When VFs are created, the current code subtracts the maximum VF resources from the PF's pool. This under-estimates the resources remaining in the PF pool. Instead, we should subtract the minimum VF resources. The VF minimum resources are guaranteed to the VFs and only these should be subtracted from the PF's pool. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Check max_tx_scheduler_inputs value from firmware.  (Michael Chan, 3 files, -2/+19)
When checking for the maximum pre-set TX channels for ethtool -l, we need to check the current max_tx_scheduler_inputs parameter from firmware. This parameter specifies the max input for the internal QoS nodes currently available to this function. The function's TX rings will be capped by this parameter. By adding this logic, we provide a more accurate pre-set max TX channels to the user. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Add extended port statistics support  (Vasundhara Volam, 3 files, -2/+81)
Gather periodic extended port statistics, if the device is PF and link is up. Signed-off-by: Vasundhara Volam <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Include additional hardware port statistics in ethtool -S.  (Vasundhara Volam, 1 file, -0/+5)
Include additional hardware port statistics in ethtool -S, which are useful for debugging. Signed-off-by: Vasundhara Volam <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Add support for ndo_set_vf_trust  (Vasundhara Volam, 4 files, -9/+37)
Trusted VFs are allowed to modify MAC address, even when PF has assigned one. Signed-off-by: Vasundhara Volam <[email protected]> Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: fix clear flags in ethtool reset handling  (Scott Branden, 1 file, -2/+6)
Clear the flags when the reset command is processed successfully for the specified components.
Fixes: 6502ad5963a5 ("bnxt_en: Add ETH_RESET_AP support")
Signed-off-by: Scott Branden <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Use a dedicated VNIC mode for RDMA.  (Michael Chan, 2 files, -4/+15)
If the RDMA driver is registered, use a new VNIC mode that allows RDMA traffic to be seen on the netdev in promiscuous mode. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Adjust default rings for multi-port NICs.  (Michael Chan, 1 file, -3/+9)
Change the default ring logic to select up to 8 rings per port, provided that (default rings x NIC ports) <= total CPUs.
Signed-off-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  bnxt_en: Update firmware interface to 1.9.1.15.  (Michael Chan, 4 files, -103/+210)
Minor changes, such as new extended port statistics. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  vlan: vlan_hw_filter_capable() can be static  (Wei Yongjun, 1 file, -1/+1)
Fixes the following sparse warning: net/8021q/vlan_core.c:168:6: warning: symbol 'vlan_hw_filter_capable' was not declared. Should it be static? Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  Merge tag 'mlx5-updates-2018-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux  (David S. Miller, 12 files, -447/+409)
Saeed Mahameed says:

====================
mlx5-updates-2018-03-30

This series contains updates to the mlx5 core and mlx5e netdev drivers. The main highlight of this series is the RX optimizations for the striding RQ path, introduced by Tariq.

The first four patches are trivial misc cleanups:
 - Spelling mistake fix
 - Dead code removal
 - Warning messages

RX optimizations for striding RQ:

1) RX refactoring, cleanups and micro optimizations
   - MTU calculation simplifications; obsoletes some WQEs-to-packets translation functions and helps delete ~60 LOC.
   - Do not busy-wait a pending UMR completion.
   - Post the new values of the UMR WQE inline, instead of using a data pointer.
   - Use pre-initialized structures to save calculations in the datapath.

2) Use linear SKB in striding RQ "build_skb". Using a linear SKB has many advantages:
   - Saves a memcpy of the headers.
   - No page-boundary checks in the datapath.
   - No filler CQEs.
   - Significantly smaller CQ.
   - SKB data continuously resides in the linear part, and is not split into a small amount (linear part) and a large amount (fragment). This saves datapath cycles in the driver and improves utilization of SKB fragments in GRO.
   - The fragments of a resulting GRO SKB follow the IP forwarding assumption of equal-size fragments.

   Implementation details: HW writes the packets to the beginning of a stride, i.e. it does not keep headroom. To overcome this we make sure we can extend backwards and use the last bytes of stride i-1. Extra care is needed for stride 0 as it has no preceding stride. We make sure headroom bytes are available by shifting the buffer pointer passed to HW by headroom bytes.

   This configuration now becomes the default, whenever capable. Of course, this implies turning LRO off.

   Performance testing: ConnectX-5, single core, single RX ring, default MTU.
   UDP packet rate, early drop in TC layer:

   --------------------------------------------
   | pkt size | before    | after     | ratio |
   --------------------------------------------
   | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
   |  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
   |   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
   --------------------------------------------

   TCP streams: ~20% gain

3) Support XDP over striding RQ: now that linear SKB is supported over striding RQ, we can support XDP by setting the stride size to PAGE_SIZE and headroom to XDP_PACKET_HEADROOM. Striding RQ is capable of a higher packet rate than conventional RQ.

   Performance testing: ConnectX-5, 24 rings, default MTU. CQE compression ON (to reduce completions BW in PCI).
   XDP_DROP packet rate:

   --------------------------------------------------
   | pkt size | XDP rate   | 100GbE linerate | pct% |
   --------------------------------------------------
   |  64byte  | 126.2 Mpps | 148.0 Mpps      |  85% |
   | 128byte  |  80.0 Mpps |  84.8 Mpps      |  94% |
   | 256byte  |  42.7 Mpps |  42.7 Mpps      | 100% |
   | 512byte  |  23.4 Mpps |  23.4 Mpps      | 100% |
   --------------------------------------------------

4) Remove mlx5 page_ref bulking in striding RQ and use page_ref_inc only when needed. Without this bulking, we have:
   - No atomic ops on WQE allocation or free.
   - One atomic op per SKB.
   - In the default MTU configuration (1500, stride size is 2K), the non-bulking method executes 2 atomic ops as before.
   - For larger MTUs with stride size of 4K, the non-bulking method executes only a single op.
   - For XDP (stride size of 4K, no SKBs), the non-bulking method has no atomic ops per packet at all.

   Performance testing: ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Single core packet rate (64 bytes).
   Early drop in TC: no degradation.
   XDP_DROP: before: 14,270,188 pps; after: 20,503,603 pps, 43% improvement.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-03-31  Merge tag 'rxrpc-next-20180330' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs  (David S. Miller, 20 files, -71/+509)
David Howells says:

====================
rxrpc: Fixes and more traces

Here are some patches that add some more tracepoints to AF_RXRPC and fix some issues therein:

 (1) Fix the use of VERSION packets to keep firewall routes open.
 (2) Fix the incorrect current time usage in a tracepoint.
 (3) Fix Tx ring annotation corruption.
 (4) Fix accidental conversion of call-level abort into connection-level abort.
 (5) Fix calculation of resend time.
 (6) Remove a couple of unused variables.
 (7) Fix a bunch of checker warnings and an error. Note that not all warnings can be quashed as checker doesn't seem to correctly handle seqlocks.
 (8) Fix a potential race between call destruction and socket/net destruction.
 (9) Add a tracepoint to track rxrpc_local refcounting.
(10) Fix an apparent leak of rxrpc_local objects.
(11) Add a tracepoint to track rxrpc_peer refcounting.
(12) Fix a leak of rxrpc_peer objects.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-03-31  hv_netvsc: Clean up extra parameter from rndis_filter_receive_data()  (Haiyang Zhang, 1 file, -7/+9)
The variables, msg and data, have the same value. This patch removes the extra one. Signed-off-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  ethernet: hisilicon: hns: hns_dsaf_mac: Use generic eth_broadcast_addr  (Joe Perches, 1 file, -4/+2)
Rather than use an on-stack array to copy a broadcast address, use the generic eth_broadcast_addr function to save a trivial amount of object code. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  Merge branch 'net_rwsem-fixes'  (David S. Miller, 5 files, -8/+8)
Kirill Tkhai says:

====================
net_rwsem fixes

There is a wext_netdev_notifier_call()->wireless_nlevent_flush() netdevice notifier, which takes net_rwsem, so we can't take net_rwsem in {,un}register_netdevice_notifier(). Since {,un}register_netdevice_notifier() is executed under pernet_ops_rwsem, net_namespace_list can't change while we hold it, so there is no need for net_rwsem in these functions [1/2].

The same applies in [2/2]. We make callers of __rtnl_link_unregister() take pernet_ops_rwsem and close the race with setup_net() and cleanup_net(), so __rtnl_link_unregister() does not need it. This also fixes the problem that __rtnl_link_unregister() does not see initializing and exiting nets.
====================

Signed-off-by: David S. Miller <[email protected]>
2018-03-31  net: Do not take net_rwsem in __rtnl_link_unregister()  (Kirill Tkhai, 4 files, -3/+8)
This function calls call_netdevice_notifier(), which also may take net_rwsem. So, we can't use net_rwsem here. This patch makes callers of this function take pernet_ops_rwsem, like register_netdevice_notifier() does. This will protect the modifications of net_namespace_list, and allows notifiers to take it (they won't have to care about context). Since __rtnl_link_unregister() is used on module load and unload (which are not frequent operations), this looks better to me than making every call_netdevice_notifier() always execute in "protected net_namespace_list" context. Also, this fixes the problem we dealt with in 328fbe747ad4 "Close race between {un, }register_netdevice_notifier and ...", and guarantees that __rtnl_link_unregister() does not skip an exiting net.
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  net: Remove net_rwsem from {, un}register_netdevice_notifier()  (Kirill Tkhai, 1 file, -5/+0)
These functions take net_rwsem, while wireless_nlevent_flush() also takes it. But down_read() can't be taken recursively, because of the rw_semaphore design, which prevents it from being held only by readers forever. Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(), the net list can't change, so these down_read()/up_read() can be removed.
Fixes: f0b07bb151b0 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  net: hns3: remove unnecessary pci_set_drvdata() and devm_kfree()  (Wei Yongjun, 1 file, -4/+0)
There is no need for explicit calls of devm_kfree(), as the allocated memory will be freed during the driver's detach. The driver core clears the driver data to NULL after device_release. Thus, there is no need to manually clear the device driver data to NULL. So remove the unnecessary pci_set_drvdata() and devm_kfree().
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  netdevsim: Change nsim_devlink_setup to return error to caller  (David Ahern, 3 files, -8/+15)
Change nsim_devlink_setup to return any error back to the caller and update nsim_init to handle it. Requested-by: Jakub Kicinski <[email protected]> Signed-off-by: David Ahern <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31  Merge branch 'tipc-slim-down-name-table'  (David S. Miller, 11 files, -699/+556)
Jon Maloy says:

====================
tipc: slim down name table

We clean up and improve the name binding table:
 - Replace the memory consuming 'sub_sequence/service range' array with an RB tree.
 - Introduce support for overlapping service sequences/ranges

v2:
 #1: Fixed a missing initialization reported by David Miller
 #4: Obsoleted and replaced a few more macros to get a consistent terminology in the API.
 #5: Added new commit to fix a potential string overflow bug (it is still only in net-next) reported by Arnd Bergmann
====================

Signed-off-by: David S. Miller <[email protected]>
2018-03-31  tipc: avoid possible string overflow  (Jon Maloy, 2 files, -2/+3)
gcc points out that the combined length of the fixed-length inputs to l->name is larger than the destination buffer size:

net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a region of size between 26 and 58 [-Werror=format-overflow=]
  sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75) into a destination of size 60
  sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of a link name is:

  max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name
  = 16 + 1 + 15 + 1 + 16 + 1 + 15
  = 65

Since we also need space for a trailing zero we now set MAX_LINK_NAME to 68. Just to be on the safe side we also replace the sprintf() call with snprintf().

Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
Reported-by: Arnd Bergmann <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
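A hedged sketch of the bounded formatting the commit describes; the constant and the format string mirror the commit text, everything around them is omitted.

	#include <linux/kernel.h>

	#define MAX_LINK_NAME_SKETCH	68	/* worst case 65 characters plus trailing NUL, rounded up */

	static void tipc_make_link_name_sketch(char *name,	/* at least MAX_LINK_NAME_SKETCH bytes */
					       const char *self_str, const char *if_name,
					       const char *peer_str)
	{
		/* snprintf() never writes past the destination, unlike the old sprintf(). */
		snprintf(name, MAX_LINK_NAME_SKETCH, "%s:%s-%s:unknown",
			 self_str, if_name, peer_str);
	}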
2018-03-31  tipc: rename address types in user api  (Jon Maloy, 1 file, -24/+33)
The three address type structs in the user API have names that in reality reflect the specific, non-Linux environment where they were originally created. We now give them more intuitive names, in accordance with how TIPC is described in the current documentation.

  struct tipc_portid   -> struct tipc_socket_addr
  struct tipc_name     -> struct tipc_service_addr
  struct tipc_name_seq -> struct tipc_service_range

To avoid confusion, we also update some comments and macro names to match the new terminology. For compatibility, we add macros that map all old names to the new ones.
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  tipc: permit overlapping service ranges in name table  (Jon Maloy, 7 files, -111/+60)
With the new RB tree structure for service ranges it becomes possible to solve an old problem: we can now allow overlapping service ranges in the table. When inserting a new service range into the tree, we use 'lower' as the primary key, and when necessary 'upper' as the secondary key. Since there may now be multiple service ranges matching an indicated 'lower' value, we must also add the 'upper' value to the functions used for removing publications, so that the correct, corresponding range item can be found. These changes guarantee that a well-formed publication/withdrawal item from a peer node will never be rejected, and make it possible to eliminate the problematic backlog functionality we currently have for handling such cases.
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
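A hedged sketch of the keying rule: ranges are ordered primarily by 'lower' and, when equal, by 'upper', so overlapping ranges can coexist in the tree and a (lower, upper) pair identifies the item to remove. Types are illustrative, not the actual tipc service_range.

	#include <linux/rbtree.h>
	#include <linux/types.h>

	struct service_range_sketch {
		u32 lower;
		u32 upper;
		struct rb_node tree_node;
		/* the list of publications bound to this exact range would hang here */
	};

	static void range_insert_sketch(struct rb_root *root, struct service_range_sketch *new)
	{
		struct rb_node **link = &root->rb_node, *parent = NULL;

		while (*link) {
			struct service_range_sketch *cur;

			parent = *link;
			cur = rb_entry(parent, struct service_range_sketch, tree_node);
			if (new->lower < cur->lower)
				link = &parent->rb_left;
			else if (new->lower > cur->lower)
				link = &parent->rb_right;
			else if (new->upper < cur->upper)	/* same lower: order by upper */
				link = &parent->rb_left;
			else
				link = &parent->rb_right;
		}
		rb_link_node(&new->tree_node, parent, link);
		rb_insert_color(&new->tree_node, root);
	}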
2018-03-31  tipc: refactor name table translate function  (Jon Maloy, 1 file, -36/+25)
The tipc_nametbl_translate() function is ugly and hard to follow. This can be improved somewhat by introducing a stack variable for holding the publication list to be used and by re-ordering the if-clauses for selection of algorithm.
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2018-03-31  tipc: replace name table service range array with rb tree  (Jon Maloy, 6 files, -568/+477)
The current design of the binding table has an unnecessarily memory-consuming and complex data structure. It aggregates the service range items into an array, which is expanded by a factor of two every time it becomes too small to hold a new item. Furthermore, the arrays never shrink when the number of ranges diminishes. We now replace this array with an RB tree that holds the range items as tree nodes, each range directly holding a list of bindings. This, along with a few name changes, improves both the readability and the volume of the code, as well as reducing memory consumption and hopefully improving the cache hit rate.
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>