aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2014-09-29netfilter: nf_tables: store and dump set policyArturo Borrero2-0/+8
We want to know in which cases the user explicitly sets the policy options. In that case, we also want to dump back the info. Signed-off-by: Arturo Borrero Gonzalez <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2014-09-29Bluetooth: 6lowpan: Make sure skb exists before accessing itJukka Rissanen1-0/+6
We need to make sure that the saved skb exists when resuming or suspending a CoC channel. This can happen if initial credits is 0 when channel is connected. Signed-off-by: Jukka Rissanen <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
2014-09-29Merge branch 'qca7000_spi'David S. Miller13-0/+2048
Stefan Wahren says: ==================== add Qualcomm QCA7000 ethernet driver This patch series adds support for the Qualcomm QCA7000 Homeplug GreenPHY. The QCA7000 is serial-to-powerline bridge with two interfaces: UART and SPI. These patches handles only the last one, with an Ethernet over SPI protocol driver. This driver based on the Qualcomm code [1], but contains a lot of changes since last year: * devicetree support * DebugFS support * ethtool support * better error handling * performance improvements * code cleanup * some bugfixes The code has been tested only on Freescale i.MX28 boards, but should work on other platforms. [1] - https://github.com/IoE/qca7000 Changes in V3: - Use ether_addr_copy instead of memcpy - Remove qcaspi_set_mac_address - Improve DT parsing - replace OF_GPIO dependancy with OF - fix compile error caused by SET_ETHTOOL_OPS - fix possible endless loop when spi read fails - fix DT documentation - fix coding style - fix sparse warnings Changes in V2: - replace in DT the SPI intr GPIO with pure interrupt - make legacy mode a boolean DT property and remove it as module parameter - make burst length a module parameter instead of DT property - make pluggable a module parameter instead of DT property - improve DT documentation - replace debugFS register dump with ethtool function - replace debugFS stats with ethtool function - implement function to get ring parameter via ethtool - implement function to set TX ring count via ethtool - fix TX ring state in debugFS - optimize tx ring flush - add byte limit for TX ring to avoid bufferbloat - fix TX queue full and write buffer miss counter - fix SPI clk speed module parameter - fix possible packet loss - fix possible race during transmit ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: qualcomm: new Ethernet over SPI driver for QCA7000Stefan Wahren12-0/+2001
This patch adds the Ethernet over SPI driver for the Qualcomm QCA7000 HomePlug GreenPHY. Signed-off-by: Stefan Wahren <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29Documentation: add Device tree bindings for QCA7000Stefan Wahren1-0/+47
This patch adds the Device tree bindings for the Ethernet over SPI protocol driver of the Qualcomm QCA7000 HomePlug GreenPHY. Signed-off-by: Stefan Wahren <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29Merge branch 'dctcp'David S. Miller12-78/+574
Daniel Borkmann says: ==================== net: tcp: DCTCP congestion control algorithm This patch series adds support for the DataCenter TCP (DCTCP) congestion control algorithm. Please see individual patches for the details. The last patch adds DCTCP as a congestion control module, and previous ones add needed infrastructure to extend the congestion control framework. Joint work between Florian Westphal, Daniel Borkmann and Glenn Judd. v3 -> v2: - No changes anywhere, just a resend as requested by Dave - Added Stephen's ACK v1 -> v2: - Rebased to latest net-next - Addressed Eric's feedback, thanks! - Update stale comment wrt. DCTCP ECN usage - Don't call INET_ECN_xmit for every packet - Add dctcp ss/inetdiag support to expose internal stats to userspace ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: tcp: add DCTCP congestion control algorithmDaniel Borkmann6-3/+425
This work adds the DataCenter TCP (DCTCP) congestion control algorithm [1], which has been first published at SIGCOMM 2010 [2], resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more recently as an informational IETF draft available at [4]). DCTCP is an enhancement to the TCP congestion control algorithm for data center networks. Typical data center workloads are i.e. i) partition/aggregate (queries; bursty, delay sensitive), ii) short messages e.g. 50KB-1MB (for coordination and control state; delay sensitive), and iii) large flows e.g. 1MB-100MB (data update; throughput sensitive). DCTCP has therefore been designed for such environments to provide/achieve the following three requirements: * High burst tolerance (incast due to partition/aggregate) * Low latency (short flows, queries) * High throughput (continuous data updates, large file transfers) with commodity, shallow buffered switches The basic idea of its design consists of two fundamentals: i) on the switch side, packets are being marked when its internal queue length > threshold K (K is chosen so that a large enough headroom for marked traffic is still available in the switch queue); ii) the sender/host side maintains a moving average of the fraction of marked packets, so each RTT, F is being updated as follows: F := X / Y, where X is # of marked ACKs, Y is total # of ACKs alpha := (1 - g) * alpha + g * F, where g is a smoothing constant The resulting alpha (iow: probability that switch queue is congested) is then being used in order to adaptively decrease the congestion window W: W := (1 - (alpha / 2)) * W The means for receiving marked packets resp. marking them on switch side in DCTCP is the use of ECN. RFC3168 describes a mechanism for using Explicit Congestion Notification from the switch for early detection of congestion, rather than waiting for segment loss to occur. However, this method only detects the presence of congestion, not the *extent*. In the presence of mild congestion, it reduces the TCP congestion window too aggressively and unnecessarily affects the throughput of long flows [4]. DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN) processing to estimate the fraction of bytes that encounter congestion, rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate [4], thus it can derive multibit feedback from the information present in the single-bit sequence of marks in its control law. And thus act in *proportion* to the extent of congestion, not its *presence*. Switches therefore set the Congestion Experienced (CE) codepoint in packets when internal queue lengths exceed threshold K. Resulting, DCTCP delivers the same or better throughput than normal TCP, while using 90% less buffer space. It was found in [2] that DCTCP enables the applications to handle 10x the current background traffic, without impacting foreground traffic. Moreover, a 10x increase in foreground traffic did not cause any timeouts, and thus largely eliminates TCP incast collapse problems. The algorithm itself has already seen deployments in large production data centers since then. We did a long-term stress-test and analysis in a data center, short summary of our TCP incast tests with iperf compared to cubic: This test measured DCTCP throughput and latency and compared it with CUBIC throughput and latency for an incast scenario. In this test, 19 senders sent at maximum rate to a single receiver. The receiver simply ran iperf -s. The senders ran iperf -c <receiver> -t 30. All senders started simultaneously (using local clocks synchronized by ntp). This test was repeated multiple times. Below shows the results from a single test. Other tests are similar. (DCTCP results were extremely consistent, CUBIC results show some variance induced by the TCP timeouts that CUBIC encountered.) For this test, we report statistics on the number of TCP timeouts, flow throughput, and traffic latency. 1) Timeouts (total over all flows, and per flow summaries): CUBIC DCTCP Total 3227 25 Mean 169.842 1.316 Median 183 1 Max 207 5 Min 123 0 Stddev 28.991 1.600 Timeout data is taken by measuring the net change in netstat -s "other TCP timeouts" reported. As a result, the timeout measurements above are not restricted to the test traffic, and we believe that it is likely that all of the "DCTCP timeouts" are actually timeouts for non-test traffic. We report them nevertheless. CUBIC will also include some non-test timeouts, but they are drawfed by bona fide test traffic timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing TCP timeouts. DCTCP reduces timeouts by at least two orders of magnitude and may well have eliminated them in this scenario. 2) Throughput (per flow in Mbps): CUBIC DCTCP Mean 521.684 521.895 Median 464 523 Max 776 527 Min 403 519 Stddev 105.891 2.601 Fairness 0.962 0.999 Throughput data was simply the average throughput for each flow reported by iperf. By avoiding TCP timeouts, DCTCP is able to achieve much better per-flow results. In CUBIC, many flows experience TCP timeouts which makes flow throughput unpredictable and unfair. DCTCP, on the other hand, provides very clean predictable throughput without incurring TCP timeouts. Thus, the standard deviation of CUBIC throughput is dramatically higher than the standard deviation of DCTCP throughput. Mean throughput is nearly identical because even though cubic flows suffer TCP timeouts, other flows will step in and fill the unused bandwidth. Note that this test is something of a best case scenario for incast under CUBIC: it allows other flows to fill in for flows experiencing a timeout. Under situations where the receiver is issuing requests and then waiting for all flows to complete, flows cannot fill in for timed out flows and throughput will drop dramatically. 3) Latency (in ms): CUBIC DCTCP Mean 4.0088 0.04219 Median 4.055 0.0395 Max 4.2 0.085 Min 3.32 0.028 Stddev 0.1666 0.01064 Latency for each protocol was computed by running "ping -i 0.2 <receiver>" from a single sender to the receiver during the incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure that traffic traversed the DCTCP queue and was not dropped when the queue size was greater than the marking threshold. The summary statistics above are over all ping metrics measured between the single sender, receiver pair. The latency results for this test show a dramatic difference between CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer which incurs the maximum queue latency (more buffer memory will lead to high latency.) DCTCP, on the other hand, deliberately attempts to keep queue occupancy low. The result is a two orders of magnitude reduction of latency with DCTCP - even with a switch with relatively little RAM. Switches with larger amounts of RAM will incur increasing amounts of latency for CUBIC, but not for DCTCP. 4) Convergence and stability test: This test measured the time that DCTCP took to fairly redistribute bandwidth when a new flow commences. It also measured DCTCP's ability to remain stable at a fair bandwidth distribution. DCTCP is compared with CUBIC for this test. At the commencement of this test, a single flow is sending at maximum rate (near 10 Gbps) to a single receiver. One second after that first flow commences, a new flow from a distinct server begins sending to the same receiver as the first flow. After the second flow has sent data for 10 seconds, the second flow is terminated. The first flow sends for an additional second. Ideally, the bandwidth would be evenly shared as soon as the second flow starts, and recover as soon as it stops. The results of this test are shown below. Note that the flow bandwidth for the two flows was measured near the same time, but not simultaneously. DCTCP performs nearly perfectly within the measurement limitations of this test: bandwidth is quickly distributed fairly between the two flows, remains stable throughout the duration of the test, and recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth fairly, and has trouble remaining stable. CUBIC DCTCP Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2 0 9.93 0 0 9.92 0 0.5 9.87 0 0.5 9.86 0 1 8.73 2.25 1 6.46 4.88 1.5 7.29 2.8 1.5 4.9 4.99 2 6.96 3.1 2 4.92 4.94 2.5 6.67 3.34 2.5 4.93 5 3 6.39 3.57 3 4.92 4.99 3.5 6.24 3.75 3.5 4.94 4.74 4 6 3.94 4 5.34 4.71 4.5 5.88 4.09 4.5 4.99 4.97 5 5.27 4.98 5 4.83 5.01 5.5 4.93 5.04 5.5 4.89 4.99 6 4.9 4.99 6 4.92 5.04 6.5 4.93 5.1 6.5 4.91 4.97 7 4.28 5.8 7 4.97 4.97 7.5 4.62 4.91 7.5 4.99 4.82 8 5.05 4.45 8 5.16 4.76 8.5 5.93 4.09 8.5 4.94 4.98 9 5.73 4.2 9 4.92 5.02 9.5 5.62 4.32 9.5 4.87 5.03 10 6.12 3.2 10 4.91 5.01 10.5 6.91 3.11 10.5 4.87 5.04 11 8.48 0 11 8.49 4.94 11.5 9.87 0 11.5 9.9 0 SYN/ACK ECT test: This test demonstrates the importance of ECT on SYN and SYN-ACK packets by measuring the connection probability in the presence of competing flows for a DCTCP connection attempt *without* ECT in the SYN packet. The test was repeated five times for each number of competing flows. Competing Flows 1 | 2 | 4 | 8 | 16 ------------------------------ Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0 Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0 As the number of competing flows moves beyond 1, the connection probability drops rapidly. Enabling DCTCP with this patch requires the following steps: DCTCP must be running both on the sender and receiver side in your data center, i.e.: sysctl -w net.ipv4.tcp_congestion_control=dctcp Also, ECN functionality must be enabled on all switches in your data center for DCTCP to work. The default ECN marking threshold (K) heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]). In above tests, for each switch port, traffic was segregated into two queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of 0x04 - the packet was placed into the DCTCP queue. All other packets were placed into the default drop-tail queue. For the DCTCP queue, RED/ECN marking was enabled, here, with a marking threshold of 75 KB. More details however, we refer you to the paper [2] under section 3). There are no code changes required to applications running in user space. DCTCP has been implemented in full *isolation* of the rest of the TCP code as its own congestion control module, so that it can run without a need to expose code to the core of the TCP stack, and thus nothing changes for non-DCTCP users. Changes in the CA framework code are minimal, and DCTCP algorithm operates on mechanisms that are already available in most Silicon. The gain (dctcp_shift_g) is currently a fixed constant (1/16) from the paper, but we leave the option that it can be chosen carefully to a different value by the user. In case DCTCP is being used and ECN support on peer site is off, DCTCP falls back after 3WHS to operate in normal TCP Reno mode. ss {-4,-6} -t -i diag interface: ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054 ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584 send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15 reordering:101 rcv_space:29200 ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448 cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate 325.5Mbps rcv_rtt:1.5 rcv_space:29200 More information about DCTCP can be found in [1-4]. [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Joint work with Florian Westphal and Glenn Judd. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Glenn Judd <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: tcp: more detailed ACK events and events for CE marked packetsFlorian Westphal3-5/+30
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information and ACK properties, e.g. ACK that updates window is treated differently than DUPACK. Also DCTCP needs information whether ACK was delayed ACK. Furthermore, DCTCP also implements a CE state machine that keeps track of CE markings of incoming packets. Therefore, extend the congestion control framework to provide these event types, so that DCTCP can be properly implemented as a normal congestion algorithm module outside of the core stack. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Glenn Judd <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: tcp: split ack slow/fast events from cwnd_eventFlorian Westphal3-16/+32
The congestion control ops "cwnd_event" currently supports CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others). Both FAST and SLOW_ACK are only used by Westwood congestion control algorithm. This removes both flags from cwnd_event and adds a new in_ack_event callback for this. The goal is to be able to provide more detailed information about ACKs, such as whether ECE flag was set, or whether the ACK resulted in a window update. It is required for DataCenter TCP (DCTCP) congestion control algorithm as it makes a different choice depending on ECE being set or not. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Glenn Judd <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: tcp: add flag for ca to indicate that ECN is requiredDaniel Borkmann3-25/+63
This patch adds a flag to TCP congestion algorithms that allows for requesting to mark IPv4/IPv6 sockets with transport as ECN capable, that is, ECT(0), when required by a congestion algorithm. It is currently used and needed in DataCenter TCP (DCTCP), as it requires both peers to assert ECT on all IP packets sent - it uses ECN feedback (i.e. CE, Congestion Encountered information) from switches inside the data center to derive feedback to the end hosts. Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's algorithm/behaviour slightly diverges from RFC3168, therefore this is only (!) enabled iff the assigned congestion control ops module has requested this. By that, we can tightly couple this logic really only to the provided congestion control ops. Joint work with Florian Westphal and Glenn Judd. Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Glenn Judd <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: tcp: assign tcp cong_ops when tcp sk is createdFlorian Westphal4-32/+27
Split assignment and initialization from one into two functions. This is required by followup patches that add Datacenter TCP (DCTCP) congestion control algorithm - we need to be able to determine if the connection is moderated by DCTCP before the 3WHS has finished. As we walk the available congestion control list during the assignment, we are always guaranteed to have Reno present as it's fixed compiled-in. Therefore, since we're doing the early assignment, we don't have a real use for the Reno alias tcp_init_congestion_ops anymore and can thus remove it. Actual usage of the congestion control operations are being made after the 3WHS has finished, in some cases however we can access get_info() via diag if implemented, therefore we need to zero out the private area for those modules. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Glenn Judd <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29net: sched: cls_rcvp, complete rcu conversionJohn Fastabend1-3/+41
This completes the cls_rsvp conversion to RCU safe copy, update semantics. As a result all cases of tcf_exts_change occur on empty lists now. Signed-off-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-29dql: dql_queued() should write first to reduce bus transactionsEric Dumazet1-2/+10
While doing high throughput test on a BQL enabled NIC, I found a very high cost in ndo_start_xmit() when accessing BQL data. It turned out the problem was caused by compiler trying to be smart, but involving a bad MESI transaction : 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I got an incredible 10 % gain [1] by making sure cpu do not attempt to get the cache line in Shared mode, but directly requests for ownership. New code : mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count add %edx,0xc0(%rax) // RMW dql->num_queued += count mov 0xc4(%rax),%ecx // LOAD dql->adj_limit mov 0xc0(%rax),%edx // LOAD dql->num_queued cmp %edx,%ecx The TX completion was running from another cpu, with high interrupts rate. Note that I am using barrier() as a soft hint, as mb() here could be too heavy cost. [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net_sched: fix another regression in cls_tcindexWANG Cong1-1/+1
Clearly the following change is not expected: - if (!cp.perfect && !cp.h) - cp.alloc_hash = cp.hash; + if (!cp->perfect && cp->h) + cp->alloc_hash = cp->hash; Fixes: commit 331b72922c5f58d48fd ("net: sched: RCU cls_tcindex") Cc: John Fastabend <[email protected]> Signed-off-by: Cong Wang <[email protected]> Acked-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net_sched: fix errno in tcindex_set_parms()WANG Cong1-3/+2
When kmemdup() fails, we should return -ENOMEM. Cc: John Fastabend <[email protected]> Signed-off-by: Cong Wang <[email protected]> Acked-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'cxgb4-next'David S. Miller6-96/+274
Hariprasad Shenai says: ==================== cxgb4: Use new BAR2 GTS for T5, adds adaptive rx and few Device ID's This patch series adds support to use new BAR2 GTS for T5 adapter. Adds support for adaptive rx. Remove redundant variable from a macro of cxgb4vf driver. Adds Device ID for new adapters. The patches series is created against 'net-next' tree. And includes patches on cxgb4 and cxgb4vf driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28cxgb4: Add support for adaptive rxHariprasad Shenai5-2/+49
Based on original work by Kumar Sanghvi <[email protected]> Signed-off-by: Hariprasad Shenai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28cxgb4/cxgb4vf: Add Devicde ID for two more adapterHariprasad Shenai2-0/+6
Signed-off-by: Hariprasad Shenai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28cxgb4vf: Remove superfluous "idx" parameter of CH_DEVICE() macro.Hariprasad Shenai1-52/+52
Remove redundant idx parameter of CH_DEVICE() macro, its always zero. Signed-off-by: Hariprasad Shenai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28cxgb4: Use BAR2 Going To Sleep (GTS) for T5 and later.Hariprasad Shenai3-42/+167
Use BAR2 GTS for T5. If we are on T4 use the old doorbell mechanism; otherwise ue the new BAR2 mechanism. Use BAR2 doorbells for refilling FL's. Based on original work by Casey Leedom <[email protected]> Signed-off-by: Hariprasad Shenai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28arp: Do not perturb drop profiles with ignored ARP packetsRick Jones1-1/+5
We do not wish to disturb dropwatch or perf drop profiles with an ARP we will ignore. Signed-off-by: Rick Jones <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Linux 3.17-rc7Linus Torvalds1-1/+1
2014-09-28net_sched: remove the first parameter from tcf_exts_destroy()WANG Cong11-21/+21
Cc: Jamal Hadi Salim <[email protected]> Signed-off-by: Cong Wang <[email protected]> Acked-by: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28mlx4: exploit skb->xmit_more to conditionally send doorbellEric Dumazet1-7/+16
skb->xmit_more tells us if another skb is coming next. We need to send doorbell when : xmit_more is not set, or txqueue is stopped (preventing next skb to come immediately) Tested with a modified pktgen version, I got a 40% increase of throughput. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'r8152'David S. Miller1-37/+209
Hayes Wang says: ==================== r8152: support setting eee by ethtool Modify some definitions about EEE, and add the support of setting the EEE through ethtool. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28r8152: support ethtool eeehayeswang1-0/+128
Support get_eee() and set_eee() of ethtool_ops. Signed-off-by: Hayes Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28r8152: add functions to set EEEhayeswang1-25/+76
Add functions to enable EEE and set EEE advertisement. Signed-off-by: Hayes Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28r8152: change the EEE definitionhayeswang1-21/+14
Replace the EEE definitions with the ones which is declared in "mdio.h". Chage some definitions to make them readable. Signed-off-by: Hayes Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'defxx-next'David S. Miller2-32/+44
Maciej W. Rozycki says: ==================== defxx: DEFEA fixes and updates I have finally got my hands on an EISA variation of the board (DEC FDDIcontroller/EISA aka DEFEA) and was able to do some testing. Here are initial updates to the driver that address problems I encountered so far. More to come later on as I get back to the system that I have in a remote location -- I need to double-check MMIO support and see what might have been causing spurious interrupts I saw with the 8259A PIC the board's interrupt line has been routed to. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28defxx: DEFEA's ESIC port I/O decoding cleanupMaciej W. Rozycki2-20/+32
Use the slot-specific I/O range for decoding accesses to PDQ ASIC registers (IOCS0) and the discrete Burst Holdoff register (IOCS1) as per the "HD64981F EISA Slave Interface Controller (ESIC)" datasheet. Use disjoint decode ranges now that the assignment of chip selects is known. Update the span of the port I/O resource requested accordingly. Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28defxx: DEFEA's Burst Holdoff register initialization fixMaciej W. Rozycki1-2/+2
Use the mask rather than bit number macro to initialize the chip select control bit for PDQ register space decoding in the Burst Holdoff register. Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28defxx: Correct DEFEA's ESIC port I/O accessesMaciej W. Rozycki1-15/+15
Reverse the order of arguments to `outb', data to write comes first. Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'master' of ↵David S. Miller10-38/+303
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2014-09-25 1) Remove useless hash_resize_mutex in xfrm_hash_resize(). This mutex is used only there, but xfrm_hash_resize() can't be called concurrently at all. From Ying Xue. 2) Extend policy hashing to prefixed policies based on prefix lenght thresholds. From Christophe Gouault. 3) Make the policy hash table thresholds configurable via netlink. From Christophe Gouault. 4) Remove the maximum authentication length for AH. This was needed to limit stack usage. We switched already to allocate space, so no need to keep the limit. From Herbert Xu. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28neigh: check error pointer instead of NULL for ipv4_neigh_lookup()WANG Cong1-1/+1
Fixes: commit f187bc6efb7250afee0e2009b6106 ("ipv4: No need to set generic neighbour pointer") Cc: David S. Miller <[email protected]> Signed-off-by: Cong Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'dsa_eee'David S. Miller5-18/+203
Florian Fainelli says: ==================== net: dsa: EEE and other PM features This patch set allows DSA switch drivers to enable/disable/query EEE on a per-port level, as well as control precisely which switch ports are enable/disabled. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: bcm_sf2: add support for controlling EEEFlorian Fainelli3-0/+79
When EEE is enabled, negotiate this feature with the PHY and make sure that the capability checking, local EEE advertisement, link partner EEE advertisement and auto-negotiation resolution returned by phy_init_eee() is positive, and enable EEE at the switch level. While querying the current EEE settings, verify the low-power indication and indicate its status. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: allow switches driver to implement get/set EEEFlorian Fainelli2-0/+49
Allow switches driver to query and enable/disable EEE on a per-port basis by implementing the ethtool_{get,set}_eee settings and delegating these operations to the switch driver. set_eee() will need to coordinate with the PHY driver to make sure that EEE is enabled, the link-partner supports it and the auto-negotiation result is satisfactory. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: bcm_sf2: add port_enable/disable callbacksFlorian Fainelli1-20/+40
The SF2 switch driver is already architected around per-port enable/disable callbacks, so we just need a slight update to our existing bcm_sf2_port_setup() resp. bcm_sf2_port_disable() functions to be suitable as callbacks for port_enable/port_disable. We need to shuffle a little the code that does the per-port VLAN configuration/isolation since ports can now be brought up/down separately, so we need to make sure that IMP (CPU, management) port is always included in that specific port setup. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: bcm_sf2: disable RGMII interface(s) when link is downFlorian Fainelli1-0/+9
When the link is down, disable the RGMII interface to conserve as much power as possible. We re-enable the RGMII interface whenever the link is detected. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: allow enabling and disable switch portsFlorian Fainelli2-0/+22
Whenever a per-port network device is used/unused, invoke the switch driver port_enable/port_disable callbacks to allow saving as much power as possible by disabling unused parts of the switch (RX/TX logic, memory arrays, PHYs...). We supply a PHY device argument to make sure the switch driver can act on the PHY device if needed (like putting/taking the PHY out of deep low power mode). Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28net: dsa: start and stop the PHY state machineFlorian Fainelli1-0/+6
dsa_slave_open() should start the PHY library state machine for its PHY interface, and dsa_slave_close() should stop the PHY library state machine accordingly. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds1-0/+5
Pull slave-dmaengine fixes from Vinod Koul: "Two small fixes for omap dmaengine driver which fixes cyclic suspend and resume" * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: omap-dma: Restore the CLINK_CTRL in resume path dmaengine: omap-dma: Add memory barrier to dma_resume path
2014-09-28tcp: use tcp_flags in tcp_data_queue()Peter Pan(潘卫平)1-3/+2
This patch is a cleanup which follows the idea in commit e11ecddf5128 (tcp: use TCP_SKB_CB(skb)->tcp_flags in input path), and it may reduce register pressure since skb->cb[] access is fast, bacause skb is probably in a register. v2: remove variable th v3: reword the changelog Signed-off-by: Weiping Pan <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28tcp: change tcp_skb_pcount() locationEric Dumazet4-11/+33
Our goal is to access no more than one cache line access per skb in a write or receive queue when doing the various walks. After recent TCP_SKB_CB() reorganizations, it is almost done. Last part is tcp_skb_pcount() which currently uses skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs 3 cache lines in current kernel (skb->head, skb->end, and shinfo->gso_segs are all in 3 different cache lines, far from skb->cb) This very simple patch reuses space currently taken by tcp_tw_isn only in input path, as tcp_skb_pcount is only needed for skb stored in write queue. This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible. This also speeds up all sack processing in general. This speeds up tcp_sendmsg() because it no longer has to access/dirty shinfo. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'tcp_skb_cb'David S. Miller11-34/+64
Eric Dumazet says: ==================== tcp: better TCP_SKB_CB layout TCP had the assumption that IPCB and IP6CB are first members of skb->cb[] This is fine, except that IPCB/IP6CB are used in TCP for a very short time in input path. What really matters for TCP stack is to get skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq in the same cache line. skb that are immediately consumed do not care because whole skb->cb[] is hot in cpu cache, while skb that sit in wocket write queue or receive queues do not need TCP_SKB_CB(skb)->header at all. This patch set implements the prereq for IPv4, IPv6, and TCP to make this possible. This makes TCP more efficient. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28tcp: better TCP_SKB_CB layout to reduce cache line missesEric Dumazet4-13/+30
TCP maintains lists of skb in write queue, and in receive queues (in order and out of order queues) Scanning these lists both in input and output path usually requires access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq These fields are currently in two different cache lines, meaning we waste lot of memory bandwidth when these queues are big and flows have either packet drops or packet reorders. We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because this header is not used in fast path. This allows TCP to search much faster in the skb lists. Even with regular flows, we save one cache line miss in fast path. Thanks to Christoph Paasch for noticing we need to cleanup skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path, and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options(). Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted()Eric Dumazet5-7/+9
ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm that it needs. Lets not assume this, as TCP stack might use a different place. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28ipv4: rename ip_options_echo to __ip_options_echo()Eric Dumazet4-14/+25
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt Lets break this assumption, but provide a helper to not change all call points. ip_send_unicast_reply() gets a new struct ip_options pointer. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-09-28Merge branch 'ipv6_tunnel'David S. Miller3-2/+12
Steffen Klassert says: ==================== ipv6: Return an error when adding an already existing tunnel The ipv6 tunnel locate functions should not return an existing tunnel if create is true. Otherwise it is possible to add the same tunnel multiple times without getting an error. All our ipv6 tunnels have this bug from the very beginning. Only the sit tunnel was fixed some years ago with: commit 8db99e57175 ("sit: Fail to create tunnel, if it already exists"). This patchset fixes the remaining ipv6 tunnels. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-09-28ip6_gre: Return an error when adding an existing tunnel.Steffen Klassert1-0/+2
ip6gre_tunnel_locate() should not return an existing tunnel if create is true. Otherwise it is possible to add the same tunnel multiple times without getting an error. So return NULL if the tunnel that should be created already exists. Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>