From da50d57abd7ecaba151600e726ccb944e7ddf81a Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 28 Apr 2020 00:01:16 +0200
Subject: docs: networking: convert caif files to ReST

There are two text files for caif, plus one already converted
file.

Convert the two remaining ones to ReST, create a new index.rst
file for CAIF, adding it to the main networking documentation
index.

Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: David S. Miller
---
 Documentation/networking/caif/Linux-CAIF.txt  | 175 --------------------
 Documentation/networking/caif/caif.rst        |   2 -
 Documentation/networking/caif/index.rst       |  13 ++
 Documentation/networking/caif/linux_caif.rst  | 195 ++++++++++++++++++++++
 Documentation/networking/caif/spi_porting.rst | 229 ++++++++++++++++++++++++++
 Documentation/networking/caif/spi_porting.txt | 208 -----------------------
 Documentation/networking/index.rst            |   1 +
 drivers/net/caif/Kconfig                      |   2 +-
 8 files changed, 439 insertions(+), 386 deletions(-)
 delete mode 100644 Documentation/networking/caif/Linux-CAIF.txt
 create mode 100644 Documentation/networking/caif/index.rst
 create mode 100644 Documentation/networking/caif/linux_caif.rst
 create mode 100644 Documentation/networking/caif/spi_porting.rst
 delete mode 100644 Documentation/networking/caif/spi_porting.txt

diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt
deleted file mode 100644
index 0aa4bd381bec..000000000000
--- a/Documentation/networking/caif/Linux-CAIF.txt
+++ /dev/null
@@ -1,175 +0,0 @@
-Linux CAIF
-===========
-copyright (C) ST-Ericsson AB 2010
-Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
-License terms: GNU General Public License (GPL) version 2
-
-
-Introduction
-------------
-CAIF is a MUX protocol used by ST-Ericsson cellular modems for
-communication between Modem and host. The host processes can open virtual AT
-channels, initiate GPRS Data connections, Video channels and Utility Channels.
-The Utility Channels are general purpose pipes between modem and host. - -ST-Ericsson modems support a number of transports between modem -and host. Currently, UART and Loopback are available for Linux. - - -Architecture: ------------- -The implementation of CAIF is divided into: -* CAIF Socket Layer and GPRS IP Interface. -* CAIF Core Protocol Implementation -* CAIF Link Layer, implemented as NET devices. - - - RTNL - ! - ! +------+ +------+ - ! +------+! +------+! - ! ! IP !! !Socket!! - +-------> !interf!+ ! API !+ <- CAIF Client APIs - ! +------+ +------! - ! ! ! - ! +-----------+ - ! ! - ! +------+ <- CAIF Core Protocol - ! ! CAIF ! - ! ! Core ! - ! +------+ - ! +----------!---------+ - ! ! ! ! - ! +------+ +-----+ +------+ - +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) - +------+ +-----+ +------+ - - - -I M P L E M E N T A T I O N -=========================== - - -CAIF Core Protocol Layer -========================================= - -CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. -It implements the CAIF protocol stack in a layered approach, where -each layer described in the specification is implemented as a separate layer. -The architecture is inspired by the design patterns "Protocol Layer" and -"Protocol Packet". - -== CAIF structure == -The Core CAIF implementation contains: - - Simple implementation of CAIF. - - Layered architecture (a la Streams), each layer in the CAIF - specification is implemented in a separate c-file. - - Clients must call configuration function to add PHY layer. - - Clients must implement CAIF layer to consume/produce - CAIF payload with receive and transmit functions. - - Clients must call configuration function to add and connect the - Client layer. 
- - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed - to the called function (except for framing layers' receive function) - -Layered Architecture --------------------- -The CAIF protocol can be divided into two parts: Support functions and Protocol -Implementation. The support functions include: - - - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The - CAIF Packet has functions for creating, destroying and adding content - and for adding/extracting header and trailers to protocol packets. - -The CAIF Protocol implementation contains: - - - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol - Stack and provides a Client interface for adding Link-Layer and - Driver interfaces on top of the CAIF Stack. - - - CFCTRL CAIF Control layer. Encodes and Decodes control messages - such as enumeration and channel setup. Also matches request and - response messages. - - - CFSERVL General CAIF Service Layer functionality; handles flow - control and remote shutdown requests. - - - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual - External Interface). This layer encodes/decodes VEI frames. - - - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP - traffic), encodes/decodes Datagram frames. - - - CFMUX CAIF Mux layer. Handles multiplexing between multiple - physical bearers and multiple channels such as VEI, Datagram, etc. - The MUX keeps track of the existing CAIF Channels and - Physical Instances and selects the appropriate instance based - on Channel-Id and Physical-ID. - - - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length - and frame checksum. - - - CFSERL CAIF Serial layer. Handles concatenation/split of frames - into CAIF Frames with correct length. - - - - +---------+ - | Config | - | CFCNFG | - +---------+ - ! - +---------+ +---------+ +---------+ - | AT | | Control | | Datagram| - | CFVEIL | | CFCTRL | | CFDGML | - +---------+ +---------+ +---------+ - \_____________!______________/ - ! 
- +---------+ - | MUX | - | | - +---------+ - _____!_____ - / \ - +---------+ +---------+ - | CFFRML | | CFFRML | - | Framing | | Framing | - +---------+ +---------+ - ! ! - +---------+ +---------+ - | | | Serial | - | | | CFSERL | - +---------+ +---------+ - - -In this layered approach the following "rules" apply. - - All layers embed the same structure "struct cflayer" - - A layer does not depend on any other layer's private data. - - Layers are stacked by setting the pointers - layer->up , layer->dn - - In order to send data upwards, each layer should do - layer->up->receive(layer->up, packet); - - In order to send data downwards, each layer should do - layer->dn->transmit(layer->dn, packet); - - -CAIF Socket and IP interface -=========================== - -The IP interface and CAIF socket API are implemented on top of the -CAIF Core protocol. The IP Interface and CAIF socket have an instance of -'struct cflayer', just like the CAIF Core protocol stack. -Net device and Socket implement the 'receive()' function defined by -'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and -receive of packets is handled as by the rest of the layers: the 'dn->transmit()' -function is called in order to transmit data. - -Configuration of Link Layer ---------------------------- -The Link Layer is implemented as Linux network devices (struct net_device). -Payload handling and registration is done using standard Linux mechanisms. - -The CAIF Protocol relies on a loss-less link layer without implementing -retransmission. This implies that packet drops must not happen. -Therefore a flow-control mechanism is implemented where the physical -interface can initiate flow stop for all CAIF Channels. diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst index 07afc8063d4d..a07213030ccf 100644 --- a/Documentation/networking/caif/caif.rst +++ b/Documentation/networking/caif/caif.rst @@ -1,5 +1,3 @@ -:orphan: - .. 
SPDX-License-Identifier: GPL-2.0 .. include:: diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst new file mode 100644 index 000000000000..86e5b7832ec3 --- /dev/null +++ b/Documentation/networking/caif/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +CAIF +==== + +Contents: + +.. toctree:: + :maxdepth: 2 + + linux_caif + caif + spi_porting diff --git a/Documentation/networking/caif/linux_caif.rst b/Documentation/networking/caif/linux_caif.rst new file mode 100644 index 000000000000..a0480862ab8c --- /dev/null +++ b/Documentation/networking/caif/linux_caif.rst @@ -0,0 +1,195 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========== +Linux CAIF +========== + +Copyright |copy| ST-Ericsson AB 2010 + +:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +:License terms: GNU General Public License (GPL) version 2 + + +Introduction +============ + +CAIF is a MUX protocol used by ST-Ericsson cellular modems for +communication between Modem and host. The host processes can open virtual AT +channels, initiate GPRS Data connections, Video channels and Utility Channels. +The Utility Channels are general purpose pipes between modem and host. + +ST-Ericsson modems support a number of transports between modem +and host. Currently, UART and Loopback are available for Linux. + + +Architecture +============ + +The implementation of CAIF is divided into: + +* CAIF Socket Layer and GPRS IP Interface. +* CAIF Core Protocol Implementation +* CAIF Link Layer, implemented as NET devices. + +:: + + RTNL + ! + ! +------+ +------+ + ! +------+! +------+! + ! ! IP !! !Socket!! + +-------> !interf!+ ! API !+ <- CAIF Client APIs + ! +------+ +------! + ! ! ! + ! +-----------+ + ! ! + ! +------+ <- CAIF Core Protocol + ! ! CAIF ! + ! ! Core ! + ! +------+ + ! +----------!---------+ + ! ! ! ! + ! +------+ +-----+ +------+ + +--> ! HSI ! ! TTY ! ! USB ! 
<- Link Layer (Net Devices) + +------+ +-----+ +------+ + + + +Implementation +============== + + +CAIF Core Protocol Layer +------------------------ + +CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. +It implements the CAIF protocol stack in a layered approach, where +each layer described in the specification is implemented as a separate layer. +The architecture is inspired by the design patterns "Protocol Layer" and +"Protocol Packet". + +CAIF structure +^^^^^^^^^^^^^^ + +The Core CAIF implementation contains: + + - Simple implementation of CAIF. + - Layered architecture (a la Streams), each layer in the CAIF + specification is implemented in a separate c-file. + - Clients must call configuration function to add PHY layer. + - Clients must implement CAIF layer to consume/produce + CAIF payload with receive and transmit functions. + - Clients must call configuration function to add and connect the + Client layer. + - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed + to the called function (except for framing layers' receive function) + +Layered Architecture +==================== + +The CAIF protocol can be divided into two parts: Support functions and Protocol +Implementation. The support functions include: + + - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The + CAIF Packet has functions for creating, destroying and adding content + and for adding/extracting header and trailers to protocol packets. + +The CAIF Protocol implementation contains: + + - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol + Stack and provides a Client interface for adding Link-Layer and + Driver interfaces on top of the CAIF Stack. + + - CFCTRL CAIF Control layer. Encodes and Decodes control messages + such as enumeration and channel setup. Also matches request and + response messages. + + - CFSERVL General CAIF Service Layer functionality; handles flow + control and remote shutdown requests. 
+ + - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual + External Interface). This layer encodes/decodes VEI frames. + + - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP + traffic), encodes/decodes Datagram frames. + + - CFMUX CAIF Mux layer. Handles multiplexing between multiple + physical bearers and multiple channels such as VEI, Datagram, etc. + The MUX keeps track of the existing CAIF Channels and + Physical Instances and selects the appropriate instance based + on Channel-Id and Physical-ID. + + - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length + and frame checksum. + + - CFSERL CAIF Serial layer. Handles concatenation/split of frames + into CAIF Frames with correct length. + +:: + + +---------+ + | Config | + | CFCNFG | + +---------+ + ! + +---------+ +---------+ +---------+ + | AT | | Control | | Datagram| + | CFVEIL | | CFCTRL | | CFDGML | + +---------+ +---------+ +---------+ + \_____________!______________/ + ! + +---------+ + | MUX | + | | + +---------+ + _____!_____ + / \ + +---------+ +---------+ + | CFFRML | | CFFRML | + | Framing | | Framing | + +---------+ +---------+ + ! ! + +---------+ +---------+ + | | | Serial | + | | | CFSERL | + +---------+ +---------+ + + +In this layered approach the following "rules" apply. + + - All layers embed the same structure "struct cflayer" + - A layer does not depend on any other layer's private data. + - Layers are stacked by setting the pointers:: + + layer->up , layer->dn + + - In order to send data upwards, each layer should do:: + + layer->up->receive(layer->up, packet); + + - In order to send data downwards, each layer should do:: + + layer->dn->transmit(layer->dn, packet); + + +CAIF Socket and IP interface +============================ + +The IP interface and CAIF socket API are implemented on top of the +CAIF Core protocol. The IP Interface and CAIF socket have an instance of +'struct cflayer', just like the CAIF Core protocol stack. 
+Net device and Socket implement the 'receive()' function defined by
+'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and
+receive of packets is handled as by the rest of the layers: the 'dn->transmit()'
+function is called in order to transmit data.
+
+Configuration of Link Layer
+---------------------------
+The Link Layer is implemented as Linux network devices (struct net_device).
+Payload handling and registration is done using standard Linux mechanisms.
+
+The CAIF Protocol relies on a loss-less link layer without implementing
+retransmission. This implies that packet drops must not happen.
+Therefore a flow-control mechanism is implemented where the physical
+interface can initiate flow stop for all CAIF Channels.
diff --git a/Documentation/networking/caif/spi_porting.rst b/Documentation/networking/caif/spi_porting.rst
new file mode 100644
index 000000000000..d49f874b20ac
--- /dev/null
+++ b/Documentation/networking/caif/spi_porting.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+CAIF SPI porting
+================
+
+CAIF SPI basics
+===============
+
+Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
+Two extra GPIOs have been added in order to negotiate the transfers
+between the master and the slave. The minimum requirement for running
+CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
+Please note that running as a slave implies that you need to keep up
+with the master clock. An overrun or underrun event is fatal.
+
+CAIF SPI framework
+==================
+
+To make porting as easy as possible, the CAIF SPI has been divided in
+two parts. The first part (called the interface part) deals with all
+generic functionality such as length framing, SPI frame negotiation
+and SPI frame delivery and transmission. The other part is the CAIF
+SPI slave device part, which is the module that you have to write if
+you want to run SPI CAIF on a new hardware. This part takes care of
+the physical hardware, both with regard to SPI and to GPIOs.
+
+- Implementing a CAIF SPI device:
+
+  - Functionality provided by the CAIF SPI slave device:
+
+    In order to implement a SPI device you will, as a minimum,
+    need to implement the following
+    functions:
+
+    ::
+
+        int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
+
+    This function is called by the CAIF SPI interface to give
+    you a chance to set up your hardware to be ready to receive
+    a stream of data from the master. The xfer structure contains
+    both physical and logical addresses, as well as the total length
+    of the transfer in both directions.The dev parameter can be used
+    to map to different CAIF SPI slave devices.
+
+    ::
+
+        void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
+
+    This function is called by the CAIF SPI interface when the output
+    (SPI_INT) GPIO needs to change state. The boolean value of the xfer
+    variable indicates whether the GPIO should be asserted (HIGH) or
+    deasserted (LOW). The dev parameter can be used to map to different CAIF
+    SPI slave devices.
+
+  - Functionality provided by the CAIF SPI interface:
+
+    ::
+
+        void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
+
+    This function is called by the CAIF SPI slave device in order to
+    signal a change of state of the input GPIO (SS) to the interface.
+    Only active edges are mandatory to be reported.
+    This function can be called from IRQ context (recommended in order
+    not to introduce latency). The ifc parameter should be the pointer
+    returned from the platform probe function in the SPI device structure.
+
+    ::
+
+        void (*xfer_done_cb) (struct cfspi_ifc *ifc);
+
+    This function is called by the CAIF SPI slave device in order to
+    report that a transfer is completed. This function should only be
+    called once both the transmission and the reception are completed.
+    This function can be called from IRQ context (recommended in order
+    not to introduce latency). The ifc parameter should be the pointer
+    returned from the platform probe function in the SPI device structure.
+
+  - Connecting the bits and pieces:
+
+    - Filling in the SPI slave device structure:
+
+      Connect the necessary callback functions.
+
+      Indicate clock speed (used to calculate toggle delays).
+
+      Chose a suitable name (helps debugging if you use several CAIF
+      SPI slave devices).
+
+      Assign your private data (can be used to map to your
+      structure).
+
+    - Filling in the SPI slave platform device structure:
+
+      Add name of driver to connect to ("cfspi_sspi").
+
+      Assign the SPI slave device structure as platform data.
+
+Padding
+=======
+
+In order to optimize throughput, a number of SPI padding options are provided.
+Padding can be enabled independently for uplink and downlink transfers.
+Padding can be enabled for the head, the tail and for the total frame size.
+The padding needs to be correctly configured on both sides of the link.
+The padding can be changed via module parameters in cfspi_sspi.c or via
+the sysfs directory of the cfspi_sspi driver (before device registration).
+
+- CAIF SPI device template::
+
+    /*
+     * Copyright (C) ST-Ericsson AB 2010
+     * Author: Daniel Martensson / Daniel.Martensson@stericsson.com
+     * License terms: GNU General Public License (GPL), version 2.
+     *
+     */
+
+    #include
+    #include
+    #include
+    #include
+    #include
+    #include
+    #include
+
+    MODULE_LICENSE("GPL");
+
+    struct sspi_struct {
+	    struct cfspi_dev sdev;
+	    struct cfspi_xfer *xfer;
+    };
+
+    static struct sspi_struct slave;
+    static struct platform_device slave_device;
+
+    static irqreturn_t sspi_irq(int irq, void *arg)
+    {
+	    /* You only need to trigger on an edge to the active state of the
+	     * SS signal. Once a edge is detected, the ss_cb() function should be
+	     * called with the parameter assert set to true. It is OK
+	     * (and even advised) to call the ss_cb() function in IRQ context in
+	     * order not to add any delay. */
+
+	    return IRQ_HANDLED;
+    }
+
+    static void sspi_complete(void *context)
+    {
+	    /* Normally the DMA or the SPI framework will call you back
+	     * in something similar to this. The only thing you need to
+	     * do is to call the xfer_done_cb() function, providing the pointer
+	     * to the CAIF SPI interface. It is OK to call this function
+	     * from IRQ context. */
+    }
+
+    static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
+    {
+	    /* Store transfer info. For a normal implementation you should
+	     * set up your DMA here and make sure that you are ready to
+	     * receive the data from the master SPI. */
+
+	    struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
+
+	    sspi->xfer = xfer;
+
+	    return 0;
+    }
+
+    void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
+    {
+	    /* If xfer is true then you should assert the SPI_INT to indicate to
+	     * the master that you are ready to receive the data from the master
+	     * SPI. If xfer is false then you should de-assert SPI_INT to indicate
+	     * that the transfer is done.
+	     */
+
+	    struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
+    }
+
+    static void sspi_release(struct device *dev)
+    {
+	    /*
+	     * Here you should release your SPI device resources.
+	     */
+    }
+
+    static int __init sspi_init(void)
+    {
+	    /* Here you should initialize your SPI device by providing the
+	     * necessary functions, clock speed, name and private data. Once
+	     * done, you can register your device with the
+	     * platform_device_register() function. This function will return
+	     * with the CAIF SPI interface initialized. This is probably also
+	     * the place where you should set up your GPIOs, interrupts and SPI
+	     * resources. */
+
+	    int res = 0;
+
+	    /* Initialize slave device. */
+	    slave.sdev.init_xfer = sspi_init_xfer;
+	    slave.sdev.sig_xfer = sspi_sig_xfer;
+	    slave.sdev.clk_mhz = 13;
+	    slave.sdev.priv = &slave;
+	    slave.sdev.name = "spi_sspi";
+	    slave_device.dev.release = sspi_release;
+
+	    /* Initialize platform device. */
+	    slave_device.name = "cfspi_sspi";
+	    slave_device.dev.platform_data = &slave.sdev;
+
+	    /* Register platform device. */
+	    res = platform_device_register(&slave_device);
+	    if (res) {
+		    printk(KERN_WARNING "sspi_init: failed to register dev.\n");
+		    return -ENODEV;
+	    }
+
+	    return res;
+    }
+
+    static void __exit sspi_exit(void)
+    {
+	    platform_device_del(&slave_device);
+    }
+
+    module_init(sspi_init);
+    module_exit(sspi_exit);
diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt
deleted file mode 100644
index 9efd0687dc4c..000000000000
--- a/Documentation/networking/caif/spi_porting.txt
+++ /dev/null
@@ -1,208 +0,0 @@
-- CAIF SPI porting -
-
-- CAIF SPI basics:
-
-Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
-Two extra GPIOs have been added in order to negotiate the transfers
- between the master and the slave. The minimum requirement for running
-CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
-Please note that running as a slave implies that you need to keep up
-with the master clock. An overrun or underrun event is fatal.
-
-- CAIF SPI framework:
-
-To make porting as easy as possible, the CAIF SPI has been divided in
-two parts. The first part (called the interface part) deals with all
-generic functionality such as length framing, SPI frame negotiation
-and SPI frame delivery and transmission. The other part is the CAIF
-SPI slave device part, which is the module that you have to write if
-you want to run SPI CAIF on a new hardware. This part takes care of
-the physical hardware, both with regard to SPI and to GPIOs.
-
-- Implementing a CAIF SPI device:
-
-  - Functionality provided by the CAIF SPI slave device:
-
-  In order to implement a SPI device you will, as a minimum,
-  need to implement the following
-  functions:
-
-  int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
-
-  This function is called by the CAIF SPI interface to give
-  you a chance to set up your hardware to be ready to receive
-  a stream of data from the master. The xfer structure contains
-  both physical and logical addresses, as well as the total length
-  of the transfer in both directions.The dev parameter can be used
-  to map to different CAIF SPI slave devices.
-
-  void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
-
-  This function is called by the CAIF SPI interface when the output
-  (SPI_INT) GPIO needs to change state. The boolean value of the xfer
-  variable indicates whether the GPIO should be asserted (HIGH) or
-  deasserted (LOW). The dev parameter can be used to map to different CAIF
-  SPI slave devices.
-
-  - Functionality provided by the CAIF SPI interface:
-
-  void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
-
-  This function is called by the CAIF SPI slave device in order to
-  signal a change of state of the input GPIO (SS) to the interface.
-  Only active edges are mandatory to be reported.
-  This function can be called from IRQ context (recommended in order
-  not to introduce latency). The ifc parameter should be the pointer
-  returned from the platform probe function in the SPI device structure.
-
-  void (*xfer_done_cb) (struct cfspi_ifc *ifc);
-
-  This function is called by the CAIF SPI slave device in order to
-  report that a transfer is completed. This function should only be
-  called once both the transmission and the reception are completed.
-  This function can be called from IRQ context (recommended in order
-  not to introduce latency). The ifc parameter should be the pointer
-  returned from the platform probe function in the SPI device structure.
-
-  - Connecting the bits and pieces:
-
-    - Filling in the SPI slave device structure:
-
-      Connect the necessary callback functions.
-      Indicate clock speed (used to calculate toggle delays).
-      Chose a suitable name (helps debugging if you use several CAIF
-      SPI slave devices).
-      Assign your private data (can be used to map to your structure).
-
-    - Filling in the SPI slave platform device structure:
-      Add name of driver to connect to ("cfspi_sspi").
-      Assign the SPI slave device structure as platform data.
-
-- Padding:
-
-In order to optimize throughput, a number of SPI padding options are provided.
-Padding can be enabled independently for uplink and downlink transfers.
-Padding can be enabled for the head, the tail and for the total frame size.
-The padding needs to be correctly configured on both sides of the link.
-The padding can be changed via module parameters in cfspi_sspi.c or via
-the sysfs directory of the cfspi_sspi driver (before device registration).
-
-- CAIF SPI device template:
-
-/*
- * Copyright (C) ST-Ericsson AB 2010
- * Author: Daniel Martensson / Daniel.Martensson@stericsson.com
- * License terms: GNU General Public License (GPL), version 2.
- *
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-
-MODULE_LICENSE("GPL");
-
-struct sspi_struct {
-	struct cfspi_dev sdev;
-	struct cfspi_xfer *xfer;
-};
-
-static struct sspi_struct slave;
-static struct platform_device slave_device;
-
-static irqreturn_t sspi_irq(int irq, void *arg)
-{
-	/* You only need to trigger on an edge to the active state of the
-	 * SS signal. Once a edge is detected, the ss_cb() function should be
-	 * called with the parameter assert set to true. It is OK
-	 * (and even advised) to call the ss_cb() function in IRQ context in
-	 * order not to add any delay. */
-
-	return IRQ_HANDLED;
-}
-
-static void sspi_complete(void *context)
-{
-	/* Normally the DMA or the SPI framework will call you back
-	 * in something similar to this. The only thing you need to
-	 * do is to call the xfer_done_cb() function, providing the pointer
-	 * to the CAIF SPI interface. It is OK to call this function
-	 * from IRQ context. */
-}
-
-static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
-{
-	/* Store transfer info. For a normal implementation you should
-	 * set up your DMA here and make sure that you are ready to
-	 * receive the data from the master SPI. */
-
-	struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-
-	sspi->xfer = xfer;
-
-	return 0;
-}
-
-void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
-{
-	/* If xfer is true then you should assert the SPI_INT to indicate to
-	 * the master that you are ready to receive the data from the master
-	 * SPI. If xfer is false then you should de-assert SPI_INT to indicate
-	 * that the transfer is done.
-	 */
-
-	struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-}
-
-static void sspi_release(struct device *dev)
-{
-	/*
-	 * Here you should release your SPI device resources.
-	 */
-}
-
-static int __init sspi_init(void)
-{
-	/* Here you should initialize your SPI device by providing the
-	 * necessary functions, clock speed, name and private data. Once
-	 * done, you can register your device with the
-	 * platform_device_register() function. This function will return
-	 * with the CAIF SPI interface initialized. This is probably also
-	 * the place where you should set up your GPIOs, interrupts and SPI
-	 * resources. */
-
-	int res = 0;
-
-	/* Initialize slave device. */
-	slave.sdev.init_xfer = sspi_init_xfer;
-	slave.sdev.sig_xfer = sspi_sig_xfer;
-	slave.sdev.clk_mhz = 13;
-	slave.sdev.priv = &slave;
-	slave.sdev.name = "spi_sspi";
-	slave_device.dev.release = sspi_release;
-
-	/* Initialize platform device. */
-	slave_device.name = "cfspi_sspi";
-	slave_device.dev.platform_data = &slave.sdev;
-
-	/* Register platform device. */
-	res = platform_device_register(&slave_device);
-	if (res) {
-		printk(KERN_WARNING "sspi_init: failed to register dev.\n");
-		return -ENODEV;
-	}
-
-	return res;
-}
-
-static void __exit sspi_exit(void)
-{
-	platform_device_del(&slave_device);
-}
-
-module_init(sspi_init);
-module_exit(sspi_exit);
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 6538ede29661..5b3421ec25ec 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -15,6 +15,7 @@ Contents:
    device_drivers/index
    dsa/index
    devlink/index
+   caif/index
    ethtool-netlink
    ieee802154
    j1939
diff --git a/drivers/net/caif/Kconfig b/drivers/net/caif/Kconfig
index 661c25eb1c46..1538ad194cf4 100644
--- a/drivers/net/caif/Kconfig
+++ b/drivers/net/caif/Kconfig
@@ -28,7 +28,7 @@ config CAIF_SPI_SLAVE
 	  The CAIF Link layer SPI Protocol driver for Slave SPI interface.
 	  This driver implements a platform driver to accommodate for a
 	  platform specific SPI device. A sample CAIF SPI Platform device is
 	  provided in .
 
 config CAIF_SPI_SYNC
 	bool "Next command and length in start of frame"
-- 
cgit

From a434aaba17f56c0a25edff4104dd5f9d5b3ceba2 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 28 Apr 2020 00:01:17 +0200
Subject: docs: networking: convert 6pack.txt to ReST

- add SPDX header;
- use title markups;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: David S. Miller
---
 Documentation/networking/6pack.rst | 191 +++++++++++++++++++++++++++++++++++++
 Documentation/networking/6pack.txt | 175 ---------------------------------
 Documentation/networking/index.rst |   1 +
 drivers/net/hamradio/Kconfig       |   2 +-
 4 files changed, 193 insertions(+), 176 deletions(-)
 create mode 100644 Documentation/networking/6pack.rst
 delete mode 100644 Documentation/networking/6pack.txt

diff --git a/Documentation/networking/6pack.rst b/Documentation/networking/6pack.rst
new file mode 100644
index 000000000000..bc5bf1f1a98f
--- /dev/null
+++ b/Documentation/networking/6pack.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+6pack Protocol
+==============
+
+This is the 6pack-mini-HOWTO, written by
+
+Andreas Könsgen DG3KQ
+
+:Internet: ajk@comnets.uni-bremen.de
+:AMPR-net: dg3kq@db0pra.ampr.org
+:AX.25: dg3kq@db0ach.#nrw.deu.eu
+
+Last update: April 7, 1998
+
+1. What is 6pack, and what are the advantages to KISS?
+======================================================
+
+6pack is a transmission protocol for data exchange between the PC and
+the TNC over a serial line. It can be used as an alternative to KISS.
+
+6pack has two major advantages:
+
+- The PC is given full control over the radio
+  channel. Special control data is exchanged between the PC and the TNC so
+  that the PC knows at any time if the TNC is receiving data, if a TNC
+  buffer underrun or overrun has occurred, if the PTT is
+  set and so on. This control data is processed at a higher priority than
+  normal data, so a data stream can be interrupted at any time to issue an
+  important event. This helps to improve the channel access and timing
+  algorithms as everything is computed in the PC. It would even be possible
+  to experiment with something completely different from the known CSMA and
+  DAMA channel access methods.
+ This kind of real-time control is especially important to supply several + TNCs that are connected between each other and the PC by a daisy chain + (however, this feature is not supported yet by the Linux 6pack driver). + +- Each packet transferred over the serial line is supplied with a checksum, + so it is easy to detect errors due to problems on the serial line. + Received packets that are corrupt are not passed on to the AX.25 layer. + Damaged packets that the TNC has received from the PC are not transmitted. + +More details about 6pack are described in the file 6pack.ps that is located +in the doc directory of the AX.25 utilities package. + +2. Who has developed the 6pack protocol? +======================================== + +The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech +DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and +Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet. +They have also written a firmware for TNCs to perform the 6pack +protocol (see section 4 below). + +3. Where can I get the latest version of 6pack for Linux? +========================================================= + +At the moment, the 6pack stuff can be obtained via anonymous ftp from +db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq, +there is a file named 6pack.tgz. + +4. Preparing the TNC for 6pack operation +======================================== + +To be able to use 6pack, a special firmware for the TNC is needed. The EPROM +of a newly bought TNC does not contain 6pack, so you will have to +program an EPROM yourself. The image file for 6pack EPROMs should be +available on any packet radio box where PC/FlexNet can be found. The name of +the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet +team. It can be used under the terms of the license that comes along +with PC/FlexNet. Please do not ask me about the internals of this file as I +don't know anything about it.
I used a textual description of the 6pack +protocol to program the Linux driver. + +TNCs contain a 64kByte EPROM, the lower half of which is used for +the firmware/KISS. The upper half is either empty or is sometimes +programmed with software called TAPR. In the latter case, the TNC +is supplied with a DIP switch so you can easily change between the +two systems. When programming a new EPROM, one of the systems is replaced +by 6pack. It is useful to replace TAPR, as this software is rarely used +nowadays. If your TNC is not equipped with the switch mentioned above, you +can build in one yourself that switches over the highest address pin +of the EPROM between HIGH and LOW level. After having inserted the new EPROM +and switched to 6pack, apply power to the TNC for a first test. The connect +and the status LED are lit for about a second if the firmware initialises +the TNC correctly. + +5. Building and installing the 6pack driver +=========================================== + +The driver has been tested with kernel version 2.1.90. Use with older +kernels may lead to a compilation error because the interface to a kernel +function has been changed in the 2.1.8x kernels. + +How to turn on 6pack support: +----------------------------- + +- In the Linux kernel configuration program, select the code maturity level + options menu and turn on the prompting for development drivers. + +- Select the amateur radio support menu and turn on the serial port 6pack + driver. + +- Compile and install the kernel and the modules. + +To use the driver, the kissattach program delivered with the AX.25 utilities +has to be modified. + +- Do a cd to the directory that holds the kissattach sources. Edit the + kissattach.c file. At the top, insert the following lines:: + + #ifndef N_6PACK + #define N_6PACK (N_AX25+1) + #endif + + Then find the line:: + + int disc = N_AX25; + + and replace N_AX25 with N_6PACK. + +- Recompile kissattach. Rename it to spattach to avoid confusion.
+ +Installing the driver: +---------------------- + +- Do an insmod 6pack. Look at your /var/log/messages file to check if the + module has printed its initialization message. + +- Do a spattach as you would launch kissattach when starting a KISS port. + Check if the kernel prints the message '6pack: TNC found'. + +- From here, everything should work as if you were setting up a KISS port. + The only difference is that the network device that represents + the 6pack port is called sp instead of sl or ax. So, sp0 would be the + first 6pack port. + +Although the driver has been tested on various platforms, I still declare it +ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module +and spattaching. Watch out if your computer behaves strangely. Read section +6 of this file about known problems. + +Note that the connect and status LEDs of the TNC are controlled in a +different way than they are when the TNC is used with PC/FlexNet. When using +FlexNet, the connect LED is on if there is a connection; the status LED is +on if there is data in the buffer of the PC's AX.25 engine that has to be +transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer, +so the 6pack driver doesn't know anything about connects or data that +has not yet been transmitted. Therefore the LEDs are controlled +as they are in KISS mode: The connect LED is turned on if data is transferred +from the PC to the TNC over the serial line, the status LED if data is +sent to the PC. + +6. Known problems +================= + +When testing the driver with 2.0.3x kernels and +operating with data rates on the radio channel of 9600 Baud or higher, +the driver may, on certain systems, sometimes print the message '6pack: +bad checksum', which is due to data loss if the other station sends two +or more subsequent packets. I have been told that this is due to a problem +with the serial driver of 2.0.3x kernels. 
I don't know yet if the problem +still exists with 2.1.x kernels, as I have heard that the serial driver +code has been changed with 2.1.x. + +When shutting down the sp interface with ifconfig, the kernel crashes if +there is still an AX.25 connection left over which an IP connection was +running, even if that IP connection is already closed. The problem does not +occur when there is a bare AX.25 connection still running. I don't know if +this is a problem of the 6pack driver or something else in the kernel. + +The driver has been tested as a module, not yet as a kernel-builtin driver. + +The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is +connected to one serial port of the PC. This feature is not implemented +and at least at the moment I won't be able to do it because I do not have +the opportunity to build a TNC daisy-chain and test it. + +Some of the comments in the source code are inaccurate. They are left from +the SLIP/KISS driver, from which the 6pack driver has been derived. +I haven't modified or removed them yet -- sorry! The code itself needs +some cleaning and optimizing. This will be done in a later release. + +If you encounter a bug or if you have a question or suggestion concerning the +driver, feel free to mail me, using the addresses given at the beginning of +this file. + +Have fun! + +Andreas diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt deleted file mode 100644 index 8f339428fdf4..000000000000 --- a/Documentation/networking/6pack.txt +++ /dev/null @@ -1,175 +0,0 @@ -This is the 6pack-mini-HOWTO, written by - -Andreas Könsgen DG3KQ -Internet: ajk@comnets.uni-bremen.de -AMPR-net: dg3kq@db0pra.ampr.org -AX.25: dg3kq@db0ach.#nrw.deu.eu - -Last update: April 7, 1998 - -1. What is 6pack, and what are the advantages to KISS? - -6pack is a transmission protocol for data exchange between the PC and -the TNC over a serial line. It can be used as an alternative to KISS. 
- -6pack has two major advantages: -- The PC is given full control over the radio - channel. Special control data is exchanged between the PC and the TNC so - that the PC knows at any time if the TNC is receiving data, if a TNC - buffer underrun or overrun has occurred, if the PTT is - set and so on. This control data is processed at a higher priority than - normal data, so a data stream can be interrupted at any time to issue an - important event. This helps to improve the channel access and timing - algorithms as everything is computed in the PC. It would even be possible - to experiment with something completely different from the known CSMA and - DAMA channel access methods. - This kind of real-time control is especially important to supply several - TNCs that are connected between each other and the PC by a daisy chain - (however, this feature is not supported yet by the Linux 6pack driver). - -- Each packet transferred over the serial line is supplied with a checksum, - so it is easy to detect errors due to problems on the serial line. - Received packets that are corrupt are not passed on to the AX.25 layer. - Damaged packets that the TNC has received from the PC are not transmitted. - -More details about 6pack are described in the file 6pack.ps that is located -in the doc directory of the AX.25 utilities package. - -2. Who has developed the 6pack protocol? - -The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech -DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and -Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet. -They have also written a firmware for TNCs to perform the 6pack -protocol (see section 4 below). - -3. Where can I get the latest version of 6pack for LinuX? - -At the moment, the 6pack stuff can obtained via anonymous ftp from -db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq, -there is a file named 6pack.tgz. - -4. 
Preparing the TNC for 6pack operation - -To be able to use 6pack, a special firmware for the TNC is needed. The EPROM -of a newly bought TNC does not contain 6pack, so you will have to -program an EPROM yourself. The image file for 6pack EPROMs should be -available on any packet radio box where PC/FlexNet can be found. The name of -the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet -team. It can be used under the terms of the license that comes along -with PC/FlexNet. Please do not ask me about the internals of this file as I -don't know anything about it. I used a textual description of the 6pack -protocol to program the Linux driver. - -TNCs contain a 64kByte EPROM, the lower half of which is used for -the firmware/KISS. The upper half is either empty or is sometimes -programmed with software called TAPR. In the latter case, the TNC -is supplied with a DIP switch so you can easily change between the -two systems. When programming a new EPROM, one of the systems is replaced -by 6pack. It is useful to replace TAPR, as this software is rarely used -nowadays. If your TNC is not equipped with the switch mentioned above, you -can build in one yourself that switches over the highest address pin -of the EPROM between HIGH and LOW level. After having inserted the new EPROM -and switched to 6pack, apply power to the TNC for a first test. The connect -and the status LED are lit for about a second if the firmware initialises -the TNC correctly. - -5. Building and installing the 6pack driver - -The driver has been tested with kernel version 2.1.90. Use with older -kernels may lead to a compilation error because the interface to a kernel -function has been changed in the 2.1.8x kernels. - -How to turn on 6pack support: - -- In the linux kernel configuration program, select the code maturity level - options menu and turn on the prompting for development drivers. - -- Select the amateur radio support menu and turn on the serial port 6pack - driver. 
- -- Compile and install the kernel and the modules. - -To use the driver, the kissattach program delivered with the AX.25 utilities -has to be modified. - -- Do a cd to the directory that holds the kissattach sources. Edit the - kissattach.c file. At the top, insert the following lines: - - #ifndef N_6PACK - #define N_6PACK (N_AX25+1) - #endif - - Then find the line - - int disc = N_AX25; - - and replace N_AX25 by N_6PACK. - -- Recompile kissattach. Rename it to spattach to avoid confusions. - -Installing the driver: - -- Do an insmod 6pack. Look at your /var/log/messages file to check if the - module has printed its initialization message. - -- Do a spattach as you would launch kissattach when starting a KISS port. - Check if the kernel prints the message '6pack: TNC found'. - -- From here, everything should work as if you were setting up a KISS port. - The only difference is that the network device that represents - the 6pack port is called sp instead of sl or ax. So, sp0 would be the - first 6pack port. - -Although the driver has been tested on various platforms, I still declare it -ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module -and spattaching. Watch out if your computer behaves strangely. Read section -6 of this file about known problems. - -Note that the connect and status LEDs of the TNC are controlled in a -different way than they are when the TNC is used with PC/FlexNet. When using -FlexNet, the connect LED is on if there is a connection; the status LED is -on if there is data in the buffer of the PC's AX.25 engine that has to be -transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer, -so the 6pack driver doesn't know anything about connects or data that -has not yet been transmitted. Therefore the LEDs are controlled -as they are in KISS mode: The connect LED is turned on if data is transferred -from the PC to the TNC over the serial line, the status LED if data is -sent to the PC. - -6. 
Known problems - -When testing the driver with 2.0.3x kernels and -operating with data rates on the radio channel of 9600 Baud or higher, -the driver may, on certain systems, sometimes print the message '6pack: -bad checksum', which is due to data loss if the other station sends two -or more subsequent packets. I have been told that this is due to a problem -with the serial driver of 2.0.3x kernels. I don't know yet if the problem -still exists with 2.1.x kernels, as I have heard that the serial driver -code has been changed with 2.1.x. - -When shutting down the sp interface with ifconfig, the kernel crashes if -there is still an AX.25 connection left over which an IP connection was -running, even if that IP connection is already closed. The problem does not -occur when there is a bare AX.25 connection still running. I don't know if -this is a problem of the 6pack driver or something else in the kernel. - -The driver has been tested as a module, not yet as a kernel-builtin driver. - -The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is -connected to one serial port of the PC. This feature is not implemented -and at least at the moment I won't be able to do it because I do not have -the opportunity to build a TNC daisy-chain and test it. - -Some of the comments in the source code are inaccurate. They are left from -the SLIP/KISS driver, from which the 6pack driver has been derived. -I haven't modified or removed them yet -- sorry! The code itself needs -some cleaning and optimizing. This will be done in a later release. - -If you encounter a bug or if you have a question or suggestion concerning the -driver, feel free to mail me, using the addresses given at the beginning of -this file. - -Have fun! 
- -Andreas diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5b3421ec25ec..dc37fc8d5bee 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -37,6 +37,7 @@ Contents: tls-offload nfc 6lowpan + 6pack .. only:: subproject and html diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index 8e05b5c31a77..bf306fed04cc 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -30,7 +30,7 @@ config 6PACK Note that this driver is still experimental and might cause problems. For details about the features and the usage of the - driver, read . + driver, read . To compile this driver as a module, choose M here: the module will be called 6pack. -- cgit From 5a7f3132121bbcafd61f616170a08e511d675347 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:18 +0200 Subject: docs: networking: convert altera_tse.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/altera_tse.rst | 286 ++++++++++++++++++++++++++++++++ Documentation/networking/altera_tse.txt | 263 ----------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 287 insertions(+), 263 deletions(-) create mode 100644 Documentation/networking/altera_tse.rst delete mode 100644 Documentation/networking/altera_tse.txt diff --git a/Documentation/networking/altera_tse.rst b/Documentation/networking/altera_tse.rst new file mode 100644 index 000000000000..7a7040072e58 --- /dev/null +++ b/Documentation/networking/altera_tse.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. 
include:: + +======================================= +Altera Triple-Speed Ethernet MAC driver +======================================= + +Copyright |copy| 2008-2014 Altera Corporation + +This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers +using the SGDMA and MSGDMA soft DMA IP components. The driver uses the +platform bus to obtain component resources. The designs used to test this +driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, +and tested with ARM and NIOS processor hosts separately. The anticipated use +cases are simple communications between an embedded system and an external peer +for status and simple configuration of the embedded system. + +For more information visit www.altera.com and www.rocketboards.org. Support +forums for the driver may be found on www.rocketboards.org, and a design used +to test this driver may be found there as well. Support is also available from +the maintainer of this driver, found in MAINTAINERS. + +The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP +components that can be assembled and built into an FPGA using the Altera +Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that +this driver was tested against. The sopc2dts tool is used to create the +device tree for the driver, and may be found at rocketboards.org. + +The driver probe function examines the device tree and determines if the +Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The +probe function then installs the appropriate set of DMA routines to +initialize, setup transmits, receives, and interrupt handling primitives for +the respective configurations. + +The SGDMA component is to be deprecated in the near future (over the next 1-2 +years as of this writing in early 2014) in favor of the MSGDMA component. +SGDMA support is included for existing designs and reference in case a +developer wishes to support their own soft DMA logic and driver support. 
Any +new designs should not use the SGDMA. + +The SGDMA supports only a single transmit or receive operation at a time, and +therefore will not perform as well compared to the MSGDMA soft IP. Please +visit www.altera.com for known, documented SGDMA errata. + +Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. +Scatter-gather DMA will be added to a future maintenance update to this +driver. + +Jumbo frames are not supported at this time. + +The driver limits PHY operations to 10/100Mbps, and has not yet been fully +tested for 1Gbps. This support will be added in a future maintenance update. + +1. Kernel Configuration +======================= + +The kernel configuration option is ALTERA_TSE: + + Device Drivers ---> Network device support ---> Ethernet driver support ---> + Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) + +2. Driver parameters list +========================= + + - debug: message level (0: no output, 16: all); + - dma_rx_num: Number of descriptors in the RX list (default is 64); + - dma_tx_num: Number of descriptors in the TX list (default is 64). + +3. Command line options +======================= + +Driver parameters can be also passed in command line by using:: + + altera_tse=dma_rx_num:128,dma_tx_num:512 + +4. Driver information and notes +=============================== + +4.1. Transmit process +--------------------- +When the driver's transmit routine is called by the kernel, it sets up a +transmit descriptor by calling the underlying DMA transmit routine (SGDMA or +MSGDMA), and initiates a transmit operation. Once the transmit is complete, an +interrupt is driven by the transmit DMA logic. The driver handles the transmit +completion in the context of the interrupt handling chain by recycling +resource required to send and track the requested transmit operation. + +4.2. Receive process +-------------------- +The driver will post receive buffers to the receive DMA logic during driver +initialization. 
Receive buffers may or may not be queued depending upon the +underlying DMA logic (MSGDMA is able to queue receive buffers, SGDMA is not able +to queue receive buffers to the SGDMA receive logic). When a packet is +received, the DMA logic generates an interrupt. The driver handles a receive +interrupt by obtaining the DMA receive logic status, reaping receive +completions until no more receive completions are available. + +4.3. Interrupt Mitigation +------------------------- +The driver is able to mitigate the number of its DMA interrupts +using NAPI for receive operations. Interrupt mitigation is not yet supported +for transmit operations, but will be added in a future maintenance release. + +4.4. Ethtool support +-------------------- +Ethtool is supported. Driver statistics and internal errors can be read with the +ethtool -S ethX command. It is also possible to dump registers. + +4.5. PHY Support +---------------- +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.7. List of source files: +-------------------------- + - Kconfig + - Makefile + - altera_tse_main.c: main network device driver + - altera_tse_ethtool.c: ethtool support + - altera_tse.h: private driver structure and common definitions + - altera_msgdma.h: MSGDMA implementation function definitions + - altera_sgdma.h: SGDMA implementation function definitions + - altera_msgdma.c: MSGDMA implementation + - altera_sgdma.c: SGDMA implementation + - altera_sgdmahw.h: SGDMA register and descriptor definitions + - altera_msgdmahw.h: MSGDMA register and descriptor definitions + - altera_utils.c: Driver utility functions + - altera_utils.h: Driver utility function definitions + +5. Debug Information +==================== + +The driver exports debug information such as internal statistics, +debug information, MAC and DMA registers etc. + +A user may use the ethtool support to get statistics: +e.g. using: ethtool -S ethX (that shows the statistics counters) +or see the MAC registers: e.g.
using: ethtool -d ethX + +The developer can also use the "debug" module parameter to get +further debug information. + +6. Statistics Support +===================== + +The controller and driver support a mix of IEEE standard defined statistics, +RFC defined statistics, and driver or Altera defined statistics. The four +specifications containing the standard definitions for these statistics are +as follows: + + - IEEE 802.3-2012 - IEEE Standard for Ethernet. + - RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. + - RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. + - Altera Triple Speed Ethernet User Guide, found at http://www.altera.com + +The statistics supported by the TSE and the device driver are as follows: + +"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.2. This statistic is the count of frames that are successfully +transmitted. + +"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.5. This statistic is the count of frames that are successfully +received. This count does not include any error packets such as CRC errors, +length errors, or alignment errors. + +"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE +802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are +an integral number of bytes in length and do not pass the CRC test as the frame +is received. + +"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, +Section 5.2.2.1.7. This statistic is the count of frames that are not an +integral number of bytes in length and do not pass the CRC test as the frame is +received. + +"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.8. This statistic is the count of data and pad bytes +successfully transmitted from the interface. + +"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.14.
This statistic is the count of data and pad bytes +successfully received by the controller. + +"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE +802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames +transmitted from the network controller. + +"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE +802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames +received by the network controller. + +"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is +a count of the number of packets received containing errors that prevented the +packet from being delivered to a higher level protocol. + +"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic +is a count of the number of packets that could not be transmitted due to errors. + +"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were not addressed +to the broadcast address or a multicast group. + +"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +a multicast address group. + +"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +the broadcast address. + +"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This +statistic is the number of outbound packets not transmitted even though an +error was not detected. An example of a reason this might occur is to free up +internal buffer space. + +"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were not addressed to +a multicast group or broadcast address. + +"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. 
This +statistic counts the number of packets transmitted that were addressed to a +multicast group. + +"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +broadcast address. + +"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. +This statistic counts the number of packets dropped due to lack of internal +controller resources. + +"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. +This statistic counts the total number of bytes received by the controller, +including error and discarded packets. + +"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. +This statistic counts the total number of packets received by the controller, +including error, discarded, unicast, multicast, and broadcast packets. + +"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets received less +than 64 bytes long. + +"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets greater than 1518 +bytes long. + +"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. +This statistic counts the total number of packets received that were 64 octets +in length. + +"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC +2819. This statistic counts the total number of packets received that were +between 65 and 127 octets in length inclusive. + +"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 128 and 255 octets in length inclusive. + +"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in +RFC 2819. 
This statistic is the total number of packets received that were +between 256 and 511 octets in length inclusive. + +"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 512 and 1023 octets in length inclusive. + +"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets defined +in RFC 2819. This statistic is the total number of packets received that were +between 1024 and 1518 octets in length inclusive. + +"rx_gte_1519_bytes" is a statistic specific to the behavior of the +Altera TSE. This statistic counts the number of received good and errored +frames with a length between 1519 and the maximum frame length configured +in the frm_length register. See the Altera TSE User Guide for more details. + +"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This +statistic is the total number of packets received that were longer than 1518 +octets, and had either a bad CRC with an integral number of octets (CRC Error) +or a bad CRC with a non-integral number of octets (Alignment Error). + +"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This +statistic is the total number of packets received that were less than 64 octets +in length and had either a bad CRC with an integral number of octets (CRC +error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt deleted file mode 100644 index 50b8589d12fd..000000000000 --- a/Documentation/networking/altera_tse.txt +++ /dev/null @@ -1,263 +0,0 @@ - Altera Triple-Speed Ethernet MAC driver - -Copyright (C) 2008-2014 Altera Corporation - -This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers -using the SGDMA and MSGDMA soft DMA IP components. The driver uses the -platform bus to obtain component resources.
The designs used to test this -driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, -and tested with ARM and NIOS processor hosts separately. The anticipated use -cases are simple communications between an embedded system and an external peer -for status and simple configuration of the embedded system. - -For more information visit www.altera.com and www.rocketboards.org. Support -forums for the driver may be found on www.rocketboards.org, and a design used -to test this driver may be found there as well. Support is also available from -the maintainer of this driver, found in MAINTAINERS. - -The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP -components that can be assembled and built into an FPGA using the Altera -Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that -this driver was tested against. The sopc2dts tool is used to create the -device tree for the driver, and may be found at rocketboards.org. - -The driver probe function examines the device tree and determines if the -Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The -probe function then installs the appropriate set of DMA routines to -initialize, setup transmits, receives, and interrupt handling primitives for -the respective configurations. - -The SGDMA component is to be deprecated in the near future (over the next 1-2 -years as of this writing in early 2014) in favor of the MSGDMA component. -SGDMA support is included for existing designs and reference in case a -developer wishes to support their own soft DMA logic and driver support. Any -new designs should not use the SGDMA. - -The SGDMA supports only a single transmit or receive operation at a time, and -therefore will not perform as well compared to the MSGDMA soft IP. Please -visit www.altera.com for known, documented SGDMA errata. - -Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. 
-Scatter-gather DMA will be added to a future maintenance update to this -driver. - -Jumbo frames are not supported at this time. - -The driver limits PHY operations to 10/100Mbps, and has not yet been fully -tested for 1Gbps. This support will be added in a future maintenance update. - -1) Kernel Configuration -The kernel configuration option is ALTERA_TSE: - Device Drivers ---> Network device support ---> Ethernet driver support ---> - Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) - -2) Driver parameters list: - debug: message level (0: no output, 16: all); - dma_rx_num: Number of descriptors in the RX list (default is 64); - dma_tx_num: Number of descriptors in the TX list (default is 64). - -3) Command line options -Driver parameters can be also passed in command line by using: - altera_tse=dma_rx_num:128,dma_tx_num:512 - -4) Driver information and notes - -4.1) Transmit process -When the driver's transmit routine is called by the kernel, it sets up a -transmit descriptor by calling the underlying DMA transmit routine (SGDMA or -MSGDMA), and initiates a transmit operation. Once the transmit is complete, an -interrupt is driven by the transmit DMA logic. The driver handles the transmit -completion in the context of the interrupt handling chain by recycling -resource required to send and track the requested transmit operation. - -4.2) Receive process -The driver will post receive buffers to the receive DMA logic during driver -initialization. Receive buffers may or may not be queued depending upon the -underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able -to queue receive buffers to the SGDMA receive logic). When a packet is -received, the DMA logic generates an interrupt. The driver handles a receive -interrupt by obtaining the DMA receive logic status, reaping receive -completions until no more receive completions are available. 
- -4.3) Interrupt Mitigation -The driver is able to mitigate the number of its DMA interrupts -using NAPI for receive operations. Interrupt mitigation is not yet supported -for transmit operations, but will be added in a future maintenance release. - -4.4) Ethtool support -Ethtool is supported. Driver statistics and internal errors can be taken using: -ethtool -S ethX command. It is possible to dump registers etc. - -4.5) PHY Support -The driver is compatible with PAL to work with PHY and GPHY devices. - -4.7) List of source files: - o Kconfig - o Makefile - o altera_tse_main.c: main network device driver - o altera_tse_ethtool.c: ethtool support - o altera_tse.h: private driver structure and common definitions - o altera_msgdma.h: MSGDMA implementation function definitions - o altera_sgdma.h: SGDMA implementation function definitions - o altera_msgdma.c: MSGDMA implementation - o altera_sgdma.c: SGDMA implementation - o altera_sgdmahw.h: SGDMA register and descriptor definitions - o altera_msgdmahw.h: MSGDMA register and descriptor definitions - o altera_utils.c: Driver utility functions - o altera_utils.h: Driver utility function definitions - -5) Debug Information - -The driver exports debug information such as internal statistics, -debug information, MAC and DMA registers etc. - -A user may use the ethtool support to get statistics: -e.g. using: ethtool -S ethX (that shows the statistics counters) -or sees the MAC registers: e.g. using: ethtool -d ethX - -The developer can also use the "debug" module parameter to get -further debug information. - -6) Statistics Support - -The controller and driver support a mix of IEEE standard defined statistics, -RFC defined statistics, and driver or Altera defined statistics. The four -specifications containing the standard definitions for these statistics are -as follows: - - o IEEE 802.3-2012 - IEEE Standard for Ethernet. - o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. 
- o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. - o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com - -The statistics supported by the TSE and the device driver are as follows: - -"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.2. This statistics is the count of frames that are successfully -transmitted. - -"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.5. This statistic is the count of frames that are successfully -received. This count does not include any error packets such as CRC errors, -length errors, or alignment errors. - -"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE -802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are -an integral number of bytes in length and do not pass the CRC test as the frame -is received. - -"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, -Section 5.2.2.1.7. This statistic is the count of frames that are not an -integral number of bytes in length and do not pass the CRC test as the frame is -received. - -"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.8. This statistic is the count of data and pad bytes -successfully transmitted from the interface. - -"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.14. This statistic is the count of data and pad bytes -successfully received by the controller. - -"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE -802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames -transmitted from the network controller. - -"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE -802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames -received by the network controller. - -"rx_errors" is equivalent to ifInErrors defined in RFC 2863. 
This statistic is -a count of the number of packets received containing errors that prevented the -packet from being delivered to a higher level protocol. - -"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic -is a count of the number of packets that could not be transmitted due to errors. - -"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were not addressed -to the broadcast address or a multicast group. - -"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were addressed to -a multicast address group. - -"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were addressed to -the broadcast address. - -"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This -statistic is the number of outbound packets not transmitted even though an -error was not detected. An example of a reason this might occur is to free up -internal buffer space. - -"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were not addressed to -a multicast group or broadcast address. - -"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were addressed to a -multicast group. - -"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were addressed to a -broadcast address. - -"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. -This statistic counts the number of packets dropped due to lack of internal -controller resources. - -"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. 
-This statistic counts the total number of bytes received by the controller, -including error and discarded packets. - -"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. -This statistic counts the total number of packets received by the controller, -including error, discarded, unicast, multicast, and broadcast packets. - -"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. -This statistic counts the number of correctly formed packets received less -than 64 bytes long. - -"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. -This statistic counts the number of correctly formed packets greater than 1518 -bytes long. - -"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. -This statistic counts the total number of packets received that were 64 octets -in length. - -"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC -2819. This statistic counts the total number of packets received that were -between 65 and 127 octets in length inclusive. - -"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 128 and 255 octets in length inclusive. - -"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 256 and 511 octets in length inclusive. - -"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 512 and 1023 octets in length inclusive. - -"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define -in RFC 2819. This statistic is the total number of packets received that were -between 1024 and 1518 octets in length inclusive. - -"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the -Altera TSE. 
This statistics counts the number of received good and errored -frames between the length of 1519 and the maximum frame length configured -in the frm_length register. See the Altera TSE User Guide for More details. - -"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This -statistic is the total number of packets received that were longer than 1518 -octets, and had either a bad CRC with an integral number of octets (CRC Error) -or a bad CRC with a non-integral number of octets (Alignment Error). - -"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This -statistic is the total number of packets received that were less than 64 octets -in length and had either a bad CRC with an integral number of octets (CRC -error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index dc37fc8d5bee..96ffad845fd9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -38,6 +38,7 @@ Contents: nfc 6lowpan 6pack + altera_tse .. only:: subproject and html -- cgit From aa92320b3e38f2b64b2d91a20761db1683e6c531 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:19 +0200 Subject: docs: networking: convert arcnet-hardware.txt to ReST - add SPDX header; - add document title markup; - add notes markups; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/arcnet-hardware.rst | 3234 ++++++++++++++++++++++++++ Documentation/networking/arcnet-hardware.txt | 3133 ------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 3235 insertions(+), 3133 deletions(-) create mode 100644 Documentation/networking/arcnet-hardware.rst delete mode 100644 Documentation/networking/arcnet-hardware.txt diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst new file mode 100644 index 000000000000..b5a1a020c824 --- /dev/null +++ b/Documentation/networking/arcnet-hardware.rst @@ -0,0 +1,3234 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +ARCnet Hardware +=============== + +.. note:: + + 1) This file is a supplement to arcnet.txt. Please read that for general + driver configuration help. + 2) This file is no longer Linux-specific. It should probably be moved out + of the kernel sources. Ideas? + +Because so many people (myself included) seem to have obtained ARCnet cards +without manuals, this file contains a quick introduction to ARCnet hardware, +some cabling tips, and a listing of all jumper settings I can find. Please +e-mail apenwarr@worldvisions.ca with any settings for your particular card, +or any other information you have! + + +Introduction to ARCnet +====================== + +ARCnet is a network type which works in a way similar to popular Ethernet +networks but which is also different in some very important ways. + +First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps +(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact, +there are others as well, but these are less common. The different hardware +types, as far as I'm aware, are not compatible and so you cannot wire a +100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does +work with 100 Mbps cards, but I haven't been able to verify this myself, +since I only have the 2.5 Mbps variety. 
It is probably not going to saturate +your 100 Mbps card. Stop complaining. :) + +You also cannot connect an ARCnet card to any kind of Ethernet card and +expect it to work. + +There are two "types" of ARCnet - STAR topology and BUS topology. This +refers to how the cards are meant to be wired together. According to most +available documentation, you can only connect STAR cards to STAR cards and +BUS cards to BUS cards. That makes sense, right? Well, it's not quite +true; see below under "Cabling." + +Once you get past these little stumbling blocks, ARCnet is actually quite a +well-designed standard. It uses something called "modified token passing" +which makes it completely incompatible with so-called "Token Ring" cards, +but which makes transfers much more reliable than Ethernet does. In fact, +ARCnet will guarantee that a packet arrives safely at the destination, and +even if it can't possibly be delivered properly (ie. because of a cable +break, or because the destination computer does not exist) it will at least +tell the sender about it. + +Because of the carefully defined action of the "token", it will always make +a pass around the "ring" within a maximum length of time. This makes it +useful for realtime networks. + +In addition, all known ARCnet cards have an (almost) identical programming +interface. This means that with one ARCnet driver you can support any +card, whereas with Ethernet each manufacturer uses what is sometimes a +completely different programming interface, leading to a lot of different, +sometimes very similar, Ethernet drivers. Of course, always using the same +programming interface also means that when high-performance hardware +facilities like PCI bus mastering DMA appear, it's hard to take advantage of +them. Let's not go into that. + +One thing that makes ARCnet cards difficult to program for, however, is the +limit on their packet sizes; standard ARCnet can only send packets that are +up to 508 bytes in length. 
This is smaller than the Internet "bare minimum" +of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra +level of encapsulation is defined by RFC1201, which I call "packet +splitting," that allows "virtual packets" to grow as large as 64K each, +although they are generally kept down to the Ethernet-style 1500 bytes. + +For more information on the advantages and disadvantages (mostly the +advantages) of ARCnet networks, you might try the "ARCnet Trade Association" +WWW page: + + http://www.arcnet.com + + +Cabling ARCnet Networks +======================= + +This section was rewritten by + + Vojtech Pavlik + +using information from several people, including: + + - Avery Pennraun + - Stephen A. Wood + - John Paul Morrison + - Joachim Koenig + +and Avery touched it up a bit, at Vojtech's request. + +ARCnet (the classic 2.5 Mbps version) can be connected by two different +types of cabling: coax and twisted pair. The other ARCnet-type networks +(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of +cabling (Type1, Fiber, C1, C4, C5). + +For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables +also work fine, because ARCnet is a very stable network. I personally use 75 +Ohm TV antenna cable. + +Cards for coax cabling are shipped in two different variants: for BUS and +STAR network topologies. They are mostly the same. The only difference +lies in the hybrid chip installed. BUS cards use high impedance output, +while STAR use low impedance. Low impedance card (STAR) is electrically +equal to a high impedance one with a terminator installed. + +Usually, the ARCnet networks are built up from STAR cards and hubs. There +are two types of hubs - active and passive. Passive hubs are small boxes +with four BNC connectors containing four 47 Ohm resistors:: + + | | wires + R + junction + -R-+-R- R 47 Ohm resistors + R + | + +The shielding is connected together. 
Active hubs are much more complicated; +they are powered and contain electronics to amplify the signal and send it +to other segments of the net. They usually have eight connectors. Active +hubs come in two variants - dumb and smart. The dumb variant just +amplifies, but the smart one decodes to digital and encodes back all packets +coming through. This is much better if you have several hubs in the net, +since many dumb active hubs may worsen the signal quality. + +And now to the cabling. What you can connect together: + +1. A card to a card. This is the simplest way of creating a 2-computer + network. + +2. A card to a passive hub. Remember that all unused connectors on the hub + must be properly terminated with 93 Ohm (or something else if you don't + have the right ones) terminators. + + (Avery's note: oops, I didn't know that. Mine (TV cable) works + anyway, though.) + +3. A card to an active hub. Here is no need to terminate the unused + connectors except some kind of aesthetic feeling. But, there may not be + more than eleven active hubs between any two computers. That of course + doesn't limit the number of active hubs on the network. + +4. An active hub to another. + +5. An active hub to passive hub. + +Remember that you cannot connect two passive hubs together. The power loss +implied by such a connection is too high for the net to operate reliably. + +An example of a typical ARCnet network:: + + R S - STAR type card + S------H--------A-------S R - Terminator + | | H - Hub + | | A - Active hub + | S----H----S + S | + | + S + +The BUS topology is very similar to the one used by Ethernet. The only +difference is in cable and terminators: they should be 93 Ohm. Ethernet +uses 50 Ohm impedance. You use T connectors to put the computers on a single +line of cable, the bus. You have to put terminators at both ends of the +cable. 
A typical BUS ARCnet network looks like:: + + RT----T------T------T------T------TR + B B B B B B + + B - BUS type card + R - Terminator + T - T connector + +But that is not all! The two types can be connected together. According to +the official documentation the only way of connecting them is using an active +hub:: + + A------T------T------TR + | B B B + S---H---S + | + S + +The official docs also state that you can use STAR cards at the ends of +BUS network in place of a BUS card and a terminator:: + + S------T------T------S + B B + +But, according to my own experiments, you can simply hang a BUS type card +anywhere in middle of a cable in a STAR topology network. And more - you +can use the bus card in place of any star card if you use a terminator. Then +you can build very complicated networks fulfilling all your needs! An +example:: + + S + | + RT------T-------T------H------S + B B B | + | R + S------A------T-------T-------A-------H------TR + | B B | | B + | S BT | + | | | S----A-----S + S------H---A----S | | + | | S------T----H---S | + S S B R S + +A basically different cabling scheme is used with Twisted Pair cabling. Each +of the TP cards has two RJ (phone-cord style) connectors. The cards are +then daisy-chained together using a cable connecting every two neighboring +cards. The ends are terminated with RJ 93 Ohm terminators which plug into +the empty connectors of cards on the ends of the chain. An example:: + + ___________ ___________ + _R_|_ _|_|_ _|_R_ + | | | | | | + |Card | |Card | |Card | + |_____| |_____| |_____| + + +There are also hubs for the TP topology. There is nothing difficult +involved in using them; you just connect a TP chain to a hub on any end or +even at both. This way you can create almost any network configuration. +The maximum of 11 hubs between any two computers on the net applies here as +well. 
An example:: + + RP-------P--------P--------H-----P------P-----PR + | + RP-----H--------P--------H-----P------PR + | | + PR PR + + R - RJ Terminator + P - TP Card + H - TP Hub + +Like any network, ARCnet has a limited cable length. These are the maximum +cable lengths between two active ends (an active end being an active hub or +a STAR card). + + ========== ======= =========== + RG-62 93 Ohm up to 650 m + RG-59/U 75 Ohm up to 457 m + RG-11/U 75 Ohm up to 533 m + IBM Type 1 150 Ohm up to 200 m + IBM Type 3 100 Ohm up to 100 m + ========== ======= =========== + +The maximum length of all cables connected to a passive hub is limited to 65 +meters for RG-62 cabling; less for others. You can see that using passive +hubs in a large network is a bad idea. The maximum length of a single "BUS +Trunk" is about 300 meters for RG-62. The maximum distance between the two +most distant points of the net is limited to 3000 meters. The maximum length +of a TP cable between two cards/hubs is 650 meters. + + +Setting the Jumpers +=================== + +All ARCnet cards should have a total of four or five different settings: + + - the I/O address: this is the "port" your ARCnet card is on. Probed + values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If + your card has additional ones, which is possible, please tell me.) This + should not be the same as any other device on your system. According to + a doc I got from Novell, MS Windows prefers values of 0x300 or more, + eating net connections on my system (at least) otherwise. My guess is + this may be because, if your card is at 0x2E0, probing for a serial port + at 0x2E8 will reset the card and probably mess things up royally. + + - Avery's favourite: 0x300. + + - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7. + on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15. + + Make sure this is different from any other card on your system. Note + that IRQ2 is the same as IRQ9, as far as Linux is concerned. 
You can + "cat /proc/interrupts" for a somewhat complete list of which ones are in + use at any given time. Here is a list of common usages from Vojtech + Pavlik : + + ("Not on bus" means there is no way for a card to generate this + interrupt) + + ====== ========================================================= + IRQ 0 Timer 0 (Not on bus) + IRQ 1 Keyboard (Not on bus) + IRQ 2 IRQ Controller 2 (Not on bus, nor does interrupt the CPU) + IRQ 3 COM2 + IRQ 4 COM1 + IRQ 5 FREE (LPT2 if you have it; sometimes COM3; maybe PLIP) + IRQ 6 Floppy disk controller + IRQ 7 FREE (LPT1 if you don't use the polling driver; PLIP) + IRQ 8 Realtime Clock Interrupt (Not on bus) + IRQ 9 FREE (VGA vertical sync interrupt if enabled) + IRQ 10 FREE + IRQ 11 FREE + IRQ 12 FREE + IRQ 13 Numeric Coprocessor (Not on bus) + IRQ 14 Fixed Disk Controller + IRQ 15 FREE (Fixed Disk Controller 2 if you have it) + ====== ========================================================= + + + .. note:: + + IRQ 9 is used on some video cards for the "vertical retrace" + interrupt. This interrupt would have been handy for things like + video games, as it occurs exactly once per screen refresh, but + unfortunately IBM cancelled this feature starting with the original + VGA and thus many VGA/SVGA cards do not support it. For this + reason, no modern software uses this interrupt and it can almost + always be safely disabled, if your video card supports it at all. + + If your card for some reason CANNOT disable this IRQ (usually there + is a jumper), one solution would be to clip the printed circuit + contact on the board: it's the fourth contact from the left on the + back side. I take no responsibility if you try this. + + - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though. + + - the memory address: Unlike most cards, ARCnets use "shared memory" for + copying buffers around. Make SURE it doesn't conflict with any other + used memory in your system! 
+ + :: + + A0000 - VGA graphics memory (ok if you don't have VGA) + B0000 - Monochrome text mode + C0000 \ One of these is your VGA BIOS - usually C0000. + E0000 / + F0000 - System BIOS + + Anything less than 0xA0000 is, well, a BAD idea since it isn't above + 640k. + + - Avery's favourite: 0xD0000 + + - the station address: Every ARCnet card has its own "unique" network + address from 0 to 255. Unlike Ethernet, you can set this address + yourself with a jumper or switch (or on some cards, with special + software). Since it's only 8 bits, you can only have 254 ARCnet cards + on a network. DON'T use 0 or 255, since these are reserved (although + neat stuff will probably happen if you DO use them). By the way, if you + haven't already guessed, don't set this the same as any other ARCnet on + your network! + + - Avery's favourite: 3 and 4. Not that it matters. + + - There may be ETS1 and ETS2 settings. These may or may not make a + difference on your card (many manuals call them "reserved"), but are + used to change the delays used when powering up a computer on the + network. This is only necessary when wiring VERY long range ARCnet + networks, on the order of 4km or so; in any case, the only real + requirement here is that all cards on the network with ETS1 and ETS2 + jumpers have them in the same position. Chris Hindy + sent in a chart with actual values for this: + + ======= ======= =============== ==================== + ET1 ET2 Response Time Reconfiguration Time + ======= ======= =============== ==================== + open open 74.7us 840us + open closed 283.4us 1680us + closed open 561.8us 1680us + closed closed 1118.6us 1680us + ======= ======= =============== ==================== + + Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your + network. + +Also, on many cards (not mine, though) there are red and green LED's. 
+Vojtech Pavlik tells me this is what they mean: + + =============== =============== ===================================== + GREEN RED Status + =============== =============== ===================================== + OFF OFF Power off + OFF Short flashes Cabling problems (broken cable or not + terminated) + OFF (short) ON Card init + ON ON Normal state - everything OK, nothing + happens + ON Long flashes Data transfer + ON OFF Never happens (maybe when wrong ID) + =============== =============== ===================================== + + +The following is all the specific information people have sent me about +their own particular ARCnet cards. It is officially a mess, and contains +huge amounts of duplicated information. I have no time to fix it. If you +want to, PLEASE DO! Just send me a 'diff -u' of all your changes. + +The model # is listed right above specifics for that card, so you should be +able to use your text viewer's "search" function to find the entry you want. +If you don't KNOW what kind of card you have, try looking through the +various diagrams to see if you can tell. + +If your model isn't listed and/or has different settings, PLEASE PLEASE +tell me. I had to figure mine out without the manual, and it WASN'T FUN! + +Even if your ARCnet model isn't listed, but has the same jumpers as another +model that is, please e-mail me to say so. + +Cards Listed in this file (in this order, mostly): + + =============== ======================= ==== + Manufacturer Model # Bits + =============== ======================= ==== + SMC PC100 8 + SMC PC110 8 + SMC PC120 8 + SMC PC130 8 + SMC PC270E 8 + SMC PC500 16 + SMC PC500Longboard 16 + SMC PC550Longboard 16 + SMC PC600 16 + SMC PC710 8 + SMC? LCS-8830(-T) 8/16 + Puredata PDI507 8 + CNet Tech CN120-Series 8 + CNet Tech CN160-Series 16 + Lantech? UM9065L chipset 8 + Acer 5210-003 8 + Datapoint? LAN-ARC-8 8 + Topware TA-ARC/10 8 + Thomas-Conrad 500-6242-0097 REV A 8 + Waterloo? (C)1985 Waterloo Micro. 
8 + No Name -- 8/16 + No Name Taiwan R.O.C? 8 + No Name Model 9058 8 + Tiara Tiara Lancard? 8 + =============== ======================= ==== + + +* SMC = Standard Microsystems Corp. +* CNet Tech = CNet Technology, Inc. + +Unclassified Stuff +================== + + - Please send any other information you can find. + + - And some other stuff (more info is welcome!):: + + From: root@ultraworld.xs4all.nl (Timo Hilbrink) + To: apenwarr@foxnet.net (Avery Pennarun) + Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT) + Reply-To: timoh@xs4all.nl + + [...parts deleted...] + + About the jumpers: On my PC130 there is one more jumper, located near the + cable-connector and it's for changing to star or bus topology; + closed: star - open: bus + On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI + and another with ALE,LA17,LA18,LA19 these are undocumented.. + + [...more parts deleted...] + + --- CUT --- + +Standard Microsystems Corp (SMC) +================================ + +PC100, PC110, PC120, PC130 (8-bit cards) and PC500, PC600 (16-bit cards) +------------------------------------------------------------------------ + + - mainly from Avery Pennarun . Values depicted + are from Avery's setup. + - special thanks to Timo Hilbrink for noting that PC120, + 130, 500, and 600 all have the same switches as Avery's PC100. + PC500/600 have several extra, undocumented pins though. (?) + - PC110 settings were verified by Stephen A. Wood + - Also, the JP- and S-numbers probably don't match your card exactly. Try + to find jumpers/switches with the same number of settings - it's + probably more reliable. + +:: + + JP5 [|] : : : : + (IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7 + Put exactly one jumper on exactly one set of pins. + + + 1 2 3 4 5 6 7 8 9 10 + S1 /----------------------------------\ + (I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 | + addresses) \----------------------------------/ + |--| |--------| |--------| + (a) (b) (m) + + WARNING. 
It's very important when setting these which way + you're holding the card, and which way you think is '1'! + + If you suspect that your settings are not being made + correctly, try reversing the direction or inverting the + switch positions. + + a: The first digit of the I/O address. + Setting Value + ------- ----- + 00 0 + 01 1 + 10 2 + 11 3 + + b: The second digit of the I/O address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The I/O address is in the form ab0. For example, if + a is 0x2 and b is 0xE, the address will be 0x2E0. + + DO NOT SET THIS LESS THAN 0x200!!!!! + + + m: The first digit of the memory address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The memory address is in the form m0000. For example, if + m is D, the address will be 0xD0000. + + DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000! + + 1 2 3 4 5 6 7 8 + S2 /--------------------------\ + (Station Address) | 1 1 0 0 0 0 0 0 | + \--------------------------/ + + Setting Value + ------- ----- + 00000000 00 + 10000000 01 + 01000000 02 + ... + 01111111 FE + 11111111 FF + + Note that this is binary with the digits reversed! + + DO NOT SET THIS TO 0 OR 255 (0xFF)! + + +PC130E/PC270E (8-bit cards) +--------------------------- + + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original SMC Manual + + "Configuration Guide for ARCNET(R)-PC130E/PC270 Network + Controller Boards Pub. # 900.044A June, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC130E is an enhanced version of the PC130 board, is equipped with a +standard BNC female connector for connection to RG-62/U coax cable. 
+Since this board is designed both for point-to-point connection in star +networks and for connection to bus networks, it is downwardly compatible +with all the other standard boards designed for coax networks (that is, +the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and +PC200 bus topology boards). + +The PC270E is an enhanced version of the PC260 board, is equipped with two +modular RJ11-type jacks for connection to twisted pair wiring. +It can be used in a star or a daisy-chained network. + +:: + + 8 7 6 5 4 3 2 1 + ________________________________________________________________ + | | S1 | | + | |_________________| | + | Offs|Base |I/O Addr | + | RAM Addr | ___| + | ___ ___ CR3 |___| + | | \/ | CR4 |___| + | | PROM | ___| + | | | N | | 8 + | | SOCKET | o | | 7 + | |________| d | | 6 + | ___________________ e | | 5 + | | | A | S | 4 + | |oo| EXT2 | | d | 2 | 3 + | |oo| EXT1 | SMC | d | | 2 + | |oo| ROM | 90C63 | r |___| 1 + | |oo| IRQ7 | | |o| _____| + | |oo| IRQ5 | | |o| | J1 | + | |oo| IRQ4 | | STAR |_____| + | |oo| IRQ3 | | | J2 | + | |oo| IRQ2 |___________________| |_____| + |___ ______________| + | | + |_____________________________________________| + +Legend:: + + SMC 90C63 ARCNET Controller / Transceiver /Logic + S1 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select + S2 1-8: Node ID Select + EXT Extended Timeout Select + ROM ROM Enable Select + STAR Selected - Star Topology (PC130E only) + Deselected - Bus Topology (PC130E only) + CR3/CR4 Diagnostic LEDs + J1 BNC RG62/U Connector (PC130E only) + J1 6-position Telephone Jack (PC270E only) + J2 6-position Telephone Jack (PC270E only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group S2 are used to set the node ID. +These switches work in a way similar to the PC100-series cards; see that +entry for more information. 
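As a rough illustration of the reversed-binary scheme that the PC100-series entry describes for its station address switches, the sketch below turns a switch string into a node ID. The function name `decode_s2` is hypothetical, and the string is assumed to be written from switch 1 to switch 8, left to right, as in that entry's table.

```python
def decode_s2(setting: str) -> int:
    """Decode an 8-character S2 string ('0'/'1', switch 1 first) into a node ID."""
    if len(setting) != 8 or set(setting) - {"0", "1"}:
        raise ValueError("expected eight characters, each '0' or '1'")
    # Switch 1 is the least significant bit, so reverse the string before parsing.
    node_id = int(setting[::-1], 2)
    if node_id in (0x00, 0xFF):
        raise ValueError("node IDs 0 and 255 (0xFF) must not be used")
    return node_id


if __name__ == "__main__":
    # The example S2 setting drawn in the PC100-series entry.
    print(hex(decode_s2("11000000")))
```

Remember that on the PC130E/PC270E a switch set to Off/Open counts as "1", so the string has to be written down with that convention in mind.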
+ + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group S1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 1 2 3 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group S1 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group S1. + +:: + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + *) To enable the 8K Boot PROM install the jumper ROM. 
+ The default is jumper ROM not installed. + + +Setting the Timeouts and Interrupt +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2. + + +Configuring the PC130E for Star or Bus Topology +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The single jumper labeled STAR is used to configure the PC130E board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +^^^^^^^^^^^^^^^ + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity:: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +PC500/PC550 Longboard (16-bit cards) +------------------------------------ + + - from Juergen Seifert + + + .. note:: + + There is another Version of the PC500 called Short Version, which + is different in hard- and software! The most important differences + are: + + - The long board has no Shared memory. + - On the long board the selection of the interrupt is done by binary + coded switch, on the short board directly by jumper. + +[Avery's note: pay special attention to that: the long board HAS NO SHARED +MEMORY. This means the current Linux-ARCnet driver can't use these cards. +I have obtained a PC500Longboard and will be doing some experiments on it in +the future, but don't hold your breath. Thanks again to Juergen Seifert for +his advice about this!] 
+ +This description has been written by Juergen Seifert +using information from the following Original SMC Manual + + "Configuration Guide for SMC ARCNET-PC500/PC550 + Series Network Controller Boards Pub. # 900.033 Rev. A + November, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC500 is equipped with a standard BNC female connector for connection +to RG-62/U coax cable. +The board is designed both for point-to-point connection in star networks +and for connection to bus networks. + +The PC550 is equipped with two modular RJ11-type jacks for connection +to twisted pair wiring. +It can be used in a star or a daisy-chained (BUS) network. + +:: + + 1 + 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1 + ____________________________________________________________________ + < | SW1 | | SW2 | | + > |_____________________| |_____________| | + < IRQ |I/O Addr | + > ___| + < CR4 |___| + > CR3 |___| + < ___| + > N | | 8 + < o | | 7 + > d | S | 6 + < e | W | 5 + > A | 3 | 4 + < d | | 3 + > d | | 2 + < r |___| 1 + > |o| _____| + < |o| | J1 | + > 3 1 JP6 |_____| + < |o|o| JP2 | J2 | + > |o|o| |_____| + < 4 2__ ______________| + > | | | + <____| |_____________________________________________| + +Legend:: + + SW1 1-6: I/O Base Address Select + 7-10: Interrupt Select + SW2 1-6: Reserved for Future Use + SW3 1-8: Node ID Select + JP2 1-4: Extended Timeout Select + JP6 Selected - Star Topology (PC500 only) + Deselected - Bus Topology (PC500 only) + CR3 Green Monitors Network Activity + CR4 Red Monitors Board Activity + J1 BNC RG62/U Connector (PC500 only) + J1 6-position Telephone Jack (PC550 only) + J2 6-position Telephone Jack (PC550 only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW3 are used to set the node ID. 
Each node +attached to the network must have an unique node ID which must be +different from 0. +Switch 1 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first six switches in switch group SW1 are used to select one +of 32 possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 6 5 4 3 2 1 | Address + -------------|-------- + 0 1 0 0 0 0 | 200 + 0 1 0 0 0 1 | 210 + 0 1 0 0 1 0 | 220 + 0 1 0 0 1 1 | 230 + 0 1 0 1 0 0 | 240 + 0 1 0 1 0 1 | 250 + 0 1 0 1 1 0 | 260 + 0 1 0 1 1 1 | 270 + 0 1 1 0 0 0 | 280 + 0 1 1 0 0 1 | 290 + 0 1 1 0 1 0 | 2A0 + 0 1 1 0 1 1 | 2B0 + 0 1 1 1 0 0 | 2C0 + 0 1 1 1 0 1 | 2D0 + 0 1 1 1 1 0 | 2E0 (Manufacturer's default) + 0 1 1 1 1 1 | 2F0 + 1 1 0 0 0 0 | 300 + 1 1 0 0 0 1 | 310 + 1 1 0 0 1 0 | 320 + 1 1 0 0 1 1 | 330 + 1 1 0 1 0 0 | 340 + 1 1 0 1 0 1 | 350 + 1 1 0 1 1 0 | 360 + 1 1 0 1 1 1 | 370 + 1 1 1 0 0 0 | 380 + 1 1 1 0 0 1 | 390 + 1 1 1 0 1 0 | 3A0 + 1 1 1 0 1 1 | 3B0 + 1 1 1 1 0 0 | 3C0 + 1 1 1 1 0 1 | 3D0 + 1 1 1 1 1 0 | 3E0 + 1 1 1 1 1 1 | 3F0 + + +Setting the Interrupt +^^^^^^^^^^^^^^^^^^^^^ + +Switches seven through ten of switch group SW1 are used to select the +interrupt level. The interrupt level is binary coded, so selections +from 0 to 15 would be possible, but only the following eight values will +be supported: 3, 4, 5, 7, 9, 10, 11, 12. 
+ +:: + + Switch | IRQ + 10 9 8 7 | + ---------|-------- + 0 0 1 1 | 3 + 0 1 0 0 | 4 + 0 1 0 1 | 5 + 0 1 1 1 | 7 + 1 0 0 1 | 9 (=2) (default) + 1 0 1 0 | 10 + 1 0 1 1 | 11 + 1 1 0 0 | 12 + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +The two jumpers JP2 (1-4) are used to determine the timeout parameters. +These two jumpers are normally left open. +Refer to the COM9026 Data Sheet for alternate configurations. + + +Configuring the PC500 for Star or Bus Topology +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The single jumper labeled JP6 is used to configure the PC500 board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +^^^^^^^^^^^^^^^ + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity:: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +PC710 (8-bit card) +------------------ + + - from J.S. van Oosten + +Note: this data is gathered by experimenting and looking at info of other +cards. However, I'm sure I got 99% of the settings right. + +The SMC710 card resembles the PC270 card, but is much more basic (i.e. no +LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing:: + + _______________________________________ + | +---------+ +---------+ |____ + | | S2 | | S1 | | + | +---------+ +---------+ | + | | + | +===+ __ | + | | R | | | X-tal ###___ + | | O | |__| ####__'| + | | M | || ### + | +===+ | + | | + | .. JP1 +----------+ | + | .. | big chip | | + | .. | 90C63 | | + | .. | | | + | .. 
+----------+ | + ------- ----------- + ||||||||||||||||||||| + +The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes +labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM, +IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) ) + +S1 and S2 perform the same function as on the PC270, only their numbers +are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address). + +I know it works when connected to a PC110 type ARCnet board. + + +***************************************************************************** + +Possibly SMC +============ + +LCS-8830(-T) (8 and 16-bit cards) +--------------------------------- + + - from Mathias Katzer + - Marek Michalkiewicz says the + LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS + only (the JP0 jumper is hardwired), and BNC only. + +This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC, +nowhere else, not even on the few Xeroxed sheets from the manual). + +SMC ARCnet Board Type LCS-8830-T:: + + ------------------------------------ + | | + | JP3 88 8 JP2 | + | ##### | \ | + | ##### ET1 ET2 ###| + | 8 ###| + | U3 SW 1 JP0 ###| Phone Jacks + | -- ###| + | | | | + | | | SW2 | + | | | | + | | | ##### | + | -- ##### #### BNC Connector + | #### + | 888888 JP1 | + | 234567 | + -- ------- + ||||||||||||||||||||||||||| + -------------------------- + + + SW1: DIP-Switches for Station Address + SW2: DIP-Switches for Memory Base and I/O Base addresses + + JP0: If closed, internal termination on (default open) + JP1: IRQ Jumpers + JP2: Boot-ROM enabled if closed + JP3: Jumpers for response timeout + + U3: Boot-ROM Socket + + + ET1 ET2 Response Time Idle Time Reconfiguration Time + + 78 86 840 + X 285 316 1680 + X 563 624 1680 + X X 1130 1237 1680 + + (X means closed jumper) + + (DIP-Switch downwards means "0") + +The station address is binary-coded with SW1. 
+ +The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2: + +======== ======== +Switches Base +678 Address +======== ======== +000 260-26f +100 290-29f +010 2e0-2ef +110 2f0-2ff +001 300-30f +101 350-35f +011 380-38f +111 3e0-3ef +======== ======== + + +DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range: + +======== ============= ================ +Switches RAM ROM +12345 Address Range Address Range +======== ============= ================ +00000 C:0000-C:07ff C:2000-C:3fff +10000 C:0800-C:0fff +01000 C:1000-C:17ff +11000 C:1800-C:1fff +00100 C:4000-C:47ff C:6000-C:7fff +10100 C:4800-C:4fff +01100 C:5000-C:57ff +11100 C:5800-C:5fff +00010 C:C000-C:C7ff C:E000-C:ffff +10010 C:C800-C:Cfff +01010 C:D000-C:D7ff +11010 C:D800-C:Dfff +00110 D:0000-D:07ff D:2000-D:3fff +10110 D:0800-D:0fff +01110 D:1000-D:17ff +11110 D:1800-D:1fff +00001 D:4000-D:47ff D:6000-D:7fff +10001 D:4800-D:4fff +01001 D:5000-D:57ff +11001 D:5800-D:5fff +00101 D:8000-D:87ff D:A000-D:bfff +10101 D:8800-D:8fff +01101 D:9000-D:97ff +11101 D:9800-D:9fff +00011 D:C000-D:c7ff D:E000-D:ffff +10011 D:C800-D:cfff +01011 D:D000-D:d7ff +11011 D:D800-D:dfff +00111 E:0000-E:07ff E:2000-E:3fff +10111 E:0800-E:0fff +01111 E:1000-E:17ff +11111 E:1800-E:1fff +======== ============= ================ + + +PureData Corp +============= + +PDI507 (8-bit card) +-------------------- + + - from Mark Rejhon (slight modifications by Avery) + - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards) + are mostly the same as this. PDI508Plus cards appear to be mainly + software-configured. + +Jumpers: + + There is a jumper array at the bottom of the card, near the edge + connector. This array is labelled J1. They control the IRQs and + something else. Put only one jumper on the IRQ pins. + + ETS1, ETS2 are for timing on very long distance networks. See the + more general information near the top of this file. + + There is a J2 jumper on two pins. 
A jumper should be put on them, + since it was already there when I got the card. I don't know what + this jumper is for though. + + There is a two-jumper array for J3. I don't know what it is for, + but there were already two jumpers on it when I got the card. It's + a six pin grid in a two-by-three fashion. The jumpers were + configured as follows:: + + .-------. + o | o o | + :-------: ------> Accessible end of card with connectors + o | o o | in this direction -------> + `-------' + +Carl de Billy explains J3 and J4: + + J3 Diagram:: + + .-------. + o | o o | + :-------: TWIST Technology + o | o o | + `-------' + .-------. + | o o | o + :-------: COAX Technology + | o o | o + `-------' + + - If using coax cable in a bus topology the J4 jumper must be removed; + place it on one pin. + + - If using bus topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + Connectors. Also the J4 jumper must be removed; place it on one pin of + J4 jumper for storage. + + - If using star topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + connectors. + + +DIP Switches: + + The DIP switches accessible on the accessible end of the card while + it is installed, is used to set the ARCnet address. There are 8 + switches. Use an address from 1 to 254 + + ========== ========================= + Switch No. ARCnet address + 12345678 + ========== ========================= + 00000000 FF (Don't use this!) + 00000001 FE + 00000010 FD + ... + 11111101 2 + 11111110 1 + 11111111 0 (Don't use this!) + ========== ========================= + + There is another array of eight DIP switches at the top of the + card. There are five labelled MS0-MS4 which seem to control the + memory address, and another three labelled IO0-IO2 which seem to + control the base I/O address of the card. 
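Going back to the ARCnet address table above: the listed rows suggest that the address is simply the bitwise complement of the switch pattern. Here is a minimal sketch of that reading. The name `pdi507_address` is made up for illustration, and the table does not say which physical switch position counts as "1", so check that on your own card.

```python
def pdi507_address(switches: str) -> int:
    """Map a PDI507 switch string (switches 1-8, left to right) to an ARCnet address.

    Assumes the pattern seen in the table: address = 0xFF minus the binary value.
    """
    if len(switches) != 8 or set(switches) - {"0", "1"}:
        raise ValueError("expected eight characters, each '0' or '1'")
    address = int(switches, 2) ^ 0xFF  # invert all eight bits
    if address in (0x00, 0xFF):
        raise ValueError("addresses 0 and FF must not be used")
    return address


if __name__ == "__main__":
    print(hex(pdi507_address("00000001")))
```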
+ + This was difficult to test by trial and error, and the I/O addresses + are in a weird order. This was tested by setting the DIP switches, + rebooting the computer, and attempting to load ARCETHER at various + addresses (mostly between 0x200 and 0x400). The address that caused + the red transmit LED to blink is the one that I thought works. + + Also, the address 0x3D0 seems to have a special meaning, since the + ARCETHER packet driver loaded fine, but without the red LED + blinking. I don't know what 0x3D0 is for, though. I recommend using + an address of 0x300 since Windows may not like addresses below + 0x300. + + ============= =========== + IO Switch No. I/O address + 210 + ============= =========== + 111 0x260 + 110 0x290 + 101 0x2E0 + 100 0x2F0 + 011 0x300 + 010 0x350 + 001 0x380 + 000 0x3E0 + ============= =========== + + The memory switches set a reserved address space of 0x1000 bytes + (0x100 segment units, or 4k). For example, if I set an address of + 0xD000, it will use up addresses 0xD000 to 0xD100. + + The memory switches were tested by booting using QEMM386 stealth, + and using LOADHI to see what address automatically became excluded + from the upper memory regions, and then attempting to load ARCETHER + using these addresses. + + I recommend using an ARCnet memory address of 0xD000, and putting + the EMS page frame at 0xC000 while using QEMM stealth mode. That + way, you get contiguous high memory from 0xD100 almost all the way + to the end of the megabyte. + + Memory Switch 0 (MS0) didn't seem to work properly when set to OFF + on my card. It could be malfunctioning on my card. Experiment with + it ON first, and if it doesn't work, set it to OFF. (It may be a + modifier for the 0x200 bit?) + + ============= ============================================ + MS Switch No.
+ 43210 Memory address + ============= ============================================ + 00001 0xE100 (guessed - was not detected by QEMM) + 00011 0xE000 (guessed - was not detected by QEMM) + 00101 0xDD00 + 00111 0xDC00 + 01001 0xD900 + 01011 0xD800 + 01101 0xD500 + 01111 0xD400 + 10001 0xD100 + 10011 0xD000 + 10101 0xCD00 + 10111 0xCC00 + 11001 0xC900 (guessed - crashes tested system) + 11011 0xC800 (guessed - crashes tested system) + 11101 0xC500 (guessed - crashes tested system) + 11111 0xC400 (guessed - crashes tested system) + ============= ============================================ + +CNet Technology Inc. +==================== + +120 Series (8-bit cards) +------------------------ + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original CNet Manual + + "ARCNET USER'S MANUAL for + CN120A + CN120AB + CN120TP + CN120ST + CN120SBT + P/N:12-01-0007 + Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +- P/N 120A ARCNET 8 bit XT/AT Star +- P/N 120AB ARCNET 8 bit XT/AT Bus +- P/N 120TP ARCNET 8 bit XT/AT Twisted Pair +- P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair +- P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair + +:: + + __________________________________________________________________ + | | + | ___| + | LED |___| + | ___| + | N | | ID7 + | o | | ID6 + | d | S | ID5 + | e | W | ID4 + | ___________________ A | 2 | ID3 + | | | d | | ID2 + | | | 1 2 3 4 5 6 7 8 d | | ID1 + | | | _________________ r |___| ID0 + | | 90C65 || SW1 | ____| + | JP 8 7 | ||_________________| | | + | |o|o| JP1 | | | J2 | + | |o|o| |oo| | | JP 1 1 1 | | + | ______________ | | 0 1 2 |____| + | | PROM | |___________________| |o|o|o| _____| + | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 | + | |______________| |o|o|o|o|o| |o|o|o| |_____| + |_____ |o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Probe + S1 1-5: Base Memory Address 
Select + 6-8: Base I/O Address Select + S2 1-8: Node ID Select (ID0-ID7) + JP1 ROM Enable Select + JP2 IRQ2 + JP3 IRQ3 + JP4 IRQ4 + JP5 IRQ5 + JP6 IRQ7 + JP7/JP8 ET1, ET2 Timeout Parameters + JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only) + JP12 Terminator Select (CN120AB/ST/SBT only) + J1 BNC RG62/U Connector (all except CN120TP) + J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only) + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + ======= ====== ===== + Switch Label Value + ======= ====== ===== + 1 ID0 1 + 2 ID1 2 + 3 ID2 4 + 4 ID3 8 + 5 ID4 16 + 6 ID5 32 + 7 ID6 64 + 8 ID7 128 + ======= ====== ===== + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. 
The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 8K or memory base + 0x2000. +Switches 1-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM install the jumper JP1 + +.. note:: + + Since the switches 1 and 2 are always set to ON it may be possible + that they can be used to add an offset of 2K, 4K or 6K to the base + address, but this feature is not documented in the manual and I + haven't tested it yet. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level install one (only one!) of the jumpers +JP2, JP3, JP4, JP5, JP6. JP2 is the default:: + + Jumper | IRQ + -------|----- + 2 | 2 + 3 | 3 + 4 | 4 + 5 | 5 + 6 | 7 + + +Setting the Internal Terminator on CN120AB/TP/SBT +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumper JP12 is used to enable the internal terminator:: + + ----- + 0 | 0 | + ----- ON | | ON + | 0 | | 0 | + | | OFF ----- OFF + | 0 | 0 + ----- + Terminator Terminator + disabled enabled + + +Selecting the Connector Type on CN120ST/SBT +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + JP10 JP11 JP10 JP11 + ----- ----- + 0 0 | 0 | | 0 | + ----- ----- | | | | + | 0 | | 0 | | 0 | | 0 | + | | | | ----- ----- + | 0 | | 0 | 0 0 + ----- ----- + Coaxial Cable Twisted Pair Cable + (Default) + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. 
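To tie together the two SW1 tables given for the 120 Series above, the sketch below decodes a full SW1 setting into the I/O base, RAM base and Boot PROM address. The function name `decode_cn120_sw1` is hypothetical. Switches 1 and 2 are only accepted in the documented ON position, since their effect when OFF is untested according to the note above.

```python
IO_BASES = (0x260, 0x290, 0x2E0, 0x2F0, 0x300, 0x350, 0x380, 0x3E0)
RAM_BASES = (0xC0000, 0xC4000, 0xCC000, 0xD0000,
             0xD4000, 0xD8000, 0xDC000, 0xE0000)


def decode_cn120_sw1(switches):
    """Decode SW1 (eight "ON"/"OFF" strings, switch 1 first).

    Returns a tuple (io_base, ram_base, rom_base).
    """
    if len(switches) != 8 or any(s not in ("ON", "OFF") for s in switches):
        raise ValueError('expected eight entries, each "ON" or "OFF"')
    bits = [1 if s == "OFF" else 0 for s in switches]  # Off means "1"
    if bits[0] or bits[1]:
        raise ValueError("switches 1 and 2 are only documented as ON")
    # Switches 3-5 select the memory base, switch 3 being the low bit.
    ram_base = RAM_BASES[bits[2] | bits[3] << 1 | bits[4] << 2]
    # Switches 6-8 select the I/O base, switch 6 being the low bit.
    io_base = IO_BASES[bits[5] | bits[6] << 1 | bits[7] << 2]
    # The Boot PROM sits 8K (0x2000) above the memory base.
    return io_base, ram_base, ram_base + 0x2000


if __name__ == "__main__":
    default = ["ON", "ON", "OFF", "OFF", "ON", "ON", "OFF", "ON"]
    print([hex(v) for v in decode_cn120_sw1(default)])
```

The tables are kept as plain tuples on purpose: the addresses are not evenly spaced, so a lookup is safer than arithmetic.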
+ + +CNet Technology Inc. +==================== + +160 Series (16-bit cards) +------------------------- + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original CNet Manual + + "ARCNET USER'S MANUAL for + CN160A CN160AB CN160TP + P/N:12-01-0006 Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +- P/N 160A ARCNET 16 bit XT/AT Star +- P/N 160AB ARCNET 16 bit XT/AT Bus +- P/N 160TP ARCNET 16 bit XT/AT Twisted Pair + +:: + + ___________________________________________________________________ + < _________________________ ___| + > |oo| JP2 | | LED |___| + < |oo| JP1 | 9026 | LED |___| + > |_________________________| ___| + < N | | ID7 + > 1 o | | ID6 + < 1 2 3 4 5 6 7 8 9 0 d | S | ID5 + > _______________ _____________________ e | W | ID4 + < | PROM | | SW1 | A | 2 | ID3 + > > SOCKET | |_____________________| d | | ID2 + < |_______________| | IO-Base | MEM | d | | ID1 + > r |___| ID0 + < ____| + > | | + < | J1 | + > | | + < |____| + > 1 1 1 1 | + < 3 4 5 6 7 JP 8 9 0 1 2 3 | + > |o|o|o|o|o| |o|o|o|o|o|o| | + < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________| + > | | | + <____________| |_______________________________________| + +Legend:: + + 9026 ARCNET Probe + SW1 1-6: Base I/O Address Select + 7-10: Base Memory Address Select + SW2 1-8: Node ID Select (ID0-ID7) + JP1/JP2 ET1, ET2 Timeout Parameters + JP3-JP13 Interrupt Select + J1 BNC RG62/U Connector (CN160A/AB only) + J1 Two 6-position Telephone Jack (CN160TP only) + LED + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). 
+ +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first six switches in switch block SW1 are used to select the I/O Base +address using the following table:: + + Switch | Hex I/O + 1 2 3 4 5 6 | Address + ------------------------|-------- + OFF ON ON OFF OFF ON | 260 + OFF ON OFF ON ON OFF | 290 + OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default) + OFF ON OFF OFF OFF OFF | 2F0 + OFF OFF ON ON ON ON | 300 + OFF OFF ON OFF ON OFF | 350 + OFF OFF OFF ON ON ON | 380 + OFF OFF OFF OFF OFF ON | 3E0 + +Note: Other IO-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 7-10 of switch block SW1 are used to select the Memory +Base address of the RAM (2K) and the PROM:: + + Switch | Hex RAM | Hex ROM + 7 8 9 10 | Address | Address + ----------------|---------|----------- + OFF OFF ON ON | C0000 | C8000 + OFF OFF ON OFF | D0000 | D8000 (Default) + OFF OFF OFF ON | E0000 | E8000 + +.. note:: + + Other MEM-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level install one (only one!) 
of the jumpers +JP3 through JP13 using the following table:: + + Jumper | IRQ + -------|----------------- + 3 | 14 + 4 | 15 + 5 | 12 + 6 | 11 + 7 | 10 + 8 | 3 + 9 | 4 + 10 | 5 + 11 | 6 + 12 | 7 + 13 | 2 (=9) Default! + +.. note:: + + - Do not use JP11=IRQ6, it may conflict with your Floppy Disk + Controller + - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL- + Hard Disk, it may conflict with their controllers + + +Setting the Timeout Parameters +------------------------------ + +The jumpers labeled JP1 and JP2 are used to determine the timeout +parameters. These two jumpers are normally left open. + + +Lantech +======= + +8-bit card, unknown model +------------------------- + - from Vlad Lungu - his e-mail address seemed broken at + the time I tried to reach him. Sorry Vlad, if you didn't get my reply. + +:: + + ________________________________________________________________ + | 1 8 | + | ___________ __| + | | SW1 | LED |__| + | |__________| | + | ___| + | _____________________ |S | 8 + | | | |W | + | | | |2 | + | | | |__| 1 + | | UM9065L | |o| JP4 ____|____ + | | | |o| | CN | + | | | |________| + | | | | + | |___________________| | + | | + | | + | _____________ | + | | | | + | | PROM | |ooooo| JP6 | + | |____________| |ooooo| | + |_____________ _ _| + |____________________________________________| |__| + + +UM9065L : ARCnet Controller + +SW 1 : Shared Memory Address and I/O Base + +:: + + ON=0 + + 12345|Memory Address + -----|-------------- + 00001| D4000 + 00010| CC000 + 00110| D0000 + 01110| D1000 + 01101| D9000 + 10010| CC800 + 10011| DC800 + 11110| D1800 + +It seems that the bits are considered in reverse order. Also, you must +observe that some of those addresses are unusual and I didn't probe them; I +used a memory dump in DOS to identify them. For the 00000 configuration and +some others that I didn't write here the card seems to conflict with the +video card (an S3 GENDAC). I leave the full decoding of those addresses to +you. 
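One possible reading of the "reverse order" remark, offered only as a guess: if the five memory switches are read backwards and then interpreted like switches 4-8 of the SMC PC130E table earlier in this file (three bits of 16K base, two bits of 2K offset), all eight rows observed above come out right. The sketch below checks exactly that. The name `lantech_memory_guess` is hypothetical, and settings outside the observed rows are unverified.

```python
# 16K block bases, in the order used by the SMC PC130E table (switches 4-6).
BASES = (0xC0000, 0xC4000, 0xCC000, 0xD0000,
         0xD4000, 0xD8000, 0xDC000, 0xE0000)

# Rows reported for the Lantech card (switches 1-5, ON=0).
OBSERVED = {
    "00001": 0xD4000, "00010": 0xCC000, "00110": 0xD0000, "01110": 0xD1000,
    "01101": 0xD9000, "10010": 0xCC800, "10011": 0xDC800, "11110": 0xD1800,
}


def lantech_memory_guess(sw: str) -> int:
    """Guess the shared memory address for SW1 switches 1-5 ('0'/'1', switch 1 first)."""
    if len(sw) != 5 or set(sw) - {"0", "1"}:
        raise ValueError("expected five characters, each '0' or '1'")
    rev = sw[::-1]                    # read the switches in reverse order
    base = BASES[int(rev[:3], 2)]     # first three bits pick the 16K block
    offset = int(rev[3:], 2) * 0x800  # last two bits pick the 2K slot
    return base + offset


if __name__ == "__main__":
    for setting, address in OBSERVED.items():
        print(setting, hex(lantech_memory_guess(setting)), hex(address))
```

A match on eight rows is suggestive, not proof, so probe any other setting with care.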
+ +:: + + 678| I/O Address + ---|------------ + 000| 260 + 001| failed probe + 010| 2E0 + 011| 380 + 100| 290 + 101| 350 + 110| failed probe + 111| 3E0 + + SW 2 : Node ID (binary coded) + + JP 4 : Boot PROM enable CLOSE - enabled + OPEN - disabled + + JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6) + + +Acer +==== + +8-bit card, Model 5210-003 +-------------------------- + + - from Vojtech Pavlik using portions of the existing + arcnet-hardware file. + +This is a 90C26 based card. Its configuration seems similar to the SMC +PC100, but has some additional jumpers I don't know the meaning of. + +:: + + __ + | | + ___________|__|_________________________ + | | | | + | | BNC | | + | |______| ___| + | _____________________ |___ + | | | | + | | Hybrid IC | | + | | | o|o J1 | + | |_____________________| 8|8 | + | 8|8 J5 | + | o|o | + | 8|8 | + |__ 8|8 | + (|__| LED o|o | + | 8|8 | + | 8|8 J15 | + | | + | _____ | + | | | _____ | + | | | | | ___| + | | | | | | + | _____ | ROM | | UFS | | + | | | | | | | | + | | | ___ | | | | | + | | | | | |__.__| |__.__| | + | | NCR | |XTL| _____ _____ | + | | | |___| | | | | | + | |90C26| | | | | | + | | | | RAM | | UFS | | + | | | J17 o|o | | | | | + | | | J16 o|o | | | | | + | |__.__| |__.__| |__.__| | + | ___ | + | | |8 | + | |SW2| | + | | | | + | |___|1 | + | ___ | + | | |10 J18 o|o | + | | | o|o | + | |SW1| o|o | + | | | J21 o|o | + | |___|1 | + | | + |____________________________________| + + +Legend:: + + 90C26 ARCNET Chip + XTL 20 MHz Crystal + SW1 1-6 Base I/O Address Select + 7-10 Memory Address Select + SW2 1-8 Node ID Select (ID0-ID7) + J1-J5 IRQ Select + J6-J21 Unknown (Probably extra timeouts & ROM enable ...) + LED1 Activity LED + BNC Coax connector (STAR ARCnet) + RAM 2k of SRAM + ROM Boot ROM socket + UFS Unidentified Flying Sockets + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. 
+Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to OFF means "1", ON means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Don't set this to 0 or 255; these values are reserved. + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 1 to 6 of switch block SW1 are used to select one +of 32 possible I/O Base addresses using the following tables:: + + | Hex + Switch | Value + -------|------- + 1 | 200 + 2 | 100 + 3 | 80 + 4 | 40 + 5 | 20 + 6 | 10 + +The I/O address is sum of all switches set to "1". Remember that +the I/O address space bellow 0x200 is RESERVED for mainboard, so +switch 1 should be ALWAYS SET TO OFF. + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of sixteen positions. However, the addresses below +A0000 are likely to cause system hang because there's main RAM. + +Jumpers 7-10 of switch block SW1 select the Memory Base address:: + + Switch | Hex RAM + 7 8 9 10 | Address + ----------------|--------- + OFF OFF OFF OFF | F0000 (conflicts with main BIOS) + OFF OFF OFF ON | E0000 + OFF OFF ON OFF | D0000 + OFF OFF ON ON | C0000 (conflicts with video BIOS) + OFF ON OFF OFF | B0000 (conflicts with mono video) + OFF ON OFF ON | A0000 (conflicts with graphics) + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means +shorted, OFF means open:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 7 + OFF ON OFF OFF OFF | 5 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 3 + OFF OFF OFF OFF ON | 2 + + +Unknown jumpers & sockets +^^^^^^^^^^^^^^^^^^^^^^^^^ + +I know nothing about these. 
I just guess that J16&J17 are timeout +jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and +J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't +guess the purpose. + +Datapoint? +========== + +LAN-ARC-8, an 8-bit card +------------------------ + + - from Vojtech Pavlik + +This is another SMC 90C65-based ARCnet card. I couldn't identify the +manufacturer, but it might be DataPoint, because the card has the +original arcNet logo in its upper right corner. + +:: + + _______________________________________________________ + | _________ | + | | SW2 | ON arcNet | + | |_________| OFF ___| + | _____________ 1 ______ 8 | | 8 + | | | SW1 | XTAL | ____________ | S | + | > RAM (2k) | |______|| | | W | + | |_____________| | H | | 3 | + | _________|_____ y | |___| 1 + | _________ | | |b | | + | |_________| | | |r | | + | | SMC | |i | | + | | 90C65| |d | | + | _________ | | | | | + | | SW1 | ON | | |I | | + | |_________| OFF |_________|_____/C | _____| + | 1 8 | | | |___ + | ______________ | | | BNC |___| + | | | |____________| |_____| + | > EPROM SOCKET | _____________ | + | |______________| |_____________| | + | ______________| + | | + |________________________________________| + +Legend:: + + 90C65 ARCNET Chip + SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select + SW2 1-8: Node ID Select + SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable + BNC Coax connector + XTAL 20 MHz Crystal + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW3 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". 
+ +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON. + +The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address. 
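The memory table above for the LAN-ARC-8 is not a linear encoding (C8000 is skipped), so a lookup is the safest way to express it. What follows is only a minimal sketch, assuming switches 1 and 2 of SW1 stay ON as in every row of the table; the names `RAM_BASE`, `ROM_OFFSET` and `decode_memory` are hypothetical and not part of any driver.

```python
# Hypothetical helper: maps SW1 switches 3-5 of the LAN-ARC-8 to the
# RAM and ROM base addresses listed in the table above.
# Keys are (sw3, sw4, sw5) with True = ON, False = OFF.
# Assumption: switches 1 and 2 are ON, as in every row of the table.
RAM_BASE = {
    (True,  True,  True):  0xC0000,
    (False, True,  True):  0xC4000,
    (True,  False, True):  0xCC000,
    (False, False, True):  0xD0000,  # manufacturer's default
    (True,  True,  False): 0xD4000,
    (False, True,  False): 0xD8000,
    (True,  False, False): 0xDC000,
    (False, False, False): 0xE0000,
}

# The text states that the Boot PROM sits at memory base + 0x2000.
ROM_OFFSET = 0x2000


def decode_memory(sw3, sw4, sw5):
    """Return (ram_base, rom_base) for the given SW1 3-5 positions."""
    ram = RAM_BASE[(sw3, sw4, sw5)]
    return ram, ram + ROM_OFFSET


if __name__ == "__main__":
    ram, rom = decode_memory(False, False, True)
    print(f"RAM {ram:05X}, ROM {rom:05X}")
```

The sketch deliberately ignores switches 1 and 2, since the text only guesses that they add 0x0800 and 0x1000 to the RAM base.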
+ + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Switches 1-5 of the switch block SW3 control the IRQ level:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 3 + OFF ON OFF OFF OFF | 4 + OFF OFF ON OFF OFF | 5 + OFF OFF OFF ON OFF | 7 + OFF OFF OFF OFF ON | 2 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 6-7 of the switch block SW3 are used to determine the timeout +parameters. These two switches are normally left in the OFF position. + + +Topware +======= + +8-bit card, TA-ARC/10 +--------------------- + + - from Vojtech Pavlik + +This is another very similar 90C65 card. Most of the switches and jumpers +are the same as on other clones. + +:: + + _____________________________________________________________________ + | ___________ | | ______ | + | |SW2 NODE ID| | | | XTAL | | + | |___________| | Hybrid IC | |______| | + | ___________ | | __| + | |SW1 MEM+I/O| |_________________________| LED1|__|) + | |___________| 1 2 | + | J3 |o|o| TIMEOUT ______| + | ______________ |o|o| | | + | | | ___________________ | RJ | + | > EPROM SOCKET | | \ |------| + |J2 |______________| | | | | + ||o| | | |______| + ||o| ROM ENABLE | SMC | _________ | + | _____________ | 90C65 | |_________| _____| + | | | | | | |___ + | > RAM (2k) | | | | BNC |___| + | |_____________| | | |_____| + | |____________________| | + | ________ IRQ 2 3 4 5 7 ___________ | + ||________| |o|o|o|o|o| |___________| | + |________ J1|o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Chip + XTAL 20 MHz Crystal + SW1 1-5 Base Memory Address Select + 6-8 Base I/O Address Select + SW2 1-8 Node ID Select (ID0-ID7) + J1 IRQ Select + J2 ROM Enable + J3 Extra Timeout + LED1 Activity LED + BNC Coax connector (BUS ARCnet) + RJ Twisted Pair Connector (daisy chain) + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. 
Each node attached to +the network must have an unique node ID which must not be 0. Switch 1 (ID0) +serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 (Manufacturer's default) + OFF ON ON | 290 + ON OFF ON | 2E0 + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default) + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM short the jumper J2. + +The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block J1 control the IRQ level. 
ON means +shorted, OFF means open:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers J3 are used to set the timeout parameters. These two +jumpers are normally left open. + +Thomas-Conrad +============= + +Model #500-6242-0097 REV A (8-bit card) +--------------------------------------- + + - from Lars Karlsson <100617.3473@compuserve.com> + +:: + + ________________________________________________________ + | ________ ________ |_____ + | |........| |........| | + | |________| |________| ___| + | SW 3 SW 1 | | + | Base I/O Base Addr. Station | | + | address | | + | ______ switch | | + | | | | | + | | | |___| + | | | ______ |___._ + | |______| |______| ____| BNC + | Jumper- _____| Connector + | Main chip block _ __| ' + | | | | RJ Connector + | |_| | with 110 Ohm + | |__ Terminator + | ___________ __| + | |...........| | RJ-jack + | |...........| _____ | (unused) + | |___________| |_____| |__ + | Boot PROM socket IRQ-jumpers |_ Diagnostic + |________ __ _| LED (red) + | | | | | | | | | | | | | | | | | | | | | | + | | | | | | | | | | | | | | | | | | | | |________| + | + | + +And here are the settings for some of the switches and jumpers on the cards. + +:: + + I/O + + 1 2 3 4 5 6 7 8 + + 2E0----- 0 0 0 1 0 0 0 1 + 2F0----- 0 0 0 1 0 0 0 0 + 300----- 0 0 0 0 1 1 1 1 + 350----- 0 0 0 0 1 1 1 0 + +"0" in the above example means switch is off "1" means that it is on. + +:: + + ShMem address. + + 1 2 3 4 5 6 7 8 + + CX00--0 0 1 1 | | | + DX00--0 0 1 0 | + X000--------- 1 1 | + X400--------- 1 0 | + X800--------- 0 1 | + XC00--------- 0 0 + ENHANCED----------- 1 + COMPATIBLE--------- 0 + +:: + + IRQ + + + 3 4 5 7 2 + . . . . . + . . . . . + + +There is a DIP-switch with 8 switches, used to set the shared memory address +to be used. 
The first 6 switches set the address, the 7th doesn't have any +function, and the 8th switch is used to select "compatible" or "enhanced". +When I got my two cards, one of them had this switch set to "enhanced". That +card didn't work at all, it wasn't even recognized by the driver. The other +card had this switch set to "compatible" and it behaved absolutely normally. I +guess that the switch on one of the cards, must have been changed accidentally +when the card was taken out of its former host. The question remains +unanswered, what is the purpose of the "enhanced" position? + +[Avery's note: "enhanced" probably either disables shared memory (use IO +ports instead) or disables IO ports (use memory addresses instead). This +varies by the type of card involved. I fail to see how either of these +enhance anything. Send me more detailed information about this mode, or +just use "compatible" mode instead.] + +Waterloo Microsystems Inc. ?? +============================= + +8-bit card (C) 1985 +------------------- + - from Robert Michael Best + +[Avery's note: these don't work with my driver for some reason. These cards +SEEM to have settings similar to the PDI508Plus, which is +software-configured and doesn't work with my driver either. The "Waterloo +chip" is a boot PROM, probably designed specifically for the University of +Waterloo. If you have any further information about this card, please +e-mail me.] + +The probe has not been able to detect the card on any of the J2 settings, +and I tried them again with the "Waterloo" chip removed. + +:: + + _____________________________________________________________________ + | \/ \/ ___ __ __ | + | C4 C4 |^| | M || ^ ||^| | + | -- -- |_| | 5 || || | C3 | + | \/ \/ C10 |___|| ||_| | + | C4 C4 _ _ | | ?? 
| + | -- -- | \/ || | | + | | || | | + | | || C1 | | + | | || | \/ _____| + | | C6 || | C9 | |___ + | | || | -- | BNC |___| + | | || | >C7| |_____| + | | || | | + | __ __ |____||_____| 1 2 3 6 | + || ^ | >C4| |o|o|o|o|o|o| J2 >C4| | + || | |o|o|o|o|o|o| | + || C2 | >C4| >C4| | + || | >C8| | + || | 2 3 4 5 6 7 IRQ >C4| | + ||_____| |o|o|o|o|o|o| J3 | + |_______ |o|o|o|o|o|o| _______________| + | | + |_____________________________________________| + + C1 -- "COM9026 + SMC 8638" + In a chip socket. + + C2 -- "@Copyright + Waterloo Microsystems Inc. + 1985" + In a chip Socket with info printed on a label covering a round window + showing the circuit inside. (The window indicates it is an EPROM chip.) + + C3 -- "COM9032 + SMC 8643" + In a chip socket. + + C4 -- "74LS" + 9 total no sockets. + + M5 -- "50006-136 + 20.000000 MHZ + MTQ-T1-S3 + 0 M-TRON 86-40" + Metallic case with 4 pins, no socket. + + C6 -- "MOSTEK@TC8643 + MK6116N-20 + MALAYSIA" + No socket. + + C7 -- No stamp or label but in a 20 pin chip socket. + + C8 -- "PAL10L8CN + 8623" + In a 20 pin socket. + + C9 -- "PAl16R4A-2CN + 8641" + In a 20 pin socket. + + C10 -- "M8640 + NMC + 9306N" + In an 8 pin socket. + + ?? -- Some components on a smaller board and attached with 20 pins all + along the side closest to the BNC connector. The are coated in a dark + resin. + +On the board there are two jumper banks labeled J2 and J3. The +manufacturer didn't put a J1 on the board. The two boards I have both +came with a jumper box for each bank. + +:: + + J2 -- Numbered 1 2 3 4 5 6. + 4 and 5 are not stamped due to solder points. + + J3 -- IRQ 2 3 4 5 6 7 + +The board itself has a maple leaf stamped just above the irq jumpers +and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986 +CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector. 
+Below that "MADE IN CANADA" + +No Name +======= + +8-bit cards, 16-bit cards +------------------------- + + - from Juergen Seifert + +I have named this ARCnet card "NONAME", since there is no name of any +manufacturer on the Installation manual nor on the shipping box. The only +hint to the existence of a manufacturer at all is written in copper, +it is "Made in Taiwan" + +This description has been written by Juergen Seifert +using information from the Original + + "ARCnet Installation Manual" + +:: + + ________________________________________________________________ + | |STAR| BUS| T/P| | + | |____|____|____| | + | _____________________ | + | | | | + | | | | + | | | | + | | SMC | | + | | | | + | | COM90C65 | | + | | | | + | | | | + | |__________-__________| | + | _____| + | _______________ | CN | + | | PROM | |_____| + | > SOCKET | | + | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 | + | _______________ _______________ | + | |o|o|o|o|o|o|o|o| | SW1 || SW2 || + | |o|o|o|o|o|o|o|o| |_______________||_______________|| + |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____| + | \ IRQ / T T O | + |__________________1_2_M______________________| + +Legend:: + + COM90C65: ARCnet Probe + S1 1-8: Node ID Select + S2 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select + ET1, ET2 Extended Timeout Select + ROM ROM Enable Select + CN RG62 Coax Connector + STAR| BUS | T/P Three fields for placing a sign (colored circle) + indicating the topology of the card + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW1 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). 
+ +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 1 2 3 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group SW2 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group SW2. 
+ +:: + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + *) To enable the 8K Boot PROM install the jumper ROM. + The default is jumper ROM not installed. + + +Setting Interrupt Request Lines (IRQ) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2. + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +The two jumpers labeled ET1 and ET2 are used to determine the timeout +parameters (response and reconfiguration time). Every node in a network +must be set to the same timeout values. 
+ +:: + + ET1 ET2 | Response Time (us) | Reconfiguration Time (ms) + --------|--------------------|-------------------------- + Off Off | 78 | 840 (Default) + Off On | 285 | 1680 + On Off | 563 | 1680 + On On | 1130 | 1680 + +On means jumper installed, Off means jumper not installed + + +16-BIT ARCNET +------------- + +The manual of my 8-Bit NONAME ARCnet Card contains another description +of a 16-Bit Coax / Twisted Pair Card. This description is incomplete, +because there are missing two pages in the manual booklet. (The table +of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside +the booklet there is a different way of counting ... 2-9, 2-10, A-1, +(empty page), 3-1, ..., 3-18, A-1 (again), A-2) +Also the picture of the board layout is not as good as the picture of +8-Bit card, because there isn't any letter like "SW1" written to the +picture. + +Should somebody have such a board, please feel free to complete this +description or to send a mail to me! + +This description has been written by Juergen Seifert +using information from the Original + + "ARCnet Installation Manual" + +:: + + ___________________________________________________________________ + < _________________ _________________ | + > | SW? || SW? | | + < |_________________||_________________| | + > ____________________ | + < | | | + > | | | + < | | | + > | | | + < | | | + > | | | + < | | | + > |____________________| | + < ____| + > ____________________ | | + < | | | J1 | + > | < | | + < |____________________| ? ? ? ? ? ? |____| + > |o|o|o|o|o|o| | + < |o|o|o|o|o|o| | + > | + < __ ___________| + > | | | + <____________| |_______________________________________| + + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW2 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). 
+ +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 3 2 1 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 6-8 of switch group SW1 select the Base of the 16K block. 
+Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 4 and 5 of group SW1:: + + Switch | Hex RAM | Hex ROM + 8 7 6 5 4 | Address | Address + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + +Setting Interrupt Request Lines (IRQ) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +?????????????????????????????????????? + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +?????????????????????????????????????? + + +8-bit cards ("Made in Taiwan R.O.C.") +------------------------------------- + + - from Vojtech Pavlik + +I have named this ARCnet card "NONAME", since I got only the card with +no manual at all and the only text identifying the manufacturer is +"MADE IN TAIWAN R.O.C" printed on the card. 
+ +:: + + ____________________________________________________________ + | 1 2 3 4 5 6 7 8 | + | |o|o| JP1 o|o|o|o|o|o|o|o| ON | + | + o|o|o|o|o|o|o|o| ___| + | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7 + | | | SW1 | | | | ID6 + | > RAM (2k) | ____________________ | H | | S | ID5 + | |_____________| | || y | | W | ID4 + | | || b | | 2 | ID3 + | | || r | | | ID2 + | | || i | | | ID1 + | | 90C65 || d | |___| ID0 + | SW3 | || | | + | |o|o|o|o|o|o|o|o| ON | || I | | + | |o|o|o|o|o|o|o|o| | || C | | + | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____| + | 1 2 3 4 5 6 7 8 | | | |___ + | ______________ | | | BNC |___| + | | | |_____| |_____| + | > EPROM SOCKET | | + | |______________| | + | ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Chip + SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select + SW2 1-8: Node ID Select (ID0-ID7) + SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable + JP1 Led connector + BNC Coax connector + +Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not +switches. + +Setting the jumpers to ON means connecting the upper two pins, off the bottom +two - or - in case of IRQ setting, connecting none of them at all. + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". 
+ +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of jumper block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON. 
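The node ID rule given earlier for this card (OFF means "1", switch 1 / ID0 is the least significant bit, 0 is not allowed) can be sketched as a small function. This is an illustration only; the name `node_id` and its argument layout are assumptions made here, not something taken from the card or the driver.

```python
def node_id(switches_off):
    """Compute the node ID from SW2.

    switches_off: eight booleans for switches 1..8 (ID0..ID7),
    True when the switch is OFF (which the text defines as "1").
    """
    bits = list(switches_off)
    if len(bits) != 8:
        raise ValueError("SW2 has exactly eight switches")
    # Switch 1 (index 0) has value 1, switch 8 (index 7) has value 128.
    value = sum(1 << i for i, off in enumerate(bits) if off)
    if value == 0:
        raise ValueError("node ID 0 is not allowed")
    return value


if __name__ == "__main__":
    # Switches 1, 3, 5 and 7 OFF: the 0x55 row of the examples table.
    print(hex(node_id([True, False, True, False, True, False, True, False])))
```

The argument order is switch 1 first, which is the reverse of the column order (8 down to 1) used in the examples table.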
+ +The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders. + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block SW3 control the IRQ level:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers 6-7 of the jumper block SW3 are used to determine the timeout +parameters. These two jumpers are normally left in the OFF position. + + + +(Generic Model 9058) +-------------------- + - from Andrew J. Kroll + - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a + year!) + +:: + + _____ + | < + | .---' + ________________________________________________________________ | | + | | SW2 | | | + | ___________ |_____________| | | + | | | 1 2 3 4 5 6 ___| | + | > 6116 RAM | _________ 8 | | | + | |___________| |20MHzXtal| 7 | | | + | |_________| __________ 6 | S | | + | 74LS373 | |- 5 | W | | + | _________ | E |- 4 | | | + | >_______| ______________|..... 
P |- 3 | 3 | | + | | | : O |- 2 | | | + | | | : X |- 1 |___| | + | ________________ | | : Y |- | | + | | SW1 | | SL90C65 | : |- | | + | |________________| | | : B |- | | + | 1 2 3 4 5 6 7 8 | | : O |- | | + | |_________o____|..../ A |- _______| | + | ____________________ | R |- | |------, + | | | | D |- | BNC | # | + | > 2764 PROM SOCKET | |__________|- |_______|------' + | |____________________| _________ | | + | >________| <- 74LS245 | | + | | | + |___ ______________| | + |H H H H H H H H H H H H H H H H H H H H H H H| | | + |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | | + \| + +Legend:: + + SL90C65 ARCNET Controller / Transceiver /Logic + SW1 1-5: IRQ Select + 6: ET1 + 7: ET2 + 8: ROM ENABLE + SW2 1-3: Memory Buffer/PROM Address + 3-6: I/O Address Map + SW3 1-8: Node ID Select + BNC BNC RG62/U Connection + *I* have had success using RG59B/U with *NO* terminators! + What gives?! + +SW1: Timeouts, Interrupt and ROM +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level set one (only one!) of the dip switches +up (on) SW1...(switches 1-5) +IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2. + +The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7) +are used to determine the timeout parameters. These two dip switches +are normally left off (down). + + To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM. + The default is jumper ROM not installed. + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 4 5 6 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory Address (RAM & ROM) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. 
The base of this +16K block can be located in any of eight positions. +Switches 1-3 of switch group SW2 select the Base of the 16K block. +(0 = DOWN, 1 = UP) +I could, however, only verify two settings... + + +:: + + Switch| Hex RAM | Hex ROM + 1 2 3 | Address | Address + ------|---------|----------- + 0 0 0 | E0000 | E2000 + 0 0 1 | D0000 | D2000 (Manufacturer's default) + 0 1 0 | ????? | ????? + 0 1 1 | ????? | ????? + 1 0 0 | ????? | ????? + 1 0 1 | ????? | ????? + 1 1 0 | ????? | ????? + 1 1 1 | ????? | ????? + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW3 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 1 serves as the least significant bit (LSB). +switches in the DOWN position are OFF (0) and in the UP position are ON (1) + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples:: + + Switch# | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed <-. + 0 0 0 0 0 0 0 1 | 1 | 1 | + 0 0 0 0 0 0 1 0 | 2 | 2 | + 0 0 0 0 0 0 1 1 | 3 | 3 | + . . . | | | + 0 1 0 1 0 1 0 1 | 55 | 85 | + . . . | | + Don't use 0 or 255! + 1 0 1 0 1 0 1 0 | AA | 170 | + . . . | | | + 1 1 1 1 1 1 0 1 | FD | 253 | + 1 1 1 1 1 1 1 0 | FE | 254 | + 1 1 1 1 1 1 1 1 | FF | 255 <-' + + +Tiara +===== + +(model unknown) +--------------- + + - from Christoph Lameter + + +Here is information about my card as far as I could figure it out:: + + + ----------------------------------------------- tiara + Tiara LanCard of Tiara Computer Systems. + + +----------------------------------------------+ + ! ! Transmitter Unit ! ! + ! +------------------+ ------- + ! MEM Coax Connector + ! ROM 7654321 <- I/O ------- + ! : : +--------+ ! + ! : : ! 90C66LJ! +++ + ! : : ! ! 
!D Switch to set + ! : : ! ! !I the Nodenumber + ! : : +--------+ !P + ! !++ + ! 234567 <- IRQ ! + +------------!!!!!!!!!!!!!!!!!!!!!!!!--------+ + !!!!!!!!!!!!!!!!!!!!!!!! + +- 0 = Jumper Installed +- 1 = Open + +Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O + +Settings for Memory Location (Top Jumper Line) + +=== ================ +456 Address selected +=== ================ +000 C0000 +001 C4000 +010 CC000 +011 D0000 +100 D4000 +101 D8000 +110 DC000 +111 E0000 +=== ================ + +Settings for I/O Address (Top Jumper Line) + +=== ==== +123 Port +=== ==== +000 260 +001 290 +010 2E0 +011 2F0 +100 300 +101 350 +110 380 +111 3E0 +=== ==== + +Settings for IRQ Selection (Lower Jumper Line) + +====== ===== +234567 +====== ===== +011111 IRQ 2 +101111 IRQ 3 +110111 IRQ 4 +111011 IRQ 5 +111110 IRQ 7 +====== ===== + +Other Cards +=========== + +I have no information on other models of ARCnet cards at the moment. Please +send any and all info to: + + apenwarr@worldvisions.ca + +Thanks. diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.txt deleted file mode 100644 index 731de411513c..000000000000 --- a/Documentation/networking/arcnet-hardware.txt +++ /dev/null @@ -1,3133 +0,0 @@ - ------------------------------------------------------------------------------ -1) This file is a supplement to arcnet.txt. Please read that for general - driver configuration help. ------------------------------------------------------------------------------ -2) This file is no longer Linux-specific. It should probably be moved out of - the kernel sources. Ideas? ------------------------------------------------------------------------------ - -Because so many people (myself included) seem to have obtained ARCnet cards -without manuals, this file contains a quick introduction to ARCnet hardware, -some cabling tips, and a listing of all jumper settings I can find. 
Please -e-mail apenwarr@worldvisions.ca with any settings for your particular card, -or any other information you have! - - -INTRODUCTION TO ARCNET ----------------------- - -ARCnet is a network type which works in a way similar to popular Ethernet -networks but which is also different in some very important ways. - -First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps -(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact, -there are others as well, but these are less common. The different hardware -types, as far as I'm aware, are not compatible and so you cannot wire a -100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does -work with 100 Mbps cards, but I haven't been able to verify this myself, -since I only have the 2.5 Mbps variety. It is probably not going to saturate -your 100 Mbps card. Stop complaining. :) - -You also cannot connect an ARCnet card to any kind of Ethernet card and -expect it to work. - -There are two "types" of ARCnet - STAR topology and BUS topology. This -refers to how the cards are meant to be wired together. According to most -available documentation, you can only connect STAR cards to STAR cards and -BUS cards to BUS cards. That makes sense, right? Well, it's not quite -true; see below under "Cabling." - -Once you get past these little stumbling blocks, ARCnet is actually quite a -well-designed standard. It uses something called "modified token passing" -which makes it completely incompatible with so-called "Token Ring" cards, -but which makes transfers much more reliable than Ethernet does. In fact, -ARCnet will guarantee that a packet arrives safely at the destination, and -even if it can't possibly be delivered properly (ie. because of a cable -break, or because the destination computer does not exist) it will at least -tell the sender about it. 
- -Because of the carefully defined action of the "token", it will always make -a pass around the "ring" within a maximum length of time. This makes it -useful for realtime networks. - -In addition, all known ARCnet cards have an (almost) identical programming -interface. This means that with one ARCnet driver you can support any -card, whereas with Ethernet each manufacturer uses what is sometimes a -completely different programming interface, leading to a lot of different, -sometimes very similar, Ethernet drivers. Of course, always using the same -programming interface also means that when high-performance hardware -facilities like PCI bus mastering DMA appear, it's hard to take advantage of -them. Let's not go into that. - -One thing that makes ARCnet cards difficult to program for, however, is the -limit on their packet sizes; standard ARCnet can only send packets that are -up to 508 bytes in length. This is smaller than the Internet "bare minimum" -of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra -level of encapsulation is defined by RFC1201, which I call "packet -splitting," that allows "virtual packets" to grow as large as 64K each, -although they are generally kept down to the Ethernet-style 1500 bytes. - -For more information on the advantages and disadvantages (mostly the -advantages) of ARCnet networks, you might try the "ARCnet Trade Association" -WWW page: - http://www.arcnet.com - - -CABLING ARCNET NETWORKS ------------------------ - -This section was rewritten by - Vojtech Pavlik -using information from several people, including: - Avery Pennraun - Stephen A. Wood - John Paul Morrison - Joachim Koenig -and Avery touched it up a bit, at Vojtech's request. - -ARCnet (the classic 2.5 Mbps version) can be connected by two different -types of cabling: coax and twisted pair. The other ARCnet-type networks -(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of -cabling (Type1, Fiber, C1, C4, C5). 
- -For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables -also work fine, because ARCnet is a very stable network. I personally use 75 -Ohm TV antenna cable. - -Cards for coax cabling are shipped in two different variants: for BUS and -STAR network topologies. They are mostly the same. The only difference -lies in the hybrid chip installed. BUS cards use high impedance output, -while STAR use low impedance. Low impedance card (STAR) is electrically -equal to a high impedance one with a terminator installed. - -Usually, the ARCnet networks are built up from STAR cards and hubs. There -are two types of hubs - active and passive. Passive hubs are small boxes -with four BNC connectors containing four 47 Ohm resistors: - - | | wires - R + junction --R-+-R- R 47 Ohm resistors - R - | - -The shielding is connected together. Active hubs are much more complicated; -they are powered and contain electronics to amplify the signal and send it -to other segments of the net. They usually have eight connectors. Active -hubs come in two variants - dumb and smart. The dumb variant just -amplifies, but the smart one decodes to digital and encodes back all packets -coming through. This is much better if you have several hubs in the net, -since many dumb active hubs may worsen the signal quality. - -And now to the cabling. What you can connect together: - -1. A card to a card. This is the simplest way of creating a 2-computer - network. - -2. A card to a passive hub. Remember that all unused connectors on the hub - must be properly terminated with 93 Ohm (or something else if you don't - have the right ones) terminators. - (Avery's note: oops, I didn't know that. Mine (TV cable) works - anyway, though.) - -3. A card to an active hub. Here is no need to terminate the unused - connectors except some kind of aesthetic feeling. But, there may not be - more than eleven active hubs between any two computers. 
That of course - doesn't limit the number of active hubs on the network. - -4. An active hub to another. - -5. An active hub to passive hub. - -Remember that you cannot connect two passive hubs together. The power loss -implied by such a connection is too high for the net to operate reliably. - -An example of a typical ARCnet network: - - R S - STAR type card - S------H--------A-------S R - Terminator - | | H - Hub - | | A - Active hub - | S----H----S - S | - | - S - -The BUS topology is very similar to the one used by Ethernet. The only -difference is in cable and terminators: they should be 93 Ohm. Ethernet -uses 50 Ohm impedance. You use T connectors to put the computers on a single -line of cable, the bus. You have to put terminators at both ends of the -cable. A typical BUS ARCnet network looks like: - - RT----T------T------T------T------TR - B B B B B B - - B - BUS type card - R - Terminator - T - T connector - -But that is not all! The two types can be connected together. According to -the official documentation the only way of connecting them is using an active -hub: - - A------T------T------TR - | B B B - S---H---S - | - S - -The official docs also state that you can use STAR cards at the ends of -BUS network in place of a BUS card and a terminator: - - S------T------T------S - B B - -But, according to my own experiments, you can simply hang a BUS type card -anywhere in middle of a cable in a STAR topology network. And more - you -can use the bus card in place of any star card if you use a terminator. Then -you can build very complicated networks fulfilling all your needs! An -example: - - S - | - RT------T-------T------H------S - B B B | - | R - S------A------T-------T-------A-------H------TR - | B B | | B - | S BT | - | | | S----A-----S - S------H---A----S | | - | | S------T----H---S | - S S B R S - -A basically different cabling scheme is used with Twisted Pair cabling. Each -of the TP cards has two RJ (phone-cord style) connectors. 
The cards are -then daisy-chained together using a cable connecting every two neighboring -cards. The ends are terminated with RJ 93 Ohm terminators which plug into -the empty connectors of cards on the ends of the chain. An example: - - ___________ ___________ - _R_|_ _|_|_ _|_R_ - | | | | | | - |Card | |Card | |Card | - |_____| |_____| |_____| - - -There are also hubs for the TP topology. There is nothing difficult -involved in using them; you just connect a TP chain to a hub on any end or -even at both. This way you can create almost any network configuration. -The maximum of 11 hubs between any two computers on the net applies here as -well. An example: - - RP-------P--------P--------H-----P------P-----PR - | - RP-----H--------P--------H-----P------PR - | | - PR PR - - R - RJ Terminator - P - TP Card - H - TP Hub - -Like any network, ARCnet has a limited cable length. These are the maximum -cable lengths between two active ends (an active end being an active hub or -a STAR card). - - RG-62 93 Ohm up to 650 m - RG-59/U 75 Ohm up to 457 m - RG-11/U 75 Ohm up to 533 m - IBM Type 1 150 Ohm up to 200 m - IBM Type 3 100 Ohm up to 100 m - -The maximum length of all cables connected to a passive hub is limited to 65 -meters for RG-62 cabling; less for others. You can see that using passive -hubs in a large network is a bad idea. The maximum length of a single "BUS -Trunk" is about 300 meters for RG-62. The maximum distance between the two -most distant points of the net is limited to 3000 meters. The maximum length -of a TP cable between two cards/hubs is 650 meters. - - -SETTING THE JUMPERS -------------------- - -All ARCnet cards should have a total of four or five different settings: - - - the I/O address: this is the "port" your ARCnet card is on. Probed - values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If - your card has additional ones, which is possible, please tell me.) This - should not be the same as any other device on your system. 
According to - a doc I got from Novell, MS Windows prefers values of 0x300 or more, - eating net connections on my system (at least) otherwise. My guess is - this may be because, if your card is at 0x2E0, probing for a serial port - at 0x2E8 will reset the card and probably mess things up royally. - - Avery's favourite: 0x300. - - - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7. - on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15. - - Make sure this is different from any other card on your system. Note - that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can - "cat /proc/interrupts" for a somewhat complete list of which ones are in - use at any given time. Here is a list of common usages from Vojtech - Pavlik : - ("Not on bus" means there is no way for a card to generate this - interrupt) - IRQ 0 - Timer 0 (Not on bus) - IRQ 1 - Keyboard (Not on bus) - IRQ 2 - IRQ Controller 2 (Not on bus, nor does interrupt the CPU) - IRQ 3 - COM2 - IRQ 4 - COM1 - IRQ 5 - FREE (LPT2 if you have it; sometimes COM3; maybe PLIP) - IRQ 6 - Floppy disk controller - IRQ 7 - FREE (LPT1 if you don't use the polling driver; PLIP) - IRQ 8 - Realtime Clock Interrupt (Not on bus) - IRQ 9 - FREE (VGA vertical sync interrupt if enabled) - IRQ 10 - FREE - IRQ 11 - FREE - IRQ 12 - FREE - IRQ 13 - Numeric Coprocessor (Not on bus) - IRQ 14 - Fixed Disk Controller - IRQ 15 - FREE (Fixed Disk Controller 2 if you have it) - - Note: IRQ 9 is used on some video cards for the "vertical retrace" - interrupt. This interrupt would have been handy for things like - video games, as it occurs exactly once per screen refresh, but - unfortunately IBM cancelled this feature starting with the original - VGA and thus many VGA/SVGA cards do not support it. For this - reason, no modern software uses this interrupt and it can almost - always be safely disabled, if your video card supports it at all. 
- - If your card for some reason CANNOT disable this IRQ (usually there - is a jumper), one solution would be to clip the printed circuit - contact on the board: it's the fourth contact from the left on the - back side. I take no responsibility if you try this. - - - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though. - - - the memory address: Unlike most cards, ARCnets use "shared memory" for - copying buffers around. Make SURE it doesn't conflict with any other - used memory in your system! - A0000 - VGA graphics memory (ok if you don't have VGA) - B0000 - Monochrome text mode - C0000 \ One of these is your VGA BIOS - usually C0000. - E0000 / - F0000 - System BIOS - - Anything less than 0xA0000 is, well, a BAD idea since it isn't above - 640k. - - Avery's favourite: 0xD0000 - - - the station address: Every ARCnet card has its own "unique" network - address from 0 to 255. Unlike Ethernet, you can set this address - yourself with a jumper or switch (or on some cards, with special - software). Since it's only 8 bits, you can only have 254 ARCnet cards - on a network. DON'T use 0 or 255, since these are reserved (although - neat stuff will probably happen if you DO use them). By the way, if you - haven't already guessed, don't set this the same as any other ARCnet on - your network! - - Avery's favourite: 3 and 4. Not that it matters. - - - There may be ETS1 and ETS2 settings. These may or may not make a - difference on your card (many manuals call them "reserved"), but are - used to change the delays used when powering up a computer on the - network. This is only necessary when wiring VERY long range ARCnet - networks, on the order of 4km or so; in any case, the only real - requirement here is that all cards on the network with ETS1 and ETS2 - jumpers have them in the same position. 
Chris Hindy - sent in a chart with actual values for this: - ET1 ET2 Response Time Reconfiguration Time - --- --- ------------- -------------------- - open open 74.7us 840us - open closed 283.4us 1680us - closed open 561.8us 1680us - closed closed 1118.6us 1680us - - Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your - network. - -Also, on many cards (not mine, though) there are red and green LED's. -Vojtech Pavlik tells me this is what they mean: - GREEN RED Status - ----- --- ------ - OFF OFF Power off - OFF Short flashes Cabling problems (broken cable or not - terminated) - OFF (short) ON Card init - ON ON Normal state - everything OK, nothing - happens - ON Long flashes Data transfer - ON OFF Never happens (maybe when wrong ID) - - -The following is all the specific information people have sent me about -their own particular ARCnet cards. It is officially a mess, and contains -huge amounts of duplicated information. I have no time to fix it. If you -want to, PLEASE DO! Just send me a 'diff -u' of all your changes. - -The model # is listed right above specifics for that card, so you should be -able to use your text viewer's "search" function to find the entry you want. -If you don't KNOW what kind of card you have, try looking through the -various diagrams to see if you can tell. - -If your model isn't listed and/or has different settings, PLEASE PLEASE -tell me. I had to figure mine out without the manual, and it WASN'T FUN! - -Even if your ARCnet model isn't listed, but has the same jumpers as another -model that is, please e-mail me to say so. - -Cards Listed in this file (in this order, mostly): - - Manufacturer Model # Bits - ------------ ------- ---- - SMC PC100 8 - SMC PC110 8 - SMC PC120 8 - SMC PC130 8 - SMC PC270E 8 - SMC PC500 16 - SMC PC500Longboard 16 - SMC PC550Longboard 16 - SMC PC600 16 - SMC PC710 8 - SMC? LCS-8830(-T) 8/16 - Puredata PDI507 8 - CNet Tech CN120-Series 8 - CNet Tech CN160-Series 16 - Lantech? 
UM9065L chipset 8 - Acer 5210-003 8 - Datapoint? LAN-ARC-8 8 - Topware TA-ARC/10 8 - Thomas-Conrad 500-6242-0097 REV A 8 - Waterloo? (C)1985 Waterloo Micro. 8 - No Name -- 8/16 - No Name Taiwan R.O.C? 8 - No Name Model 9058 8 - Tiara Tiara Lancard? 8 - - -** SMC = Standard Microsystems Corp. -** CNet Tech = CNet Technology, Inc. - - -Unclassified Stuff ------------------- - - Please send any other information you can find. - - - And some other stuff (more info is welcome!): - From: root@ultraworld.xs4all.nl (Timo Hilbrink) - To: apenwarr@foxnet.net (Avery Pennarun) - Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT) - Reply-To: timoh@xs4all.nl - - [...parts deleted...] - - About the jumpers: On my PC130 there is one more jumper, located near the - cable-connector and it's for changing to star or bus topology; - closed: star - open: bus - On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI - and another with ALE,LA17,LA18,LA19 these are undocumented.. - - [...more parts deleted...] - - --- CUT --- - - -** Standard Microsystems Corp (SMC) ** -PC100, PC110, PC120, PC130 (8-bit cards) -PC500, PC600 (16-bit cards) ---------------------------------- - - mainly from Avery Pennarun . Values depicted - are from Avery's setup. - - special thanks to Timo Hilbrink for noting that PC120, - 130, 500, and 600 all have the same switches as Avery's PC100. - PC500/600 have several extra, undocumented pins though. (?) - - PC110 settings were verified by Stephen A. Wood - - Also, the JP- and S-numbers probably don't match your card exactly. Try - to find jumpers/switches with the same number of settings - it's - probably more reliable. - - - JP5 [|] : : : : -(IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7 - Put exactly one jumper on exactly one set of pins. - - - 1 2 3 4 5 6 7 8 9 10 - S1 /----------------------------------\ -(I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 | - addresses) \----------------------------------/ - |--| |--------| |--------| - (a) (b) (m) - - WARNING. 
It's very important when setting these which way - you're holding the card, and which way you think is '1'! - - If you suspect that your settings are not being made - correctly, try reversing the direction or inverting the - switch positions. - - a: The first digit of the I/O address. - Setting Value - ------- ----- - 00 0 - 01 1 - 10 2 - 11 3 - - b: The second digit of the I/O address. - Setting Value - ------- ----- - 0000 0 - 0001 1 - 0010 2 - ... ... - 1110 E - 1111 F - - The I/O address is in the form ab0. For example, if - a is 0x2 and b is 0xE, the address will be 0x2E0. - - DO NOT SET THIS LESS THAN 0x200!!!!! - - - m: The first digit of the memory address. - Setting Value - ------- ----- - 0000 0 - 0001 1 - 0010 2 - ... ... - 1110 E - 1111 F - - The memory address is in the form m0000. For example, if - m is D, the address will be 0xD0000. - - DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000! - - 1 2 3 4 5 6 7 8 - S2 /--------------------------\ -(Station Address) | 1 1 0 0 0 0 0 0 | - \--------------------------/ - - Setting Value - ------- ----- - 00000000 00 - 10000000 01 - 01000000 02 - ... - 01111111 FE - 11111111 FF - - Note that this is binary with the digits reversed! - - DO NOT SET THIS TO 0 OR 255 (0xFF)! - - -***************************************************************************** - -** Standard Microsystems Corp (SMC) ** -PC130E/PC270E (8-bit cards) ---------------------------- - - from Juergen Seifert - - -STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET(R)-PC130E/PC270E -=============================================================== - -This description has been written by Juergen Seifert -using information from the following Original SMC Manual - - "Configuration Guide for - ARCNET(R)-PC130E/PC270 - Network Controller Boards - Pub. 
# 900.044A - June, 1989" - -ARCNET is a registered trademark of the Datapoint Corporation -SMC is a registered trademark of the Standard Microsystems Corporation - -The PC130E is an enhanced version of the PC130 board, is equipped with a -standard BNC female connector for connection to RG-62/U coax cable. -Since this board is designed both for point-to-point connection in star -networks and for connection to bus networks, it is downwardly compatible -with all the other standard boards designed for coax networks (that is, -the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and -PC200 bus topology boards). - -The PC270E is an enhanced version of the PC260 board, is equipped with two -modular RJ11-type jacks for connection to twisted pair wiring. -It can be used in a star or a daisy-chained network. - - - 8 7 6 5 4 3 2 1 - ________________________________________________________________ - | | S1 | | - | |_________________| | - | Offs|Base |I/O Addr | - | RAM Addr | ___| - | ___ ___ CR3 |___| - | | \/ | CR4 |___| - | | PROM | ___| - | | | N | | 8 - | | SOCKET | o | | 7 - | |________| d | | 6 - | ___________________ e | | 5 - | | | A | S | 4 - | |oo| EXT2 | | d | 2 | 3 - | |oo| EXT1 | SMC | d | | 2 - | |oo| ROM | 90C63 | r |___| 1 - | |oo| IRQ7 | | |o| _____| - | |oo| IRQ5 | | |o| | J1 | - | |oo| IRQ4 | | STAR |_____| - | |oo| IRQ3 | | | J2 | - | |oo| IRQ2 |___________________| |_____| - |___ ______________| - | | - |_____________________________________________| - -Legend: - -SMC 90C63 ARCNET Controller / Transceiver /Logic -S1 1-3: I/O Base Address Select - 4-6: Memory Base Address Select - 7-8: RAM Offset Select -S2 1-8: Node ID Select -EXT Extended Timeout Select -ROM ROM Enable Select -STAR Selected - Star Topology (PC130E only) - Deselected - Bus Topology (PC130E only) -CR3/CR4 Diagnostic LEDs -J1 BNC RG62/U Connector (PC130E only) -J1 6-position Telephone Jack (PC270E only) -J2 6-position Telephone Jack (PC270E only) - -Setting one of the 
switches to Off/Open means "1", On/Closed means "0". - - -Setting the Node ID -------------------- - -The eight switches in group S2 are used to set the node ID. -These switches work in a way similar to the PC100-series cards; see that -entry for more information. - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group S1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 1 2 3 | Address - -------|-------- - 0 0 0 | 260 - 0 0 1 | 290 - 0 1 0 | 2E0 (Manufacturer's default) - 0 1 1 | 2F0 - 1 0 0 | 300 - 1 0 1 | 350 - 1 1 0 | 380 - 1 1 1 | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 4-6 of switch group S1 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 7 and 8 of group S1. 
- - Switch | Hex RAM | Hex ROM - 4 5 6 7 8 | Address | Address *) - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - -*) To enable the 8K Boot PROM install the jumper ROM. - The default is jumper ROM not installed. - - -Setting the Timeouts and Interrupt ----------------------------------- - -The jumpers labeled EXT1 and EXT2 are used to determine the timeout -parameters. These two jumpers are normally left open. - -To select a hardware interrupt level set one (only one!) of the jumpers -IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2. - - -Configuring the PC130E for Star or Bus Topology ------------------------------------------------ - -The single jumper labeled STAR is used to configure the PC130E board for -star or bus topology. -When the jumper is installed, the board may be used in a star network, when -it is removed, the board can be used in a bus topology. - - -Diagnostic LEDs ---------------- - -Two diagnostic LEDs are visible on the rear bracket of the board. 
-The green LED monitors the network activity: the red one shows the -board activity: - - Green | Status Red | Status - -------|------------------- ---------|------------------- - on | normal activity flash/on | data transfer - blink | reconfiguration off | no data transfer; - off | defective board or | incorrect memory or - | node ID is zero | I/O address - - -***************************************************************************** - -** Standard Microsystems Corp (SMC) ** -PC500/PC550 Longboard (16-bit cards) -------------------------------------- - - from Juergen Seifert - - -STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET-PC500/PC550 Long Board -===================================================================== - -Note: There is another Version of the PC500 called Short Version, which - is different in hard- and software! The most important differences - are: - - The long board has no Shared memory. - - On the long board the selection of the interrupt is done by binary - coded switch, on the short board directly by jumper. - -[Avery's note: pay special attention to that: the long board HAS NO SHARED -MEMORY. This means the current Linux-ARCnet driver can't use these cards. -I have obtained a PC500Longboard and will be doing some experiments on it in -the future, but don't hold your breath. Thanks again to Juergen Seifert for -his advice about this!] - -This description has been written by Juergen Seifert -using information from the following Original SMC Manual - - "Configuration Guide for - SMC ARCNET-PC500/PC550 - Series Network Controller Boards - Pub. # 900.033 Rev. A - November, 1989" - -ARCNET is a registered trademark of the Datapoint Corporation -SMC is a registered trademark of the Standard Microsystems Corporation - -The PC500 is equipped with a standard BNC female connector for connection -to RG-62/U coax cable. -The board is designed both for point-to-point connection in star networks -and for connection to bus networks. 
- -The PC550 is equipped with two modular RJ11-type jacks for connection -to twisted pair wiring. -It can be used in a star or a daisy-chained (BUS) network. - - 1 - 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1 - ____________________________________________________________________ - < | SW1 | | SW2 | | - > |_____________________| |_____________| | - < IRQ |I/O Addr | - > ___| - < CR4 |___| - > CR3 |___| - < ___| - > N | | 8 - < o | | 7 - > d | S | 6 - < e | W | 5 - > A | 3 | 4 - < d | | 3 - > d | | 2 - < r |___| 1 - > |o| _____| - < |o| | J1 | - > 3 1 JP6 |_____| - < |o|o| JP2 | J2 | - > |o|o| |_____| - < 4 2__ ______________| - > | | | - <____| |_____________________________________________| - -Legend: - -SW1 1-6: I/O Base Address Select - 7-10: Interrupt Select -SW2 1-6: Reserved for Future Use -SW3 1-8: Node ID Select -JP2 1-4: Extended Timeout Select -JP6 Selected - Star Topology (PC500 only) - Deselected - Bus Topology (PC500 only) -CR3 Green Monitors Network Activity -CR4 Red Monitors Board Activity -J1 BNC RG62/U Connector (PC500 only) -J1 6-position Telephone Jack (PC550 only) -J2 6-position Telephone Jack (PC550 only) - -Setting one of the switches to Off/Open means "1", On/Closed means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW3 are used to set the node ID. Each node -attached to the network must have an unique node ID which must be -different from 0. -Switch 1 serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . 
| | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first six switches in switch group SW1 are used to select one -of 32 possible I/O Base addresses using the following table - - Switch | Hex I/O - 6 5 4 3 2 1 | Address - -------------|-------- - 0 1 0 0 0 0 | 200 - 0 1 0 0 0 1 | 210 - 0 1 0 0 1 0 | 220 - 0 1 0 0 1 1 | 230 - 0 1 0 1 0 0 | 240 - 0 1 0 1 0 1 | 250 - 0 1 0 1 1 0 | 260 - 0 1 0 1 1 1 | 270 - 0 1 1 0 0 0 | 280 - 0 1 1 0 0 1 | 290 - 0 1 1 0 1 0 | 2A0 - 0 1 1 0 1 1 | 2B0 - 0 1 1 1 0 0 | 2C0 - 0 1 1 1 0 1 | 2D0 - 0 1 1 1 1 0 | 2E0 (Manufacturer's default) - 0 1 1 1 1 1 | 2F0 - 1 1 0 0 0 0 | 300 - 1 1 0 0 0 1 | 310 - 1 1 0 0 1 0 | 320 - 1 1 0 0 1 1 | 330 - 1 1 0 1 0 0 | 340 - 1 1 0 1 0 1 | 350 - 1 1 0 1 1 0 | 360 - 1 1 0 1 1 1 | 370 - 1 1 1 0 0 0 | 380 - 1 1 1 0 0 1 | 390 - 1 1 1 0 1 0 | 3A0 - 1 1 1 0 1 1 | 3B0 - 1 1 1 1 0 0 | 3C0 - 1 1 1 1 0 1 | 3D0 - 1 1 1 1 1 0 | 3E0 - 1 1 1 1 1 1 | 3F0 - - -Setting the Interrupt ---------------------- - -Switches seven through ten of switch group SW1 are used to select the -interrupt level. The interrupt level is binary coded, so selections -from 0 to 15 would be possible, but only the following eight values will -be supported: 3, 4, 5, 7, 9, 10, 11, 12. - - Switch | IRQ - 10 9 8 7 | - ---------|-------- - 0 0 1 1 | 3 - 0 1 0 0 | 4 - 0 1 0 1 | 5 - 0 1 1 1 | 7 - 1 0 0 1 | 9 (=2) (default) - 1 0 1 0 | 10 - 1 0 1 1 | 11 - 1 1 0 0 | 12 - - -Setting the Timeouts --------------------- - -The two jumpers JP2 (1-4) are used to determine the timeout parameters. -These two jumpers are normally left open. -Refer to the COM9026 Data Sheet for alternate configurations. - - -Configuring the PC500 for Star or Bus Topology ----------------------------------------------- - -The single jumper labeled JP6 is used to configure the PC500 board for -star or bus topology. 
-When the jumper is installed, the board may be used in a star network, when -it is removed, the board can be used in a bus topology. - - -Diagnostic LEDs ---------------- - -Two diagnostic LEDs are visible on the rear bracket of the board. -The green LED monitors the network activity: the red one shows the -board activity: - - Green | Status Red | Status - -------|------------------- ---------|------------------- - on | normal activity flash/on | data transfer - blink | reconfiguration off | no data transfer; - off | defective board or | incorrect memory or - | node ID is zero | I/O address - - -***************************************************************************** - -** SMC ** -PC710 (8-bit card) ------------------- - - from J.S. van Oosten - -Note: this data is gathered by experimenting and looking at info of other -cards. However, I'm sure I got 99% of the settings right. - -The SMC710 card resembles the PC270 card, but is much more basic (i.e. no -LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing: - - _______________________________________ - | +---------+ +---------+ |____ - | | S2 | | S1 | | - | +---------+ +---------+ | - | | - | +===+ __ | - | | R | | | X-tal ###___ - | | O | |__| ####__'| - | | M | || ### - | +===+ | - | | - | .. JP1 +----------+ | - | .. | big chip | | - | .. | 90C63 | | - | .. | | | - | .. +----------+ | - ------- ----------- - ||||||||||||||||||||| - -The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes -labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM, -IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) ) - -S1 and S2 perform the same function as on the PC270, only their numbers -are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address). - -I know it works when connected to a PC110 type ARCnet board. 
- - -***************************************************************************** - -** Possibly SMC ** -LCS-8830(-T) (8 and 16-bit cards) ---------------------------------- - - from Mathias Katzer - - Marek Michalkiewicz says the - LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS - only (the JP0 jumper is hardwired), and BNC only. - -This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC, -nowhere else, not even on the few Xeroxed sheets from the manual). - -SMC ARCnet Board Type LCS-8830-T - - ------------------------------------ - | | - | JP3 88 8 JP2 | - | ##### | \ | - | ##### ET1 ET2 ###| - | 8 ###| - | U3 SW 1 JP0 ###| Phone Jacks - | -- ###| - | | | | - | | | SW2 | - | | | | - | | | ##### | - | -- ##### #### BNC Connector - | #### - | 888888 JP1 | - | 234567 | - -- ------- - ||||||||||||||||||||||||||| - -------------------------- - - -SW1: DIP-Switches for Station Address -SW2: DIP-Switches for Memory Base and I/O Base addresses - -JP0: If closed, internal termination on (default open) -JP1: IRQ Jumpers -JP2: Boot-ROM enabled if closed -JP3: Jumpers for response timeout - -U3: Boot-ROM Socket - - -ET1 ET2 Response Time Idle Time Reconfiguration Time - - 78 86 840 - X 285 316 1680 - X 563 624 1680 - X X 1130 1237 1680 - -(X means closed jumper) - -(DIP-Switch downwards means "0") - -The station address is binary-coded with SW1. 
-
-The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2:
-
-Switches        Base
-678             Address
-000             260-26f
-100             290-29f
-010             2e0-2ef
-110             2f0-2ff
-001             300-30f
-101             350-35f
-011             380-38f
-111             3e0-3ef
-
-
-DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range:
-
-Switches        RAM             ROM
-12345           Address Range   Address Range
-00000           C:0000-C:07ff   C:2000-C:3fff
-10000           C:0800-C:0fff
-01000           C:1000-C:17ff
-11000           C:1800-C:1fff
-00100           C:4000-C:47ff   C:6000-C:7fff
-10100           C:4800-C:4fff
-01100           C:5000-C:57ff
-11100           C:5800-C:5fff
-00010           C:C000-C:C7ff   C:E000-C:ffff
-10010           C:C800-C:Cfff
-01010           C:D000-C:D7ff
-11010           C:D800-C:Dfff
-00110           D:0000-D:07ff   D:2000-D:3fff
-10110           D:0800-D:0fff
-01110           D:1000-D:17ff
-11110           D:1800-D:1fff
-00001           D:4000-D:47ff   D:6000-D:7fff
-10001           D:4800-D:4fff
-01001           D:5000-D:57ff
-11001           D:5800-D:5fff
-00101           D:8000-D:87ff   D:A000-D:bfff
-10101           D:8800-D:8fff
-01101           D:9000-D:97ff
-11101           D:9800-D:9fff
-00011           D:C000-D:c7ff   D:E000-D:ffff
-10011           D:C800-D:cfff
-01011           D:D000-D:d7ff
-11011           D:D800-D:dfff
-00111           E:0000-E:07ff   E:2000-E:3fff
-10111           E:0800-E:0fff
-01111           E:1000-E:17ff
-11111           E:1800-E:1fff
-
-
-*****************************************************************************
-
-** PureData Corp **
-PDI507 (8-bit card)
--------------------
-
-  from Mark Rejhon (slight modifications by Avery)
-
-  Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards)
-  are mostly the same as this.  PDI508Plus cards appear to be mainly
-  software-configured.
-
-Jumpers:
-        There is a jumper array at the bottom of the card, near the edge
-        connector.  This array is labelled J1.  They control the IRQs and
-        something else.  Put only one jumper on the IRQ pins.
-
-        ETS1, ETS2 are for timing on very long distance networks.  See the
-        more general information near the top of this file.
-
-        There is a J2 jumper on two pins.  A jumper should be put on them,
-        since it was already there when I got the card.  I don't know what
-        this jumper is for though.
- - There is a two-jumper array for J3. I don't know what it is for, - but there were already two jumpers on it when I got the card. It's - a six pin grid in a two-by-three fashion. The jumpers were - configured as follows: - - .-------. - o | o o | - :-------: ------> Accessible end of card with connectors - o | o o | in this direction -------> - `-------' - -Carl de Billy explains J3 and J4: - - J3 Diagram: - - .-------. - o | o o | - :-------: TWIST Technology - o | o o | - `-------' - .-------. - | o o | o - :-------: COAX Technology - | o o | o - `-------' - - - If using coax cable in a bus topology the J4 jumper must be removed; - place it on one pin. - - - If using bus topology with twisted pair wiring move the J3 - jumpers so they connect the middle pin and the pins closest to the RJ11 - Connectors. Also the J4 jumper must be removed; place it on one pin of - J4 jumper for storage. - - - If using star topology with twisted pair wiring move the J3 - jumpers so they connect the middle pin and the pins closest to the RJ11 - connectors. - - -DIP Switches: - - The DIP switches accessible on the accessible end of the card while - it is installed, is used to set the ARCnet address. There are 8 - switches. Use an address from 1 to 254. - - Switch No. - 12345678 ARCnet address - ----------------------------------------- - 00000000 FF (Don't use this!) - 00000001 FE - 00000010 FD - .... - 11111101 2 - 11111110 1 - 11111111 0 (Don't use this!) - - There is another array of eight DIP switches at the top of the - card. There are five labelled MS0-MS4 which seem to control the - memory address, and another three labelled IO0-IO2 which seem to - control the base I/O address of the card. - - This was difficult to test by trial and error, and the I/O addresses - are in a weird order. This was tested by setting the DIP switches, - rebooting the computer, and attempting to load ARCETHER at various - addresses (mostly between 0x200 and 0x400). 
The address that caused - the red transmit LED to blink, is the one that I thought works. - - Also, the address 0x3D0 seem to have a special meaning, since the - ARCETHER packet driver loaded fine, but without the red LED - blinking. I don't know what 0x3D0 is for though. I recommend using - an address of 0x300 since Windows may not like addresses below - 0x300. - - IO Switch No. - 210 I/O address - ------------------------------- - 111 0x260 - 110 0x290 - 101 0x2E0 - 100 0x2F0 - 011 0x300 - 010 0x350 - 001 0x380 - 000 0x3E0 - - The memory switches set a reserved address space of 0x1000 bytes - (0x100 segment units, or 4k). For example if I set an address of - 0xD000, it will use up addresses 0xD000 to 0xD100. - - The memory switches were tested by booting using QEMM386 stealth, - and using LOADHI to see what address automatically became excluded - from the upper memory regions, and then attempting to load ARCETHER - using these addresses. - - I recommend using an ARCnet memory address of 0xD000, and putting - the EMS page frame at 0xC000 while using QEMM stealth mode. That - way, you get contiguous high memory from 0xD100 almost all the way - the end of the megabyte. - - Memory Switch 0 (MS0) didn't seem to work properly when set to OFF - on my card. It could be malfunctioning on my card. Experiment with - it ON first, and if it doesn't work, set it to OFF. (It may be a - modifier for the 0x200 bit?) - - MS Switch No. 
- 43210 Memory address - -------------------------------- - 00001 0xE100 (guessed - was not detected by QEMM) - 00011 0xE000 (guessed - was not detected by QEMM) - 00101 0xDD00 - 00111 0xDC00 - 01001 0xD900 - 01011 0xD800 - 01101 0xD500 - 01111 0xD400 - 10001 0xD100 - 10011 0xD000 - 10101 0xCD00 - 10111 0xCC00 - 11001 0xC900 (guessed - crashes tested system) - 11011 0xC800 (guessed - crashes tested system) - 11101 0xC500 (guessed - crashes tested system) - 11111 0xC400 (guessed - crashes tested system) - - -***************************************************************************** - -** CNet Technology Inc. ** -120 Series (8-bit cards) ------------------------- - - from Juergen Seifert - - -CNET TECHNOLOGY INC. (CNet) ARCNET 120A SERIES -============================================== - -This description has been written by Juergen Seifert -using information from the following Original CNet Manual - - "ARCNET - USER'S MANUAL - for - CN120A - CN120AB - CN120TP - CN120ST - CN120SBT - P/N:12-01-0007 - Revision 3.00" - -ARCNET is a registered trademark of the Datapoint Corporation - -P/N 120A ARCNET 8 bit XT/AT Star -P/N 120AB ARCNET 8 bit XT/AT Bus -P/N 120TP ARCNET 8 bit XT/AT Twisted Pair -P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair -P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair - - __________________________________________________________________ - | | - | ___| - | LED |___| - | ___| - | N | | ID7 - | o | | ID6 - | d | S | ID5 - | e | W | ID4 - | ___________________ A | 2 | ID3 - | | | d | | ID2 - | | | 1 2 3 4 5 6 7 8 d | | ID1 - | | | _________________ r |___| ID0 - | | 90C65 || SW1 | ____| - | JP 8 7 | ||_________________| | | - | |o|o| JP1 | | | J2 | - | |o|o| |oo| | | JP 1 1 1 | | - | ______________ | | 0 1 2 |____| - | | PROM | |___________________| |o|o|o| _____| - | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 | - | |______________| |o|o|o|o|o| |o|o|o| |_____| - |_____ |o|o|o|o|o| ______________| - | | - 
|_____________________________________________| - -Legend: - -90C65 ARCNET Probe -S1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -S2 1-8: Node ID Select (ID0-ID7) -JP1 ROM Enable Select -JP2 IRQ2 -JP3 IRQ3 -JP4 IRQ4 -JP5 IRQ5 -JP6 IRQ7 -JP7/JP8 ET1, ET2 Timeout Parameters -JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only) -JP12 Terminator Select (CN120AB/ST/SBT only) -J1 BNC RG62/U Connector (all except CN120TP) -J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only) - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must be different from 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . 
| | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 8K or memory base + 0x2000. -Switches 1-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM install the jumper JP1 - -Note: Since the switches 1 and 2 are always set to ON it may be possible - that they can be used to add an offset of 2K, 4K or 6K to the base - address, but this feature is not documented in the manual and I - haven't tested it yet. - - -Setting the Interrupt Line --------------------------- - -To select a hardware interrupt level install one (only one!) of the jumpers -JP2, JP3, JP4, JP5, JP6. JP2 is the default. - - Jumper | IRQ - -------|----- - 2 | 2 - 3 | 3 - 4 | 4 - 5 | 5 - 6 | 7 - - -Setting the Internal Terminator on CN120AB/TP/SBT --------------------------------------------------- - -The jumper JP12 is used to enable the internal terminator. 
- - ----- - 0 | 0 | - ----- ON | | ON - | 0 | | 0 | - | | OFF ----- OFF - | 0 | 0 - ----- - Terminator Terminator - disabled enabled - - -Selecting the Connector Type on CN120ST/SBT -------------------------------------------- - - JP10 JP11 JP10 JP11 - ----- ----- - 0 0 | 0 | | 0 | - ----- ----- | | | | - | 0 | | 0 | | 0 | | 0 | - | | | | ----- ----- - | 0 | | 0 | 0 0 - ----- ----- - Coaxial Cable Twisted Pair Cable - (Default) - - -Setting the Timeout Parameters ------------------------------- - -The jumpers labeled EXT1 and EXT2 are used to determine the timeout -parameters. These two jumpers are normally left open. - - - -***************************************************************************** - -** CNet Technology Inc. ** -160 Series (16-bit cards) -------------------------- - - from Juergen Seifert - -CNET TECHNOLOGY INC. (CNet) ARCNET 160A SERIES -============================================== - -This description has been written by Juergen Seifert -using information from the following Original CNet Manual - - "ARCNET - USER'S MANUAL - for - CN160A - CN160AB - CN160TP - P/N:12-01-0006 - Revision 3.00" - -ARCNET is a registered trademark of the Datapoint Corporation - -P/N 160A ARCNET 16 bit XT/AT Star -P/N 160AB ARCNET 16 bit XT/AT Bus -P/N 160TP ARCNET 16 bit XT/AT Twisted Pair - - ___________________________________________________________________ - < _________________________ ___| - > |oo| JP2 | | LED |___| - < |oo| JP1 | 9026 | LED |___| - > |_________________________| ___| - < N | | ID7 - > 1 o | | ID6 - < 1 2 3 4 5 6 7 8 9 0 d | S | ID5 - > _______________ _____________________ e | W | ID4 - < | PROM | | SW1 | A | 2 | ID3 - > > SOCKET | |_____________________| d | | ID2 - < |_______________| | IO-Base | MEM | d | | ID1 - > r |___| ID0 - < ____| - > | | - < | J1 | - > | | - < |____| - > 1 1 1 1 | - < 3 4 5 6 7 JP 8 9 0 1 2 3 | - > |o|o|o|o|o| |o|o|o|o|o|o| | - < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________| - > | | | - <____________| 
|_______________________________________| - -Legend: - -9026 ARCNET Probe -SW1 1-6: Base I/O Address Select - 7-10: Base Memory Address Select -SW2 1-8: Node ID Select (ID0-ID7) -JP1/JP2 ET1, ET2 Timeout Parameters -JP3-JP13 Interrupt Select -J1 BNC RG62/U Connector (CN160A/AB only) -J1 Two 6-position Telephone Jack (CN160TP only) -LED - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must be different from 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first six switches in switch block SW1 are used to select the I/O Base -address using the following table: - - Switch | Hex I/O - 1 2 3 4 5 6 | Address - ------------------------|-------- - OFF ON ON OFF OFF ON | 260 - OFF ON OFF ON ON OFF | 290 - OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default) - OFF ON OFF OFF OFF OFF | 2F0 - OFF OFF ON ON ON ON | 300 - OFF OFF ON OFF ON OFF | 350 - OFF OFF OFF ON ON ON | 380 - OFF OFF OFF OFF OFF ON | 3E0 - -Note: Other IO-Base addresses seem to be selectable, but only the above - combinations are documented. 
-
-
-Setting the Base Memory (RAM) buffer Address
---------------------------------------------
-
-The switches 7-10 of switch block SW1 are used to select the Memory
-Base address of the RAM (2K) and the PROM.
-
-   Switch          | Hex RAM | Hex ROM
-    7   8   9  10  | Address | Address
-   ----------------|---------|-----------
-   OFF OFF  ON  ON |  C0000  |  C8000
-   OFF OFF  ON OFF |  D0000  |  D8000  (Default)
-   OFF OFF OFF  ON |  E0000  |  E8000
-
-Note: Other MEM-Base addresses seem to be selectable, but only the above
-      combinations are documented.
-
-
-Setting the Interrupt Line
---------------------------
-
-To select a hardware interrupt level install one (only one!) of the jumpers
-JP3 through JP13 using the following table:
-
-   Jumper | IRQ
-   -------|-----------------
-     3    |  14
-     4    |  15
-     5    |  12
-     6    |  11
-     7    |  10
-     8    |   3
-     9    |   4
-    10    |   5
-    11    |   6
-    12    |   7
-    13    |   2 (=9) Default!
-
-Note:  - Do not use JP11=IRQ6, it may conflict with your Floppy Disk
-         Controller
-       - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL-
-         Hard Disk, it may conflict with their controllers
-
-
-Setting the Timeout Parameters
-------------------------------
-
-The jumpers labeled JP1 and JP2 are used to determine the timeout
-parameters. These two jumpers are normally left open.
-
-
-*****************************************************************************
-
-** Lantech **
-8-bit card, unknown model
--------------------------
-  from Vlad Lungu - his e-mail address seemed broken at
-  the time I tried to reach him.  Sorry Vlad, if you didn't get my reply.
- - ________________________________________________________________ - | 1 8 | - | ___________ __| - | | SW1 | LED |__| - | |__________| | - | ___| - | _____________________ |S | 8 - | | | |W | - | | | |2 | - | | | |__| 1 - | | UM9065L | |o| JP4 ____|____ - | | | |o| | CN | - | | | |________| - | | | | - | |___________________| | - | | - | | - | _____________ | - | | | | - | | PROM | |ooooo| JP6 | - | |____________| |ooooo| | - |_____________ _ _| - |____________________________________________| |__| - - -UM9065L : ARCnet Controller - -SW 1 : Shared Memory Address and I/O Base - - ON=0 - - 12345|Memory Address - -----|-------------- - 00001| D4000 - 00010| CC000 - 00110| D0000 - 01110| D1000 - 01101| D9000 - 10010| CC800 - 10011| DC800 - 11110| D1800 - -It seems that the bits are considered in reverse order. Also, you must -observe that some of those addresses are unusual and I didn't probe them; I -used a memory dump in DOS to identify them. For the 00000 configuration and -some others that I didn't write here the card seems to conflict with the -video card (an S3 GENDAC). I leave the full decoding of those addresses to -you. - - 678| I/O Address - ---|------------ - 000| 260 - 001| failed probe - 010| 2E0 - 011| 380 - 100| 290 - 101| 350 - 110| failed probe - 111| 3E0 - -SW 2 : Node ID (binary coded) - -JP 4 : Boot PROM enable CLOSE - enabled - OPEN - disabled - -JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6) - - -***************************************************************************** - -** Acer ** -8-bit card, Model 5210-003 --------------------------- - - from Vojtech Pavlik using portions of the existing - arcnet-hardware file. - -This is a 90C26 based card. Its configuration seems similar to the SMC -PC100, but has some additional jumpers I don't know the meaning of. 
- - __ - | | - ___________|__|_________________________ - | | | | - | | BNC | | - | |______| ___| - | _____________________ |___ - | | | | - | | Hybrid IC | | - | | | o|o J1 | - | |_____________________| 8|8 | - | 8|8 J5 | - | o|o | - | 8|8 | - |__ 8|8 | - (|__| LED o|o | - | 8|8 | - | 8|8 J15 | - | | - | _____ | - | | | _____ | - | | | | | ___| - | | | | | | - | _____ | ROM | | UFS | | - | | | | | | | | - | | | ___ | | | | | - | | | | | |__.__| |__.__| | - | | NCR | |XTL| _____ _____ | - | | | |___| | | | | | - | |90C26| | | | | | - | | | | RAM | | UFS | | - | | | J17 o|o | | | | | - | | | J16 o|o | | | | | - | |__.__| |__.__| |__.__| | - | ___ | - | | |8 | - | |SW2| | - | | | | - | |___|1 | - | ___ | - | | |10 J18 o|o | - | | | o|o | - | |SW1| o|o | - | | | J21 o|o | - | |___|1 | - | | - |____________________________________| - - -Legend: - -90C26 ARCNET Chip -XTL 20 MHz Crystal -SW1 1-6 Base I/O Address Select - 7-10 Memory Address Select -SW2 1-8 Node ID Select (ID0-ID7) -J1-J5 IRQ Select -J6-J21 Unknown (Probably extra timeouts & ROM enable ...) -LED1 Activity LED -BNC Coax connector (STAR ARCnet) -RAM 2k of SRAM -ROM Boot ROM socket -UFS Unidentified Flying Sockets - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -Setting one of the switches to OFF means "1", ON means "0". - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Don't set this to 0 or 255; these values are reserved. 
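The node ID rule described above (OFF means "1", switch 1 is the least significant bit, and the ID is the sum of the values of all switches set to "1") can be sketched in a few lines of Python. This is only an illustration and not part of the removed file or of any driver: the function name `node_id_from_off_switches` is hypothetical, and rejecting 0 and 255 simply follows the note for this card that both values are reserved.

```python
def node_id_from_off_switches(off_switches):
    """Return the node ID for the SW2 switch numbers (1-8) that are OFF.

    Assumption taken from the text: OFF means "1", ON means "0", and
    switch 1 carries the value 1 (LSB) while switch 8 carries 128.
    """
    node_id = 0
    for switch in off_switches:
        if not 1 <= switch <= 8:
            raise ValueError(f"SW2 has no switch {switch}")
        # Switch n contributes 2**(n-1): 1, 2, 4, ..., 128.
        node_id |= 1 << (switch - 1)
    # The text says 0 and 255 are reserved on this card.
    if node_id in (0, 255):
        raise ValueError(f"node ID {node_id} is reserved")
    return node_id


if __name__ == "__main__":
    # Switches 1, 3, 5 and 7 OFF: 1 + 4 + 16 + 64
    print(hex(node_id_from_off_switches({1, 3, 5, 7})))
```

Passing the switch numbers as a set keeps the call close to how one would read the DIP block off the board: list the switches that are OFF and let the function add up their values.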
- - -Setting the I/O Base Address ----------------------------- - -The switches 1 to 6 of switch block SW1 are used to select one -of 32 possible I/O Base addresses using the following tables - - | Hex - Switch | Value - -------|------- - 1 | 200 - 2 | 100 - 3 | 80 - 4 | 40 - 5 | 20 - 6 | 10 - -The I/O address is sum of all switches set to "1". Remember that -the I/O address space bellow 0x200 is RESERVED for mainboard, so -switch 1 should be ALWAYS SET TO OFF. - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of sixteen positions. However, the addresses below -A0000 are likely to cause system hang because there's main RAM. - -Jumpers 7-10 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM - 7 8 9 10 | Address - ----------------|--------- - OFF OFF OFF OFF | F0000 (conflicts with main BIOS) - OFF OFF OFF ON | E0000 - OFF OFF ON OFF | D0000 - OFF OFF ON ON | C0000 (conflicts with video BIOS) - OFF ON OFF OFF | B0000 (conflicts with mono video) - OFF ON OFF ON | A0000 (conflicts with graphics) - - -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means -shorted, OFF means open. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 7 - OFF ON OFF OFF OFF | 5 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 3 - OFF OFF OFF OFF ON | 2 - - -Unknown jumpers & sockets -------------------------- - -I know nothing about these. I just guess that J16&J17 are timeout -jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and -J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't -guess the purpose. - - -***************************************************************************** - -** Datapoint? 
** -LAN-ARC-8, an 8-bit card ------------------------- - - from Vojtech Pavlik - -This is another SMC 90C65-based ARCnet card. I couldn't identify the -manufacturer, but it might be DataPoint, because the card has the -original arcNet logo in its upper right corner. - - _______________________________________________________ - | _________ | - | | SW2 | ON arcNet | - | |_________| OFF ___| - | _____________ 1 ______ 8 | | 8 - | | | SW1 | XTAL | ____________ | S | - | > RAM (2k) | |______|| | | W | - | |_____________| | H | | 3 | - | _________|_____ y | |___| 1 - | _________ | | |b | | - | |_________| | | |r | | - | | SMC | |i | | - | | 90C65| |d | | - | _________ | | | | | - | | SW1 | ON | | |I | | - | |_________| OFF |_________|_____/C | _____| - | 1 8 | | | |___ - | ______________ | | | BNC |___| - | | | |____________| |_____| - | > EPROM SOCKET | _____________ | - | |______________| |_____________| | - | ______________| - | | - |________________________________________| - -Legend: - -90C65 ARCNET Chip -SW1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -SW2 1-8: Node ID Select -SW3 1-5: IRQ Select - 6-7: Extra Timeout - 8 : ROM Enable -BNC Coax connector -XTAL 20 MHz Crystal - - -Setting the Node ID -------------------- - -The eight switches in SW3 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". 
- -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON. - -The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address. - - -Setting the Interrupt Line --------------------------- - -Switches 1-5 of the switch block SW3 control the IRQ level. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 3 - OFF ON OFF OFF OFF | 4 - OFF OFF ON OFF OFF | 5 - OFF OFF OFF ON OFF | 7 - OFF OFF OFF OFF ON | 2 - - -Setting the Timeout Parameters ------------------------------- - -The switches 6-7 of the switch block SW3 are used to determine the timeout -parameters. 
These two switches are normally left in the OFF position. - - -***************************************************************************** - -** Topware ** -8-bit card, TA-ARC/10 -------------------------- - - from Vojtech Pavlik - -This is another very similar 90C65 card. Most of the switches and jumpers -are the same as on other clones. - - _____________________________________________________________________ -| ___________ | | ______ | -| |SW2 NODE ID| | | | XTAL | | -| |___________| | Hybrid IC | |______| | -| ___________ | | __| -| |SW1 MEM+I/O| |_________________________| LED1|__|) -| |___________| 1 2 | -| J3 |o|o| TIMEOUT ______| -| ______________ |o|o| | | -| | | ___________________ | RJ | -| > EPROM SOCKET | | \ |------| -|J2 |______________| | | | | -||o| | | |______| -||o| ROM ENABLE | SMC | _________ | -| _____________ | 90C65 | |_________| _____| -| | | | | | |___ -| > RAM (2k) | | | | BNC |___| -| |_____________| | | |_____| -| |____________________| | -| ________ IRQ 2 3 4 5 7 ___________ | -||________| |o|o|o|o|o| |___________| | -|________ J1|o|o|o|o|o| ______________| - | | - |_____________________________________________| - -Legend: - -90C65 ARCNET Chip -XTAL 20 MHz Crystal -SW1 1-5 Base Memory Address Select - 6-8 Base I/O Address Select -SW2 1-8 Node ID Select (ID0-ID7) -J1 IRQ Select -J2 ROM Enable -J3 Extra Timeout -LED1 Activity LED -BNC Coax connector (BUS ARCnet) -RJ Twisted Pair Connector (daisy chain) - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached to -the network must have an unique node ID which must not be 0. Switch 1 (ID0) -serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". 
- -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table: - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 (Manufacturer's default) - OFF ON ON | 290 - ON OFF ON | 2E0 - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default) - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM short the jumper J2. - -The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address. - - -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means -shorted, OFF means open. 
- - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 2 - OFF ON OFF OFF OFF | 3 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 5 - OFF OFF OFF OFF ON | 7 - - -Setting the Timeout Parameters ------------------------------- - -The jumpers J3 are used to set the timeout parameters. These two -jumpers are normally left open. - - -***************************************************************************** - -** Thomas-Conrad ** -Model #500-6242-0097 REV A (8-bit card) ---------------------------------------- - - from Lars Karlsson <100617.3473@compuserve.com> - - ________________________________________________________ - | ________ ________ |_____ - | |........| |........| | - | |________| |________| ___| - | SW 3 SW 1 | | - | Base I/O Base Addr. Station | | - | address | | - | ______ switch | | - | | | | | - | | | |___| - | | | ______ |___._ - | |______| |______| ____| BNC - | Jumper- _____| Connector - | Main chip block _ __| ' - | | | | RJ Connector - | |_| | with 110 Ohm - | |__ Terminator - | ___________ __| - | |...........| | RJ-jack - | |...........| _____ | (unused) - | |___________| |_____| |__ - | Boot PROM socket IRQ-jumpers |_ Diagnostic - |________ __ _| LED (red) - | | | | | | | | | | | | | | | | | | | | | | - | | | | | | | | | | | | | | | | | | | | |________| - | - | - -And here are the settings for some of the switches and jumpers on the cards. - - - I/O - - 1 2 3 4 5 6 7 8 - -2E0----- 0 0 0 1 0 0 0 1 -2F0----- 0 0 0 1 0 0 0 0 -300----- 0 0 0 0 1 1 1 1 -350----- 0 0 0 0 1 1 1 0 - -"0" in the above example means switch is off "1" means that it is on. - - - ShMem address. - - 1 2 3 4 5 6 7 8 - -CX00--0 0 1 1 | | | -DX00--0 0 1 0 | -X000--------- 1 1 | -X400--------- 1 0 | -X800--------- 0 1 | -XC00--------- 0 0 -ENHANCED----------- 1 -COMPATIBLE--------- 0 - - - IRQ - - - 3 4 5 7 2 - . . . . . - . . . . . - - -There is a DIP-switch with 8 switches, used to set the shared memory address -to be used. 
The first 6 switches set the address, the 7th doesn't have any -function, and the 8th switch is used to select "compatible" or "enhanced". -When I got my two cards, one of them had this switch set to "enhanced". That -card didn't work at all, it wasn't even recognized by the driver. The other -card had this switch set to "compatible" and it behaved absolutely normally. I -guess that the switch on one of the cards, must have been changed accidentally -when the card was taken out of its former host. The question remains -unanswered, what is the purpose of the "enhanced" position? - -[Avery's note: "enhanced" probably either disables shared memory (use IO -ports instead) or disables IO ports (use memory addresses instead). This -varies by the type of card involved. I fail to see how either of these -enhance anything. Send me more detailed information about this mode, or -just use "compatible" mode instead.] - - -***************************************************************************** - -** Waterloo Microsystems Inc. ?? ** -8-bit card (C) 1985 -------------------- - - from Robert Michael Best - -[Avery's note: these don't work with my driver for some reason. These cards -SEEM to have settings similar to the PDI508Plus, which is -software-configured and doesn't work with my driver either. The "Waterloo -chip" is a boot PROM, probably designed specifically for the University of -Waterloo. If you have any further information about this card, please -e-mail me.] - -The probe has not been able to detect the card on any of the J2 settings, -and I tried them again with the "Waterloo" chip removed. - - _____________________________________________________________________ -| \/ \/ ___ __ __ | -| C4 C4 |^| | M || ^ ||^| | -| -- -- |_| | 5 || || | C3 | -| \/ \/ C10 |___|| ||_| | -| C4 C4 _ _ | | ?? 
| -| -- -- | \/ || | | -| | || | | -| | || C1 | | -| | || | \/ _____| -| | C6 || | C9 | |___ -| | || | -- | BNC |___| -| | || | >C7| |_____| -| | || | | -| __ __ |____||_____| 1 2 3 6 | -|| ^ | >C4| |o|o|o|o|o|o| J2 >C4| | -|| | |o|o|o|o|o|o| | -|| C2 | >C4| >C4| | -|| | >C8| | -|| | 2 3 4 5 6 7 IRQ >C4| | -||_____| |o|o|o|o|o|o| J3 | -|_______ |o|o|o|o|o|o| _______________| - | | - |_____________________________________________| - -C1 -- "COM9026 - SMC 8638" - In a chip socket. - -C2 -- "@Copyright - Waterloo Microsystems Inc. - 1985" - In a chip Socket with info printed on a label covering a round window - showing the circuit inside. (The window indicates it is an EPROM chip.) - -C3 -- "COM9032 - SMC 8643" - In a chip socket. - -C4 -- "74LS" - 9 total no sockets. - -M5 -- "50006-136 - 20.000000 MHZ - MTQ-T1-S3 - 0 M-TRON 86-40" - Metallic case with 4 pins, no socket. - -C6 -- "MOSTEK@TC8643 - MK6116N-20 - MALAYSIA" - No socket. - -C7 -- No stamp or label but in a 20 pin chip socket. - -C8 -- "PAL10L8CN - 8623" - In a 20 pin socket. - -C9 -- "PAl16R4A-2CN - 8641" - In a 20 pin socket. - -C10 -- "M8640 - NMC - 9306N" - In an 8 pin socket. - -?? -- Some components on a smaller board and attached with 20 pins all - along the side closest to the BNC connector. The are coated in a dark - resin. - -On the board there are two jumper banks labeled J2 and J3. The -manufacturer didn't put a J1 on the board. The two boards I have both -came with a jumper box for each bank. - -J2 -- Numbered 1 2 3 4 5 6. - 4 and 5 are not stamped due to solder points. - -J3 -- IRQ 2 3 4 5 6 7 - -The board itself has a maple leaf stamped just above the irq jumpers -and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986 -CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector. 
-Below that "MADE IN CANADA" - - -***************************************************************************** - -** No Name ** -8-bit cards, 16-bit cards -------------------------- - - from Juergen Seifert - -NONAME 8-BIT ARCNET -=================== - -I have named this ARCnet card "NONAME", since there is no name of any -manufacturer on the Installation manual nor on the shipping box. The only -hint to the existence of a manufacturer at all is written in copper, -it is "Made in Taiwan" - -This description has been written by Juergen Seifert -using information from the Original - "ARCnet Installation Manual" - - - ________________________________________________________________ - | |STAR| BUS| T/P| | - | |____|____|____| | - | _____________________ | - | | | | - | | | | - | | | | - | | SMC | | - | | | | - | | COM90C65 | | - | | | | - | | | | - | |__________-__________| | - | _____| - | _______________ | CN | - | | PROM | |_____| - | > SOCKET | | - | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 | - | _______________ _______________ | - | |o|o|o|o|o|o|o|o| | SW1 || SW2 || - | |o|o|o|o|o|o|o|o| |_______________||_______________|| - |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____| - | \ IRQ / T T O | - |__________________1_2_M______________________| - -Legend: - -COM90C65: ARCnet Probe -S1 1-8: Node ID Select -S2 1-3: I/O Base Address Select - 4-6: Memory Base Address Select - 7-8: RAM Offset Select -ET1, ET2 Extended Timeout Select -ROM ROM Enable Select -CN RG62 Coax Connector -STAR| BUS | T/P Three fields for placing a sign (colored circle) - indicating the topology of the card - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW1 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 8 serves as the least significant bit (LSB). 
- -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 8 | 1 - 7 | 2 - 6 | 4 - 5 | 8 - 4 | 16 - 3 | 32 - 2 | 64 - 1 | 128 - -Some Examples: - - Switch | Hex | Decimal - 1 2 3 4 5 6 7 8 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group SW2 are used to select one -of eight possible I/O Base addresses using the following table - - Switch | Hex I/O - 1 2 3 | Address - ------------|-------- - ON ON ON | 260 - ON ON OFF | 290 - ON OFF ON | 2E0 (Manufacturer's default) - ON OFF OFF | 2F0 - OFF ON ON | 300 - OFF ON OFF | 350 - OFF OFF ON | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 4-6 of switch group SW2 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 7 and 8 of group SW2. 
- - Switch | Hex RAM | Hex ROM - 4 5 6 7 8 | Address | Address *) - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - -*) To enable the 8K Boot PROM install the jumper ROM. - The default is jumper ROM not installed. - - -Setting Interrupt Request Lines (IRQ) -------------------------------------- - -To select a hardware interrupt level set one (only one!) of the jumpers -IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2. - - -Setting the Timeouts --------------------- - -The two jumpers labeled ET1 and ET2 are used to determine the timeout -parameters (response and reconfiguration time). Every node in a network -must be set to the same timeout values. 
- - ET1 ET2 | Response Time (us) | Reconfiguration Time (ms) - --------|--------------------|-------------------------- - Off Off | 78 | 840 (Default) - Off On | 285 | 1680 - On Off | 563 | 1680 - On On | 1130 | 1680 - -On means jumper installed, Off means jumper not installed - - -NONAME 16-BIT ARCNET -==================== - -The manual of my 8-Bit NONAME ARCnet Card contains another description -of a 16-Bit Coax / Twisted Pair Card. This description is incomplete, -because there are missing two pages in the manual booklet. (The table -of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside -the booklet there is a different way of counting ... 2-9, 2-10, A-1, -(empty page), 3-1, ..., 3-18, A-1 (again), A-2) -Also the picture of the board layout is not as good as the picture of -8-Bit card, because there isn't any letter like "SW1" written to the -picture. -Should somebody have such a board, please feel free to complete this -description or to send a mail to me! - -This description has been written by Juergen Seifert -using information from the Original - "ARCnet Installation Manual" - - - ___________________________________________________________________ - < _________________ _________________ | - > | SW? || SW? | | - < |_________________||_________________| | - > ____________________ | - < | | | - > | | | - < | | | - > | | | - < | | | - > | | | - < | | | - > |____________________| | - < ____| - > ____________________ | | - < | | | J1 | - > | < | | - < |____________________| ? ? ? ? ? ? |____| - > |o|o|o|o|o|o| | - < |o|o|o|o|o|o| | - > | - < __ ___________| - > | | | - <____________| |_______________________________________| - - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW2 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 8 serves as the least significant bit (LSB). 
- -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 8 | 1 - 7 | 2 - 6 | 4 - 5 | 8 - 4 | 16 - 3 | 32 - 2 | 64 - 1 | 128 - -Some Examples: - - Switch | Hex | Decimal - 1 2 3 4 5 6 7 8 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group SW1 are used to select one -of eight possible I/O Base addresses using the following table - - Switch | Hex I/O - 3 2 1 | Address - ------------|-------- - ON ON ON | 260 - ON ON OFF | 290 - ON OFF ON | 2E0 (Manufacturer's default) - ON OFF OFF | 2F0 - OFF ON ON | 300 - OFF ON OFF | 350 - OFF OFF ON | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 6-8 of switch group SW1 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 4 and 5 of group SW1. 
- - Switch | Hex RAM | Hex ROM - 8 7 6 5 4 | Address | Address - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - - -Setting Interrupt Request Lines (IRQ) -------------------------------------- - -?????????????????????????????????????? - - -Setting the Timeouts --------------------- - -?????????????????????????????????????? - - -***************************************************************************** - -** No Name ** -8-bit cards ("Made in Taiwan R.O.C.") ------------ - - from Vojtech Pavlik - -I have named this ARCnet card "NONAME", since I got only the card with -no manual at all and the only text identifying the manufacturer is -"MADE IN TAIWAN R.O.C" printed on the card. 
- - ____________________________________________________________ - | 1 2 3 4 5 6 7 8 | - | |o|o| JP1 o|o|o|o|o|o|o|o| ON | - | + o|o|o|o|o|o|o|o| ___| - | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7 - | | | SW1 | | | | ID6 - | > RAM (2k) | ____________________ | H | | S | ID5 - | |_____________| | || y | | W | ID4 - | | || b | | 2 | ID3 - | | || r | | | ID2 - | | || i | | | ID1 - | | 90C65 || d | |___| ID0 - | SW3 | || | | - | |o|o|o|o|o|o|o|o| ON | || I | | - | |o|o|o|o|o|o|o|o| | || C | | - | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____| - | 1 2 3 4 5 6 7 8 | | | |___ - | ______________ | | | BNC |___| - | | | |_____| |_____| - | > EPROM SOCKET | | - | |______________| | - | ______________| - | | - |_____________________________________________| - -Legend: - -90C65 ARCNET Chip -SW1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -SW2 1-8: Node ID Select (ID0-ID7) -SW3 1-5: IRQ Select - 6-7: Extra Timeout - 8 : ROM Enable -JP1 Led connector -BNC Coax connector - -Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not -switches. - -Setting the jumpers to ON means connecting the upper two pins, off the bottom -two - or - in case of IRQ setting, connecting none of them at all. - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". 
- -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of jumper block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON. - -The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders. 
- -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block SW3 control the IRQ level. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 2 - OFF ON OFF OFF OFF | 3 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 5 - OFF OFF OFF OFF ON | 7 - - -Setting the Timeout Parameters ------------------------------- - -The jumpers 6-7 of the jumper block SW3 are used to determine the timeout -parameters. These two jumpers are normally left in the OFF position. - - -***************************************************************************** - -** No Name ** -(Generic Model 9058) --------------------- - - from Andrew J. Kroll - - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a - year!) - _____ - | < - | .---' - ________________________________________________________________ | | - | | SW2 | | | - | ___________ |_____________| | | - | | | 1 2 3 4 5 6 ___| | - | > 6116 RAM | _________ 8 | | | - | |___________| |20MHzXtal| 7 | | | - | |_________| __________ 6 | S | | - | 74LS373 | |- 5 | W | | - | _________ | E |- 4 | | | - | >_______| ______________|..... 
P |- 3 | 3 | | - | | | : O |- 2 | | | - | | | : X |- 1 |___| | - | ________________ | | : Y |- | | - | | SW1 | | SL90C65 | : |- | | - | |________________| | | : B |- | | - | 1 2 3 4 5 6 7 8 | | : O |- | | - | |_________o____|..../ A |- _______| | - | ____________________ | R |- | |------, - | | | | D |- | BNC | # | - | > 2764 PROM SOCKET | |__________|- |_______|------' - | |____________________| _________ | | - | >________| <- 74LS245 | | - | | | - |___ ______________| | - |H H H H H H H H H H H H H H H H H H H H H H H| | | - |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | | - \| -Legend: - -SL90C65 ARCNET Controller / Transceiver /Logic -SW1 1-5: IRQ Select - 6: ET1 - 7: ET2 - 8: ROM ENABLE -SW2 1-3: Memory Buffer/PROM Address - 3-6: I/O Address Map -SW3 1-8: Node ID Select -BNC BNC RG62/U Connection - *I* have had success using RG59B/U with *NO* terminators! - What gives?! - -SW1: Timeouts, Interrupt and ROM ---------------------------------- - -To select a hardware interrupt level set one (only one!) of the dip switches -up (on) SW1...(switches 1-5) -IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2. - -The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7) -are used to determine the timeout parameters. These two dip switches -are normally left off (down). - - To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM. - The default is jumper ROM not installed. - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch group SW2 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 4 5 6 | Address - -------|-------- - 0 0 0 | 260 - 0 0 1 | 290 - 0 1 0 | 2E0 (Manufacturer's default) - 0 1 1 | 2F0 - 1 0 0 | 300 - 1 0 1 | 350 - 1 1 0 | 380 - 1 1 1 | 3E0 - - -Setting the Base Memory Address (RAM & ROM) -------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. 
The base of this -16K block can be located in any of eight positions. -Switches 1-3 of switch group SW2 select the Base of the 16K block. -(0 = DOWN, 1 = UP) -I could, however, only verify two settings... - - Switch| Hex RAM | Hex ROM - 1 2 3 | Address | Address - ------|---------|----------- - 0 0 0 | E0000 | E2000 - 0 0 1 | D0000 | D2000 (Manufacturer's default) - 0 1 0 | ????? | ????? - 0 1 1 | ????? | ????? - 1 0 0 | ????? | ????? - 1 0 1 | ????? | ????? - 1 1 0 | ????? | ????? - 1 1 1 | ????? | ????? - - -Setting the Node ID -------------------- - -The eight switches in group SW3 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 1 serves as the least significant bit (LSB). -switches in the DOWN position are OFF (0) and in the UP position are ON (1) - -The node ID is the sum of the values of all switches set to "1" -These values are: - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Some Examples: - - Switch# | Hex | Decimal -8 7 6 5 4 3 2 1 | Node ID | Node ID -----------------|---------|--------- -0 0 0 0 0 0 0 0 | not allowed <-. -0 0 0 0 0 0 0 1 | 1 | 1 | -0 0 0 0 0 0 1 0 | 2 | 2 | -0 0 0 0 0 0 1 1 | 3 | 3 | - . . . | | | -0 1 0 1 0 1 0 1 | 55 | 85 | - . . . | | + Don't use 0 or 255! -1 0 1 0 1 0 1 0 | AA | 170 | - . . . | | | -1 1 1 1 1 1 0 1 | FD | 253 | -1 1 1 1 1 1 1 0 | FE | 254 | -1 1 1 1 1 1 1 1 | FF | 255 <-' - - -***************************************************************************** - -** Tiara ** -(model unknown) -------------------------- - - from Christoph Lameter - - -Here is information about my card as far as I could figure it out: ------------------------------------------------ tiara -Tiara LanCard of Tiara Computer Systems. - -+----------------------------------------------+ -! ! Transmitter Unit ! ! -! +------------------+ ------- -! MEM Coax Connector -! ROM 7654321 <- I/O ------- -! 
: : +--------+ ! -! : : ! 90C66LJ! +++ -! : : ! ! !D Switch to set -! : : ! ! !I the Nodenumber -! : : +--------+ !P -! !++ -! 234567 <- IRQ ! -+------------!!!!!!!!!!!!!!!!!!!!!!!!--------+ - !!!!!!!!!!!!!!!!!!!!!!!! - -0 = Jumper Installed -1 = Open - -Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O - -Settings for Memory Location (Top Jumper Line) -456 Address selected -000 C0000 -001 C4000 -010 CC000 -011 D0000 -100 D4000 -101 D8000 -110 DC000 -111 E0000 - -Settings for I/O Address (Top Jumper Line) -123 Port -000 260 -001 290 -010 2E0 -011 2F0 -100 300 -101 350 -110 380 -111 3E0 - -Settings for IRQ Selection (Lower Jumper Line) -234567 -011111 IRQ 2 -101111 IRQ 3 -110111 IRQ 4 -111011 IRQ 5 -111110 IRQ 7 - -***************************************************************************** - - -Other Cards ------------ - -I have no information on other models of ARCnet cards at the moment. Please -send any and all info to: - apenwarr@worldvisions.ca - -Thanks. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 96ffad845fd9..5da18e024fcb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -39,6 +39,7 @@ Contents: 6lowpan 6pack altera_tse + arcnet-hardware .. only:: subproject and html -- cgit From 08bab46f00d0f0fe9709a05b7cdfe909a4258b01 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:20 +0200 Subject: docs: networking: convert arcnet.txt to ReST - add SPDX header; - use document title markup; - add notes markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/arcnet.rst | 594 ++++++++++++++++++++++++++++++++++++ Documentation/networking/arcnet.txt | 556 --------------------------------- Documentation/networking/index.rst | 1 + drivers/net/arcnet/Kconfig | 6 +- 4 files changed, 598 insertions(+), 559 deletions(-) create mode 100644 Documentation/networking/arcnet.rst delete mode 100644 Documentation/networking/arcnet.txt diff --git a/Documentation/networking/arcnet.rst b/Documentation/networking/arcnet.rst new file mode 100644 index 000000000000..e93d9820f0f1 --- /dev/null +++ b/Documentation/networking/arcnet.rst @@ -0,0 +1,594 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +ARCnet +====== + +.. note:: + + See also arcnet-hardware.rst in this directory for jumper-setting + and cabling information if you're like many of us and didn't happen to get a + manual with your ARCnet card. + +Since no one seems to listen to me otherwise, perhaps a poem will get your +attention:: + + This driver's getting fat and beefy, + But my cat is still named Fifi. + +Hmm, I think I'm allowed to call that a poem, even though it's only two +lines. Hey, I'm in Computer Science, not English. Give me a break. + +The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if +you test this and get it working. Or if you don't. Or anything. + +ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was +nice, but after that even FEWER people started writing to me because they +didn't even have to install the patch. + +Come on, be a sport! Send me a success report! + +(hey, that was even better than my original poem... this is getting bad!) + + +.. warning:: + + If you don't e-mail me about your success/failure soon, I may be forced to + start SINGING. And we don't want that, do we? + + (You know, it might be argued that I'm pushing this point a little too much. + If you think so, why not flame me in a quick little e-mail?
Please also + include the type of card(s) you're using, software, size of network, and + whether it's working or not.) + + My e-mail address is: apenwarr@worldvisions.ca + +These are the ARCnet drivers for Linux. + +This new release (2.91) has been put together by David Woodhouse +, in an attempt to tidy up the driver after adding support +for yet another chipset. Now the generic support has been separated from the +individual chipset drivers, and the source files aren't quite so packed with +#ifdefs! I've changed this file a bit, but kept it in the first person from +Avery, because I didn't want to completely rewrite it. + +The previous release resulted from many months of on-and-off effort from me +(Avery Pennarun), many bug reports/fixes and suggestions from others, and in +particular a lot of input and coding from Tomasz Motylewski. Starting with +ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been +included and seems to be working fine! + + +Where do I discuss these drivers? +--------------------------------- + +Tomasz has been so kind as to set up a new and improved mailing list. +Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR +REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the +list, mail to linux-arcnet@tichy.ch.uj.edu.pl. + +There are archives of the mailing list at: + + http://epistolary.org/mailman/listinfo.cgi/arcnet + +The people on linux-net@vger.kernel.org (now defunct, replaced by +netdev@vger.kernel.org) have also been known to be very helpful, especially +when we're talking about ALPHA Linux kernels that may or may not work right +in the first place. + + +Other Drivers and Info +---------------------- + +You can try my ARCNET page on the World Wide Web at: + + http://www.qis.net/~jschmitz/arcnet/ + +Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you +might be interested in, which includes several drivers for various cards +including ARCnet. 
Try: + + http://www.smc.com/ + +Performance Technologies makes various network software that supports +ARCnet: + + http://www.perftech.com/ or ftp to ftp.perftech.com. + +Novell makes a networking stack for DOS which includes ARCnet drivers. Try +FTPing to ftp.novell.com. + +You can get the Crynwr packet driver collection (including arcether.com, the +one you'll want to use with ARCnet cards) from +oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+ +without patches, though, and also doesn't like several cards. Fixed +versions are available on my WWW page, or via e-mail if you don't have WWW +access. + + +Installing the Driver +--------------------- + +All you will need to do in order to install the driver is:: + + make config + (be sure to choose ARCnet in the network devices + and at least one chipset driver.) + make clean + make zImage + +If you obtained this ARCnet package as an upgrade to the ARCnet driver in +your current kernel, you will need to first copy arcnet.c over the one in +the linux/drivers/net directory. + +You will know the driver is installed properly if you get some ARCnet +messages when you reboot into the new Linux kernel. + +There are four chipset options: + + 1. Standard ARCnet COM90xx chipset. + +This is the normal ARCnet card, which you've probably got. This is the only +chipset driver which will autoprobe if not told where the card is. +It takes the following options on the command line:: + + com90xx=[[,[,]]][,] | + +If you load the chipset support as a module, the options are:: + + io= irq= shmem= device= + +To disable the autoprobe, just specify "com90xx=" on the kernel command line. +To specify the name alone, but allow autoprobe, just put "com90xx=" + + 2. ARCnet COM20020 chipset. + +This is the new chipset from SMC with support for promiscuous mode (packet +sniffing), extra diagnostic information, etc. Unfortunately, there is no +sensible method of autoprobing for these cards.
You must specify the I/O +address on the kernel command line. + +The command line options are:: + + com20020=[,[,[,backplane[,CKP[,timeout]]]]][,name] + +If you load the chipset support as a module, the options are:: + + io= irq= node= backplane= clock= + timeout= device= + +The COM20020 chipset allows you to set the node ID in software, overriding the +default which is still set in DIP switches on the card. If you don't have the +COM20020 data sheets, and you don't know what the other three options refer +to, then they won't interest you - forget them. + + 3. ARCnet COM90xx chipset in IO-mapped mode. + +This will also work with the normal ARCnet cards, but doesn't use the shared +memory. It performs less well than the above driver, but is provided in case +you have a card which doesn't support shared memory, or (strangely) in case +you have so many ARCnet cards in your machine that you run out of shmem slots. +If you don't give the IO address on the kernel command line, then the driver +will not find the card. + +The command line options are:: + + com90io=[,][,] + +If you load the chipset support as a module, the options are: + io= irq= device= + + 4. ARCnet RIM I cards. + +These are COM90xx chips which are _completely_ memory mapped. The support for +these is not tested. If you have one, please mail the author with a success +report. All options must be specified, except the device name. +Command line options:: + + arcrimi=,,[,] + +If you load the chipset support as a module, the options are:: + + shmem= irq= node= device= + + +Loadable Module Support +----------------------- + +Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet +support" and to support for your ARCnet chipset if you want to use the +loadable module. You can also say 'y' to "Generic ARCnet support" and 'm' +to the chipset support if you wish. 
+ +:: + + make config + make clean + make zImage + make modules + +If you're using a loadable module, you need to use insmod to load it, and +you can specify various characteristics of your card on the command +line. (In recent versions of the driver, autoprobing is much more reliable +and works as a module, so most of this is now unnecessary.) + +For example:: + + cd /usr/src/linux/modules + insmod arcnet.o + insmod com90xx.o + insmod com20020.o io=0x2e0 device=eth1 + + +Using the Driver +---------------- + +If you build your kernel with ARCnet COM90xx support included, it should +probe for your card automatically when you boot. If you use a different +chipset driver compiled into the kernel, you must give the necessary options +on the kernel command line, as detailed above. + +Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be +available where you picked up this driver. Think of your ARCnet as a +souped-up (or down, as the case may be) Ethernet card. + +By the way, be sure to change all references from "eth0" to "arc0" in the +HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name +is DIFFERENT. + + +Multiple Cards in One Computer +------------------------------ + +Linux has pretty good support for this now, but since I've been busy, the +ARCnet driver has somewhat suffered in this respect. COM90xx support, if +compiled into the kernel, will (try to) autodetect all the installed cards. + +If you have other cards, with support compiled into the kernel, then you can +just repeat the options on the kernel command line, e.g.:: + + LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260 + +If you have the chipset support built as a loadable module, then you need to +do something like this:: + + insmod -o arc0 com90xx + insmod -o arc1 com20020 io=0x2e0 + insmod -o arc2 com90xx + +The ARCnet drivers will now sort out their names automatically. + + +How do I get it to work with...?
+-------------------------------- + +NFS: + Should be fine linux->linux, just pretend you're using Ethernet cards. + oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There + is also a DOS-based NFS server called SOSS. It doesn't multitask + quite the way Linux does (actually, it doesn't multitask AT ALL) but + you never know what you might need. + + With AmiTCP (and possibly others), you may need to set the following + options in your Amiga nfstab: MD 1024 MR 1024 MW 1024 + (Thanks to Christian Gottschling + for this.) + + Probably these refer to maximum NFS data/read/write block sizes. I + don't know why the defaults on the Amiga didn't work; write to me if + you know more. + +DOS: + If you're using the freeware arcether.com, you might want to install + the driver patch from my web page. It helps with PC/TCP, and also + can get arcether to load if it timed out too quickly during + initialization. In fact, if you use it on a 386+ you REALLY need + the patch, really. + +Windows: + See DOS :) Trumpet Winsock works fine with either the Novell or + Arcether client, assuming you remember to load winpkt of course. + +LAN Manager and Windows for Workgroups: + These programs use protocols that + are incompatible with the Internet standard. They try to pretend + the cards are Ethernet, and confuse everyone else on the network. + + However, v2.00 and higher of the Linux ARCnet driver supports this + protocol via the 'arc0e' device. See the section on "Multiprotocol + Support" for more information. + + Using the freeware Samba server and clients for Linux, you can now + interface quite nicely with TCP/IP-based WfWg or Lan Manager + networks. + +Windows 95: + Tools are included with Win95 that let you use either the LANMAN + style network drivers (NDIS) or Novell drivers (ODI) to handle your + ARCnet packets. If you use ODI, you'll need to use the 'arc0' + device with Linux. If you use NDIS, then try the 'arc0e' device. 
+ See the "Multiprotocol Support" section below if you need arc0e, + you're completely insane, and/or you need to build some kind of + hybrid network that uses both encapsulation types. + +OS/2: + I've been told it works under Warp Connect with an ARCnet driver from + SMC. You need to use the 'arc0e' interface for this. If you get + the SMC driver to work with the TCP/IP stuff included in the + "normal" Warp Bonus Pack, let me know. + + ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client + which should use the same protocol as WfWg does. I had no luck + installing it under Warp, however. Please mail me with any results. + +NetBSD/AmiTCP: + These use an old version of the Internet standard ARCnet + protocol (RFC1051) which is compatible with the Linux driver v2.10 + ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet" + below.) ** Newer versions of NetBSD apparently support RFC1201. + + +Using Multiprotocol ARCnet +-------------------------- + +The ARCnet driver v2.10 ALPHA supports three protocols, each on its own +"virtual network device": + + ====== =============================================================== + arc0 RFC1201 protocol, the official Internet standard which just + happens to be 100% compatible with Novell's TRXNET driver. + Version 1.00 of the ARCnet driver supported _only_ this + protocol. arc0 is the fastest of the three protocols (for + whatever reason), and allows larger packets to be used + because it supports RFC1201 "packet splitting" operations. + Unless you have a specific need to use a different protocol, + I strongly suggest that you stick with this one. + + arc0e "Ethernet-Encapsulation" which sends packets over ARCnet + that are actually a lot like Ethernet packets, including the + 6-byte hardware addresses. This protocol is compatible with + Microsoft's NDIS ARCnet driver, like the one in WfWg and + LANMAN. 
Because the MTU of 493 is actually smaller than the + one "required" by TCP/IP (576), there is a chance that some + network operations will not function properly. The Linux + TCP/IP layer can compensate in most cases, however, by + automatically fragmenting the TCP/IP packets to make them + fit. arc0e also works slightly more slowly than arc0, for + reasons yet to be determined. (Probably it's the smaller + MTU that does it.) + + arc0s The "[s]imple" RFC1051 protocol is the "previous" Internet + standard that is completely incompatible with the new + standard. Some software today, however, continues to + support the old standard (and only the old standard) + including NetBSD and AmiTCP. RFC1051 also does not support + RFC1201's packet splitting, and the MTU of 507 is still + smaller than the Internet "requirement," so it's quite + possible that you may run into problems. It's also slower + than RFC1201 by about 25%, for the same reason as arc0e. + + The arc0s support was contributed by Tomasz Motylewski + and modified somewhat by me. Bugs are probably my fault. + ====== =============================================================== + +You can choose not to compile arc0e and arc0s into the driver if you want - +this will save you a bit of memory and avoid confusion when eg. trying to +use the "NFS-root" stuff in recent Linux kernels. + +The arc0e and arc0s devices are created automatically when you first +ifconfig the arc0 device. To actually use them, though, you need to also +ifconfig the other virtual devices you need. There are a number of ways you +can set up your network then: + + +1. Single Protocol. + + This is the simplest way to configure your network: use just one of the + two available protocols. As mentioned above, it's a good idea to use + only arc0 unless you have a good reason (like some other software, ie. + WfWg, that only works with arc0e). 
+ + If you need only arc0, then the following commands should get you going:: + + ifconfig arc0 MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0 + route add -net SUB.NET.ADD.RESS arc0 + [add other local routes here] + + If you need arc0e (and only arc0e), it's a little different:: + + ifconfig arc0 MY.IP.ADD.RESS + ifconfig arc0e MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0e + route add -net SUB.NET.ADD.RESS arc0e + + arc0s works much the same way as arc0e. + + +2. More than one protocol on the same wire. + + Now things start getting confusing. To even try it, you may need to be + partly crazy. Here's what *I* did. :) Note that I don't include arc0s in + my home network; I don't have any NetBSD or AmiTCP computers, so I only + use arc0s during limited testing. + + I have three computers on my home network; two Linux boxes (which prefer + RFC1201 protocol, for reasons listed above), and one XT that can't run + Linux but runs the free Microsoft LANMAN Client instead. + + Worse, one of the Linux computers (freedom) also has a modem and acts as + a router to my Internet provider. The other Linux box (insight) also has + its own IP address and needs to use freedom as its default gateway. The + XT (patience), however, does not have its own Internet IP address and so + I assigned it one on a "private subnet" (as defined by RFC1597). + + To start with, take a simple network with just insight and freedom. + Insight needs to: + + - talk to freedom via RFC1201 (arc0) protocol, because I like it + more and it's faster. + - use freedom as its Internet gateway. + + That's pretty easy to do. Set up insight like this:: + + ifconfig arc0 insight + route add insight arc0 + route add freedom arc0 /* I would use the subnet here (like I said + to do in "single protocol" above), + but the rest of the subnet + unfortunately lies across the PPP + link on freedom, which confuses + things. 
*/ + route add default gw freedom + + And freedom gets configured like so:: + + ifconfig arc0 freedom + route add freedom arc0 + route add insight arc0 + /* and default gateway is configured by pppd */ + + Great, now insight talks to freedom directly on arc0, and sends packets + to the Internet through freedom. If you didn't know how to do the above, + you should probably stop reading this section now because it only gets + worse. + + Now, how do I add patience into the network? It will be using LANMAN + Client, which means I need the arc0e device. It needs to be able to talk + to both insight and freedom, and also use freedom as a gateway to the + Internet. (Recall that patience has a "private IP address" which won't + work on the Internet; that's okay, I configured Linux IP masquerading on + freedom for this subnet). + + So patience (necessarily; I don't have another IP number from my + provider) has an IP address on a different subnet than freedom and + insight, but needs to use freedom as an Internet gateway. Worse, most + DOS networking programs, including LANMAN, have braindead networking + schemes that rely completely on the netmask and a 'default gateway' to + determine how to route packets. This means that to get to freedom or + insight, patience WILL send through its default gateway, regardless of + the fact that both freedom and insight (courtesy of the arc0e device) + could understand a direct transmission. + + I compensate by giving freedom an extra IP address - aliased 'gatekeeper' - + that is on my private subnet, the same subnet that patience is on. I + then define gatekeeper to be the default gateway for patience. + + To configure freedom (in addition to the commands above):: + + ifconfig arc0e gatekeeper + route add gatekeeper arc0e + route add patience arc0e + + This way, freedom will send all packets for patience through arc0e, + giving its IP address as gatekeeper (on the private subnet). 
When it + talks to insight or the Internet, it will use its "freedom" Internet IP + address. + + You will notice that we haven't configured the arc0e device on insight. + This would work, but is not really necessary, and would require me to + assign insight another special IP number from my private subnet. Since + both insight and patience are using freedom as their default gateway, the + two can already talk to each other. + + It's quite fortunate that I set things up like this the first time (cough + cough) because it's really handy when I boot insight into DOS. There, it + runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet. + In this mode it would be impossible for insight to communicate directly + with patience, since the Novell stack is incompatible with Microsoft's + Ethernet-Encap. Without changing any settings on freedom or patience, I + simply set freedom as the default gateway for insight (now in DOS, + remember) and all the forwarding happens "automagically" between the two + hosts that would normally not be able to communicate at all. + + For those who like diagrams, I have created two "virtual subnets" on the + same physical ARCnet wire. You can picture it like this:: + + + [RFC1201 NETWORK] [ETHER-ENCAP NETWORK] + (registered Internet subnet) (RFC1597 private subnet) + + (IP Masquerade) + /---------------\ * /---------------\ + | | * | | + | +-Freedom-*-Gatekeeper-+ | + | | | * | | + \-------+-------/ | * \-------+-------/ + | | | + Insight | Patience + (Internet) + + + +It works: what now? +------------------- + +Send mail describing your setup, preferably including driver version, kernel +version, ARCnet card model, CPU type, number of systems on your network, and +list of software in use to me at the following address: + + apenwarr@worldvisions.ca + +I do send (sometimes automated) replies to all messages I receive. 
My email +can be weird (and also usually gets forwarded all over the place along the +way to me), so if you don't get a reply within a reasonable time, please +resend. + + +It doesn't work: what now? +-------------------------- + +Do the same as above, but also include the output of the ifconfig and route +commands, as well as any pertinent log entries (ie. anything that starts +with "arcnet:" and has shown up since the last reboot) in your mail. + +If you want to try fixing it yourself (I strongly recommend that you mail me +about the problem first, since it might already have been solved) you may +want to try some of the debug levels available. For heavy testing on +D_DURING or more, it would be a REALLY good idea to kill your klogd daemon +first! D_DURING displays 4-5 lines for each packet sent or received. D_TX, +D_RX, and D_SKB actually DISPLAY each packet as it is sent or received, +which is obviously quite big. + +Starting with v2.40 ALPHA, the autoprobe routines have changed +significantly. In particular, they won't tell you why the card was not +found unless you turn on the D_INIT_REASONS debugging flag. + +Once the driver is running, you can run the arcdump shell script (available +from me or in the full ARCnet package, if you have it) as root to list the +contents of the arcnet buffers at any time. To make any sense at all out of +this, you should grab the pertinent RFCs. (some are listed near the top of +arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the +script. + +Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending. +Ping-pong buffers are implemented both ways. + +If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY, +the buffers are cleared to a constant value of 0x42 every time the card is +reset (which should only happen when you do an ifconfig up, or when Linux +decides that the driver is broken). During a transmit, unused parts of the +buffer will be cleared to 0x42 as well. 
This is to make it easier to figure +out which bytes are being used by a packet. + +You can change the debug level without recompiling the kernel by typing:: + + ifconfig arc0 down metric 1xxx + /etc/rc.d/rc.inet1 + +where "xxx" is the debug level you want. For example, "metric 1015" would put +you at debug level 15. Debug level 7 is currently the default. + +Note that the debug level is (starting with v1.90 ALPHA) a binary +combination of different debug flags; so debug level 7 is really 1+2+4 or +D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this, +resulting in debug level 23. + +If you don't understand that, you probably don't want to know anyway. +E-mail me about your problem. + + +I want to send money: what now? +------------------------------- + +Go take a nap or something. You'll feel better in the morning. diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt deleted file mode 100644 index aff97f47c05c..000000000000 --- a/Documentation/networking/arcnet.txt +++ /dev/null @@ -1,556 +0,0 @@ ----------------------------------------------------------------------------- -NOTE: See also arcnet-hardware.txt in this directory for jumper-setting -and cabling information if you're like many of us and didn't happen to get a -manual with your ARCnet card. ----------------------------------------------------------------------------- - -Since no one seems to listen to me otherwise, perhaps a poem will get your -attention: - This driver's getting fat and beefy, - But my cat is still named Fifi. - -Hmm, I think I'm allowed to call that a poem, even though it's only two -lines. Hey, I'm in Computer Science, not English. Give me a break. - -The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if -you test this and get it working. Or if you don't. Or anything. 
- -ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was -nice, but after that even FEWER people started writing to me because they -didn't even have to install the patch. - -Come on, be a sport! Send me a success report! - -(hey, that was even better than my original poem... this is getting bad!) - - --------- -WARNING: --------- - -If you don't e-mail me about your success/failure soon, I may be forced to -start SINGING. And we don't want that, do we? - -(You know, it might be argued that I'm pushing this point a little too much. -If you think so, why not flame me in a quick little e-mail? Please also -include the type of card(s) you're using, software, size of network, and -whether it's working or not.) - -My e-mail address is: apenwarr@worldvisions.ca - - ---------------------------------------------------------------------------- - - -These are the ARCnet drivers for Linux. - - -This new release (2.91) has been put together by David Woodhouse -, in an attempt to tidy up the driver after adding support -for yet another chipset. Now the generic support has been separated from the -individual chipset drivers, and the source files aren't quite so packed with -#ifdefs! I've changed this file a bit, but kept it in the first person from -Avery, because I didn't want to completely rewrite it. - -The previous release resulted from many months of on-and-off effort from me -(Avery Pennarun), many bug reports/fixes and suggestions from others, and in -particular a lot of input and coding from Tomasz Motylewski. Starting with -ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been -included and seems to be working fine! - - -Where do I discuss these drivers? ---------------------------------- - -Tomasz has been so kind as to set up a new and improved mailing list. -Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR -REAL NAME" to listserv@tichy.ch.uj.edu.pl. 
Then, to submit messages to the -list, mail to linux-arcnet@tichy.ch.uj.edu.pl. - -There are archives of the mailing list at: - http://epistolary.org/mailman/listinfo.cgi/arcnet - -The people on linux-net@vger.kernel.org (now defunct, replaced by -netdev@vger.kernel.org) have also been known to be very helpful, especially -when we're talking about ALPHA Linux kernels that may or may not work right -in the first place. - - -Other Drivers and Info ----------------------- - -You can try my ARCNET page on the World Wide Web at: - http://www.qis.net/~jschmitz/arcnet/ - -Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you -might be interested in, which includes several drivers for various cards -including ARCnet. Try: - http://www.smc.com/ - -Performance Technologies makes various network software that supports -ARCnet: - http://www.perftech.com/ or ftp to ftp.perftech.com. - -Novell makes a networking stack for DOS which includes ARCnet drivers. Try -FTPing to ftp.novell.com. - -You can get the Crynwr packet driver collection (including arcether.com, the -one you'll want to use with ARCnet cards) from -oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+ -without patches, though, and also doesn't like several cards. Fixed -versions are available on my WWW page, or via e-mail if you don't have WWW -access. - - -Installing the Driver ---------------------- - -All you will need to do in order to install the driver is: - make config - (be sure to choose ARCnet in the network devices - and at least one chipset driver.) - make clean - make zImage - -If you obtained this ARCnet package as an upgrade to the ARCnet driver in -your current kernel, you will need to first copy arcnet.c over the one in -the linux/drivers/net directory. - -You will know the driver is installed properly if you get some ARCnet -messages when you reboot into the new Linux kernel. - -There are four chipset options: - - 1. Standard ARCnet COM90xx chipset. 
- -This is the normal ARCnet card, which you've probably got. This is the only -chipset driver which will autoprobe if not told where the card is. -It following options on the command line: - com90xx=[[,[,]]][,] | - -If you load the chipset support as a module, the options are: - io= irq= shmem= device= - -To disable the autoprobe, just specify "com90xx=" on the kernel command line. -To specify the name alone, but allow autoprobe, just put "com90xx=" - - 2. ARCnet COM20020 chipset. - -This is the new chipset from SMC with support for promiscuous mode (packet -sniffing), extra diagnostic information, etc. Unfortunately, there is no -sensible method of autoprobing for these cards. You must specify the I/O -address on the kernel command line. -The command line options are: - com20020=[,[,[,backplane[,CKP[,timeout]]]]][,name] - -If you load the chipset support as a module, the options are: - io= irq= node= backplane= clock= - timeout= device= - -The COM20020 chipset allows you to set the node ID in software, overriding the -default which is still set in DIP switches on the card. If you don't have the -COM20020 data sheets, and you don't know what the other three options refer -to, then they won't interest you - forget them. - - 3. ARCnet COM90xx chipset in IO-mapped mode. - -This will also work with the normal ARCnet cards, but doesn't use the shared -memory. It performs less well than the above driver, but is provided in case -you have a card which doesn't support shared memory, or (strangely) in case -you have so many ARCnet cards in your machine that you run out of shmem slots. -If you don't give the IO address on the kernel command line, then the driver -will not find the card. -The command line options are: - com90io=[,][,] - -If you load the chipset support as a module, the options are: - io= irq= device= - - 4. ARCnet RIM I cards. - -These are COM90xx chips which are _completely_ memory mapped. The support for -these is not tested. 
If you have one, please mail the author with a success -report. All options must be specified, except the device name. -Command line options: - arcrimi=,,[,] - -If you load the chipset support as a module, the options are: - shmem= irq= node= device= - - -Loadable Module Support ------------------------ - -Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet -support" and to support for your ARCnet chipset if you want to use the -loadable module. You can also say 'y' to "Generic ARCnet support" and 'm' -to the chipset support if you wish. - - make config - make clean - make zImage - make modules - -If you're using a loadable module, you need to use insmod to load it, and -you can specify various characteristics of your card on the command -line. (In recent versions of the driver, autoprobing is much more reliable -and works as a module, so most of this is now unnecessary.) - -For example: - cd /usr/src/linux/modules - insmod arcnet.o - insmod com90xx.o - insmod com20020.o io=0x2e0 device=eth1 - - -Using the Driver ----------------- - -If you build your kernel with ARCnet COM90xx support included, it should -probe for your card automatically when you boot. If you use a different -chipset driver complied into the kernel, you must give the necessary options -on the kernel command line, as detailed above. - -Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be -available where you picked up this driver. Think of your ARCnet as a -souped-up (or down, as the case may be) Ethernet card. - -By the way, be sure to change all references from "eth0" to "arc0" in the -HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name -is DIFFERENT. - - -Multiple Cards in One Computer ------------------------------- - -Linux has pretty good support for this now, but since I've been busy, the -ARCnet driver has somewhat suffered in this respect. 
COM90xx support, if -compiled into the kernel, will (try to) autodetect all the installed cards. - -If you have other cards, with support compiled into the kernel, then you can -just repeat the options on the kernel command line, e.g.: -LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260 - -If you have the chipset support built as a loadable module, then you need to -do something like this: - insmod -o arc0 com90xx - insmod -o arc1 com20020 io=0x2e0 - insmod -o arc2 com90xx -The ARCnet drivers will now sort out their names automatically. - - -How do I get it to work with...? --------------------------------- - -NFS: Should be fine linux->linux, just pretend you're using Ethernet cards. - oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There - is also a DOS-based NFS server called SOSS. It doesn't multitask - quite the way Linux does (actually, it doesn't multitask AT ALL) but - you never know what you might need. - - With AmiTCP (and possibly others), you may need to set the following - options in your Amiga nfstab: MD 1024 MR 1024 MW 1024 - (Thanks to Christian Gottschling - for this.) - - Probably these refer to maximum NFS data/read/write block sizes. I - don't know why the defaults on the Amiga didn't work; write to me if - you know more. - -DOS: If you're using the freeware arcether.com, you might want to install - the driver patch from my web page. It helps with PC/TCP, and also - can get arcether to load if it timed out too quickly during - initialization. In fact, if you use it on a 386+ you REALLY need - the patch, really. - -Windows: See DOS :) Trumpet Winsock works fine with either the Novell or - Arcether client, assuming you remember to load winpkt of course. - -LAN Manager and Windows for Workgroups: These programs use protocols that - are incompatible with the Internet standard. They try to pretend - the cards are Ethernet, and confuse everyone else on the network. 
- - However, v2.00 and higher of the Linux ARCnet driver supports this - protocol via the 'arc0e' device. See the section on "Multiprotocol - Support" for more information. - - Using the freeware Samba server and clients for Linux, you can now - interface quite nicely with TCP/IP-based WfWg or Lan Manager - networks. - -Windows 95: Tools are included with Win95 that let you use either the LANMAN - style network drivers (NDIS) or Novell drivers (ODI) to handle your - ARCnet packets. If you use ODI, you'll need to use the 'arc0' - device with Linux. If you use NDIS, then try the 'arc0e' device. - See the "Multiprotocol Support" section below if you need arc0e, - you're completely insane, and/or you need to build some kind of - hybrid network that uses both encapsulation types. - -OS/2: I've been told it works under Warp Connect with an ARCnet driver from - SMC. You need to use the 'arc0e' interface for this. If you get - the SMC driver to work with the TCP/IP stuff included in the - "normal" Warp Bonus Pack, let me know. - - ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client - which should use the same protocol as WfWg does. I had no luck - installing it under Warp, however. Please mail me with any results. - -NetBSD/AmiTCP: These use an old version of the Internet standard ARCnet - protocol (RFC1051) which is compatible with the Linux driver v2.10 - ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet" - below.) ** Newer versions of NetBSD apparently support RFC1201. - - -Using Multiprotocol ARCnet --------------------------- - -The ARCnet driver v2.10 ALPHA supports three protocols, each on its own -"virtual network device": - - arc0 - RFC1201 protocol, the official Internet standard which just - happens to be 100% compatible with Novell's TRXNET driver. - Version 1.00 of the ARCnet driver supported _only_ this - protocol. 
arc0 is the fastest of the three protocols (for - whatever reason), and allows larger packets to be used - because it supports RFC1201 "packet splitting" operations. - Unless you have a specific need to use a different protocol, - I strongly suggest that you stick with this one. - - arc0e - "Ethernet-Encapsulation" which sends packets over ARCnet - that are actually a lot like Ethernet packets, including the - 6-byte hardware addresses. This protocol is compatible with - Microsoft's NDIS ARCnet driver, like the one in WfWg and - LANMAN. Because the MTU of 493 is actually smaller than the - one "required" by TCP/IP (576), there is a chance that some - network operations will not function properly. The Linux - TCP/IP layer can compensate in most cases, however, by - automatically fragmenting the TCP/IP packets to make them - fit. arc0e also works slightly more slowly than arc0, for - reasons yet to be determined. (Probably it's the smaller - MTU that does it.) - - arc0s - The "[s]imple" RFC1051 protocol is the "previous" Internet - standard that is completely incompatible with the new - standard. Some software today, however, continues to - support the old standard (and only the old standard) - including NetBSD and AmiTCP. RFC1051 also does not support - RFC1201's packet splitting, and the MTU of 507 is still - smaller than the Internet "requirement," so it's quite - possible that you may run into problems. It's also slower - than RFC1201 by about 25%, for the same reason as arc0e. - - The arc0s support was contributed by Tomasz Motylewski - and modified somewhat by me. Bugs are probably my fault. - -You can choose not to compile arc0e and arc0s into the driver if you want - -this will save you a bit of memory and avoid confusion when eg. trying to -use the "NFS-root" stuff in recent Linux kernels. - -The arc0e and arc0s devices are created automatically when you first -ifconfig the arc0 device. 
To actually use them, though, you need to also -ifconfig the other virtual devices you need. There are a number of ways you -can set up your network then: - - -1. Single Protocol. - - This is the simplest way to configure your network: use just one of the - two available protocols. As mentioned above, it's a good idea to use - only arc0 unless you have a good reason (like some other software, ie. - WfWg, that only works with arc0e). - - If you need only arc0, then the following commands should get you going: - ifconfig arc0 MY.IP.ADD.RESS - route add MY.IP.ADD.RESS arc0 - route add -net SUB.NET.ADD.RESS arc0 - [add other local routes here] - - If you need arc0e (and only arc0e), it's a little different: - ifconfig arc0 MY.IP.ADD.RESS - ifconfig arc0e MY.IP.ADD.RESS - route add MY.IP.ADD.RESS arc0e - route add -net SUB.NET.ADD.RESS arc0e - - arc0s works much the same way as arc0e. - - -2. More than one protocol on the same wire. - - Now things start getting confusing. To even try it, you may need to be - partly crazy. Here's what *I* did. :) Note that I don't include arc0s in - my home network; I don't have any NetBSD or AmiTCP computers, so I only - use arc0s during limited testing. - - I have three computers on my home network; two Linux boxes (which prefer - RFC1201 protocol, for reasons listed above), and one XT that can't run - Linux but runs the free Microsoft LANMAN Client instead. - - Worse, one of the Linux computers (freedom) also has a modem and acts as - a router to my Internet provider. The other Linux box (insight) also has - its own IP address and needs to use freedom as its default gateway. The - XT (patience), however, does not have its own Internet IP address and so - I assigned it one on a "private subnet" (as defined by RFC1597). - - To start with, take a simple network with just insight and freedom. - Insight needs to: - - talk to freedom via RFC1201 (arc0) protocol, because I like it - more and it's faster. 
- - use freedom as its Internet gateway. - - That's pretty easy to do. Set up insight like this: - ifconfig arc0 insight - route add insight arc0 - route add freedom arc0 /* I would use the subnet here (like I said - to to in "single protocol" above), - but the rest of the subnet - unfortunately lies across the PPP - link on freedom, which confuses - things. */ - route add default gw freedom - - And freedom gets configured like so: - ifconfig arc0 freedom - route add freedom arc0 - route add insight arc0 - /* and default gateway is configured by pppd */ - - Great, now insight talks to freedom directly on arc0, and sends packets - to the Internet through freedom. If you didn't know how to do the above, - you should probably stop reading this section now because it only gets - worse. - - Now, how do I add patience into the network? It will be using LANMAN - Client, which means I need the arc0e device. It needs to be able to talk - to both insight and freedom, and also use freedom as a gateway to the - Internet. (Recall that patience has a "private IP address" which won't - work on the Internet; that's okay, I configured Linux IP masquerading on - freedom for this subnet). - - So patience (necessarily; I don't have another IP number from my - provider) has an IP address on a different subnet than freedom and - insight, but needs to use freedom as an Internet gateway. Worse, most - DOS networking programs, including LANMAN, have braindead networking - schemes that rely completely on the netmask and a 'default gateway' to - determine how to route packets. This means that to get to freedom or - insight, patience WILL send through its default gateway, regardless of - the fact that both freedom and insight (courtesy of the arc0e device) - could understand a direct transmission. - - I compensate by giving freedom an extra IP address - aliased 'gatekeeper' - - that is on my private subnet, the same subnet that patience is on. 
I - then define gatekeeper to be the default gateway for patience. - - To configure freedom (in addition to the commands above): - ifconfig arc0e gatekeeper - route add gatekeeper arc0e - route add patience arc0e - - This way, freedom will send all packets for patience through arc0e, - giving its IP address as gatekeeper (on the private subnet). When it - talks to insight or the Internet, it will use its "freedom" Internet IP - address. - - You will notice that we haven't configured the arc0e device on insight. - This would work, but is not really necessary, and would require me to - assign insight another special IP number from my private subnet. Since - both insight and patience are using freedom as their default gateway, the - two can already talk to each other. - - It's quite fortunate that I set things up like this the first time (cough - cough) because it's really handy when I boot insight into DOS. There, it - runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet. - In this mode it would be impossible for insight to communicate directly - with patience, since the Novell stack is incompatible with Microsoft's - Ethernet-Encap. Without changing any settings on freedom or patience, I - simply set freedom as the default gateway for insight (now in DOS, - remember) and all the forwarding happens "automagically" between the two - hosts that would normally not be able to communicate at all. - - For those who like diagrams, I have created two "virtual subnets" on the - same physical ARCnet wire. You can picture it like this: - - - [RFC1201 NETWORK] [ETHER-ENCAP NETWORK] - (registered Internet subnet) (RFC1597 private subnet) - - (IP Masquerade) - /---------------\ * /---------------\ - | | * | | - | +-Freedom-*-Gatekeeper-+ | - | | | * | | - \-------+-------/ | * \-------+-------/ - | | | - Insight | Patience - (Internet) - - - -It works: what now? 
-------------------- - -Send mail describing your setup, preferably including driver version, kernel -version, ARCnet card model, CPU type, number of systems on your network, and -list of software in use to me at the following address: - apenwarr@worldvisions.ca - -I do send (sometimes automated) replies to all messages I receive. My email -can be weird (and also usually gets forwarded all over the place along the -way to me), so if you don't get a reply within a reasonable time, please -resend. - - -It doesn't work: what now? --------------------------- - -Do the same as above, but also include the output of the ifconfig and route -commands, as well as any pertinent log entries (ie. anything that starts -with "arcnet:" and has shown up since the last reboot) in your mail. - -If you want to try fixing it yourself (I strongly recommend that you mail me -about the problem first, since it might already have been solved) you may -want to try some of the debug levels available. For heavy testing on -D_DURING or more, it would be a REALLY good idea to kill your klogd daemon -first! D_DURING displays 4-5 lines for each packet sent or received. D_TX, -D_RX, and D_SKB actually DISPLAY each packet as it is sent or received, -which is obviously quite big. - -Starting with v2.40 ALPHA, the autoprobe routines have changed -significantly. In particular, they won't tell you why the card was not -found unless you turn on the D_INIT_REASONS debugging flag. - -Once the driver is running, you can run the arcdump shell script (available -from me or in the full ARCnet package, if you have it) as root to list the -contents of the arcnet buffers at any time. To make any sense at all out of -this, you should grab the pertinent RFCs. (some are listed near the top of -arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the -script. - -Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending. -Ping-pong buffers are implemented both ways. 
- -If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY, -the buffers are cleared to a constant value of 0x42 every time the card is -reset (which should only happen when you do an ifconfig up, or when Linux -decides that the driver is broken). During a transmit, unused parts of the -buffer will be cleared to 0x42 as well. This is to make it easier to figure -out which bytes are being used by a packet. - -You can change the debug level without recompiling the kernel by typing: - ifconfig arc0 down metric 1xxx - /etc/rc.d/rc.inet1 -where "xxx" is the debug level you want. For example, "metric 1015" would put -you at debug level 15. Debug level 7 is currently the default. - -Note that the debug level is (starting with v1.90 ALPHA) a binary -combination of different debug flags; so debug level 7 is really 1+2+4 or -D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this, -resulting in debug level 23. - -If you don't understand that, you probably don't want to know anyway. -E-mail me about your problem. - - -I want to send money: what now? -------------------------------- - -Go take a nap or something. You'll feel better in the morning. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5da18e024fcb..3e0a4bb23ef9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -40,6 +40,7 @@ Contents: 6pack altera_tse arcnet-hardware + arcnet .. only:: subproject and html diff --git a/drivers/net/arcnet/Kconfig b/drivers/net/arcnet/Kconfig index 27551bf3d7e4..43eef60653b2 100644 --- a/drivers/net/arcnet/Kconfig +++ b/drivers/net/arcnet/Kconfig @@ -9,7 +9,7 @@ menuconfig ARCNET ---help--- If you have a network card of this type, say Y and check out the (arguably) beautiful poetry in - . + . You need both this driver, and the driver for the particular ARCnet chipset of your card. If you don't know, then it's probably a @@ -28,7 +28,7 @@ config ARCNET_1201 arc0 device. 
You need to say Y here to communicate with industry-standard RFC1201 implementations, like the arcether.com packet driver or most DOS/Windows ODI drivers. Please read the - ARCnet documentation in + ARCnet documentation in for more information about using arc0. config ARCNET_1051 @@ -42,7 +42,7 @@ config ARCNET_1051 industry-standard RFC1201 implementations, like the arcether.com packet driver or most DOS/Windows ODI drivers. RFC1201 is included automatically as the arc0 device. Please read the ARCnet - documentation in for more + documentation in for more information about using arc0e and arc0s. config ARCNET_RAW -- cgit From ff2269f16a1e1a7f8bbe72920d3d285ba3943572 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:21 +0200 Subject: docs: networking: convert atm.txt to ReST There isn't much to be done here. Just: - add SPDX header; - add a document title. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/atm.rst | 14 ++++++++++++++ Documentation/networking/atm.txt | 8 -------- Documentation/networking/index.rst | 1 + net/atm/Kconfig | 2 +- 4 files changed, 16 insertions(+), 9 deletions(-) create mode 100644 Documentation/networking/atm.rst delete mode 100644 Documentation/networking/atm.txt diff --git a/Documentation/networking/atm.rst b/Documentation/networking/atm.rst new file mode 100644 index 000000000000..c1df8c038525 --- /dev/null +++ b/Documentation/networking/atm.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=== +ATM +=== + +In order to use anything but the most primitive functions of ATM, +several user-mode programs are required to assist the kernel. These +programs and related material can be found via the ATM on Linux Web +page at http://linux-atm.sourceforge.net/ + +If you encounter problems with ATM, please report them on the ATM +on Linux mailing list. 
Subscription information, archives, etc., +can be found on http://linux-atm.sourceforge.net/ diff --git a/Documentation/networking/atm.txt b/Documentation/networking/atm.txt deleted file mode 100644 index 82921cee77fe..000000000000 --- a/Documentation/networking/atm.txt +++ /dev/null @@ -1,8 +0,0 @@ -In order to use anything but the most primitive functions of ATM, -several user-mode programs are required to assist the kernel. These -programs and related material can be found via the ATM on Linux Web -page at http://linux-atm.sourceforge.net/ - -If you encounter problems with ATM, please report them on the ATM -on Linux mailing list. Subscription information, archives, etc., -can be found on http://linux-atm.sourceforge.net/ diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3e0a4bb23ef9..841f3c3905d5 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -41,6 +41,7 @@ Contents: altera_tse arcnet-hardware arcnet + atm .. only:: subproject and html diff --git a/net/atm/Kconfig b/net/atm/Kconfig index 271f682e8438..e61dcc9f85b2 100644 --- a/net/atm/Kconfig +++ b/net/atm/Kconfig @@ -16,7 +16,7 @@ config ATM of your ATM card below. Note that you need a set of user-space programs to actually make use - of ATM. See the file for + of ATM. See the file for further details. config ATM_CLIP -- cgit From 20b943f075574c233de51fa2f0124a97f0298be1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:22 +0200 Subject: docs: networking: convert ax25.txt to ReST There isn't much to be done here. Just: - add SPDX header; - add a document title. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/ax25.rst | 16 ++++++++++++++++ Documentation/networking/ax25.txt | 10 ---------- Documentation/networking/index.rst | 1 + net/ax25/Kconfig | 6 +++--- 4 files changed, 20 insertions(+), 13 deletions(-) create mode 100644 Documentation/networking/ax25.rst delete mode 100644 Documentation/networking/ax25.txt diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst new file mode 100644 index 000000000000..824afd7002db --- /dev/null +++ b/Documentation/networking/ax25.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +AX.25 +===== + +To use the amateur radio protocols within Linux you will need to get a +suitable copy of the AX.25 Utilities. More detailed information about +AX.25, NET/ROM and ROSE, associated programs and and utilities can be +found on http://www.linux-ax25.org. + +There is an active mailing list for discussing Linux amateur radio matters +called linux-hams@vger.kernel.org. To subscribe to it, send a message to +majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body +of the message, the subject field is ignored. You don't need to be +subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.txt deleted file mode 100644 index 8257dbf9be57..000000000000 --- a/Documentation/networking/ax25.txt +++ /dev/null @@ -1,10 +0,0 @@ -To use the amateur radio protocols within Linux you will need to get a -suitable copy of the AX.25 Utilities. More detailed information about -AX.25, NET/ROM and ROSE, associated programs and and utilities can be -found on http://www.linux-ax25.org. - -There is an active mailing list for discussing Linux amateur radio matters -called linux-hams@vger.kernel.org. To subscribe to it, send a message to -majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body -of the message, the subject field is ignored. 
You don't need to be -subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 841f3c3905d5..6a5858b27cf6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -42,6 +42,7 @@ Contents: arcnet-hardware arcnet atm + ax25 .. only:: subproject and html diff --git a/net/ax25/Kconfig b/net/ax25/Kconfig index 043fd5437809..97d686d115c0 100644 --- a/net/ax25/Kconfig +++ b/net/ax25/Kconfig @@ -40,7 +40,7 @@ config AX25 radio as well as information about how to configure an AX.25 port is contained in the AX25-HOWTO, available from . You might also want to - check out the file in the + check out the file in the kernel source. More information about digital amateur radio in general is on the WWW at . @@ -88,7 +88,7 @@ config NETROM users as well as information about how to configure an AX.25 port is contained in the Linux Ham Wiki, available from . You also might want to check out the - file . More information about + file . More information about digital amateur radio in general is on the WWW at . @@ -107,7 +107,7 @@ config ROSE users as well as information about how to configure an AX.25 port is contained in the Linux Ham Wiki, available from . You also might want to check out the - file . More information about + file . More information about digital amateur radio in general is on the WWW at . -- cgit From b5fcf32d7d4b647c0f3aa612d91d25996a49bcd9 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:23 +0200 Subject: docs: networking: convert baycom.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/baycom.rst | 174 ++++++++++++++++++++++++++++++++++++ Documentation/networking/baycom.txt | 158 -------------------------------- Documentation/networking/index.rst | 1 + drivers/net/hamradio/Kconfig | 8 +- 4 files changed, 179 insertions(+), 162 deletions(-) create mode 100644 Documentation/networking/baycom.rst delete mode 100644 Documentation/networking/baycom.txt diff --git a/Documentation/networking/baycom.rst b/Documentation/networking/baycom.rst new file mode 100644 index 000000000000..fe2d010f0e86 --- /dev/null +++ b/Documentation/networking/baycom.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Linux Drivers for Baycom Modems +=============================== + +Thomas M. Sailer, HB9JNX/AE4WA, + +The drivers for the baycom modems have been split into +separate drivers as they did not share any code, and the driver +and device names have changed. + +This document describes the Linux Kernel Drivers for simple Baycom style +amateur radio modems. + +The following drivers are available: +==================================== + +baycom_ser_fdx: + This driver supports the SER12 modems either full or half duplex. + Its baud rate may be changed via the ``baud`` module parameter, + therefore it supports just about every bit bang modem on a + serial port. Its devices are called bcsf0 through bcsf3. + This is the recommended driver for SER12 type modems, + however if you have a broken UART clone that does not have working + delta status bits, you may try baycom_ser_hdx. + +baycom_ser_hdx: + This is an alternative driver for SER12 type modems. + It only supports half duplex, and only 1200 baud. Its devices + are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx + does not work with your UART. + +baycom_par: + This driver supports the par96 and picpar modems. + Its devices are called bcp0 through bcp3. + +baycom_epp: + This driver supports the EPP modem. 
+ Its devices are called bce0 through bce3. + This driver is work-in-progress. + +The following modems are supported: + +======= ======================================================================== +ser12 This is a very simple 1200 baud AFSK modem. The modem consists only + of a modulator/demodulator chip, usually a TI TCM3105. The computer + is responsible for regenerating the receiver bit clock, as well as + for handling the HDLC protocol. The modem connects to a serial port, + hence the name. Since the serial port is not used as an async serial + port, the kernel driver for serial ports cannot be used, and this + driver only supports standard serial hardware (8250, 16450, 16550) + +par96 This is a modem for 9600 baud FSK compatible to the G3RUH standard. + The modem does all the filtering and regenerates the receiver clock. + Data is transferred from and to the PC via a shift register. + The shift register is filled with 16 bits and an interrupt is signalled. + The PC then empties the shift register in a burst. This modem connects + to the parallel port, hence the name. The modem leaves the + implementation of the HDLC protocol and the scrambler polynomial to + the PC. + +picpar This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem + is protocol compatible to par96, but uses only three low power ICs + and can therefore be fed from the parallel port and does not require + an additional power supply. Furthermore, it incorporates a carrier + detect circuitry. + +EPP This is a high-speed modem adaptor that connects to an enhanced parallel + port. + + Its target audience is users working over a high speed hub (76.8kbit/s). + +eppfpga This is a redesign of the EPP adaptor. +======= ======================================================================== + +All of the above modems only support half duplex communications. However, +the driver supports the KISS (see below) fullduplex command. 
It then simply +starts to send as soon as there's a packet to transmit and does not care +about DCD, i.e. it starts to send even if there's someone else on the channel. +This command is required by some implementations of the DAMA channel +access protocol. + + +The Interface of the drivers +============================ + +Unlike previous drivers, these drivers are no longer character devices, +but they are now true kernel network interfaces. Installation is therefore +simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available. +sethdlc from the ax25 utilities may be used to set driver states etc. +Users of userland AX.25 stacks may use the net2kiss utility (also available +in the ax25 utilities package) to convert packets of a network interface +to a KISS stream on a pseudo tty. There's also a patch available from +me for WAMPES which allows attaching a kernel network interface directly. + + +Configuring the driver +====================== + +Every time a driver is inserted into the kernel, it has to know which +modems it should access at which ports. This can be done with the setbaycom +utility. If you are only using one modem, you can also configure the +driver from the insmod command line (or by means of an option line in +``/etc/modprobe.d/*.conf``). + +Examples:: + + modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 + sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4 + +Both lines configure the first port to drive a ser12 modem at the first +serial port (COM1 under DOS). The * in the mode parameter instructs the driver +to use the software DCD algorithm (see below):: + + insmod baycom_par mode="picpar" iobase=0x378 + sethdlc -i bcp0 -p mode "picpar" io 0x378 + +Both lines configure the first port to drive a picpar modem at the +first parallel port (LPT1 under DOS). (Note: picpar implies +hardware DCD, par96 implies software DCD). + +The channel access parameters can be set with sethdlc -a or kissparms. 
+Note that both utilities interpret the values slightly differently. + + +Hardware DCD versus Software DCD +================================ + +To avoid collisions on the air, the driver must know when the channel is +busy. This is the task of the DCD circuitry/software. The driver may either +utilise a software DCD algorithm (options=1) or use a DCD signal from +the hardware (options=0). + +======= ================================================================= +ser12 if software DCD is utilised, the radio's squelch should always be + open. It is highly recommended to use the software DCD algorithm, + as it is much faster than most hardware squelch circuitry. The + disadvantage is a slightly higher load on the system. + +par96 the software DCD algorithm for this type of modem is rather poor. + The modem simply does not provide enough information to implement + a reasonable DCD algorithm in software. Therefore, if your radio + feeds the DCD input of the PAR96 modem, the use of the hardware + DCD circuitry is recommended. + +picpar the picpar modem features a builtin DCD hardware, which is highly + recommended. +======= ================================================================= + + + +Compatibility with the rest of the Linux kernel +=============================================== + +The serial driver and the baycom serial drivers compete +for the same hardware resources. Of course only one driver can access a given +interface at a time. The serial driver grabs all interfaces it can find at +startup time. Therefore the baycom drivers subsequently won't be able to +access a serial port. You might therefore find it necessary to release +a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where +# is the number of the interface. The baycom drivers do not reserve any +ports at startup, unless one is specified on the 'insmod' command line. 
Another +method to solve the problem is to compile all drivers as modules and +leave it to kmod to load the correct driver depending on the application. + +The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem +to arbitrate the ports between different client drivers. + +vy 73s de + +Tom Sailer, sailer@ife.ee.ethz.ch + +hb9jnx @ hb9w.ampr.org diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt deleted file mode 100644 index 688f18fd4467..000000000000 --- a/Documentation/networking/baycom.txt +++ /dev/null @@ -1,158 +0,0 @@ - LINUX DRIVERS FOR BAYCOM MODEMS - - Thomas M. Sailer, HB9JNX/AE4WA, - -!!NEW!! (04/98) The drivers for the baycom modems have been split into -separate drivers as they did not share any code, and the driver -and device names have changed. - -This document describes the Linux Kernel Drivers for simple Baycom style -amateur radio modems. - -The following drivers are available: - -baycom_ser_fdx: - This driver supports the SER12 modems either full or half duplex. - Its baud rate may be changed via the `baud' module parameter, - therefore it supports just about every bit bang modem on a - serial port. Its devices are called bcsf0 through bcsf3. - This is the recommended driver for SER12 type modems, - however if you have a broken UART clone that does not have working - delta status bits, you may try baycom_ser_hdx. - -baycom_ser_hdx: - This is an alternative driver for SER12 type modems. - It only supports half duplex, and only 1200 baud. Its devices - are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx - does not work with your UART. - -baycom_par: - This driver supports the par96 and picpar modems. - Its devices are called bcp0 through bcp3. - -baycom_epp: - This driver supports the EPP modem. - Its devices are called bce0 through bce3. - This driver is work-in-progress. - -The following modems are supported: - -ser12: This is a very simple 1200 baud AFSK modem. 
The modem consists only - of a modulator/demodulator chip, usually a TI TCM3105. The computer - is responsible for regenerating the receiver bit clock, as well as - for handling the HDLC protocol. The modem connects to a serial port, - hence the name. Since the serial port is not used as an async serial - port, the kernel driver for serial ports cannot be used, and this - driver only supports standard serial hardware (8250, 16450, 16550) - -par96: This is a modem for 9600 baud FSK compatible to the G3RUH standard. - The modem does all the filtering and regenerates the receiver clock. - Data is transferred from and to the PC via a shift register. - The shift register is filled with 16 bits and an interrupt is signalled. - The PC then empties the shift register in a burst. This modem connects - to the parallel port, hence the name. The modem leaves the - implementation of the HDLC protocol and the scrambler polynomial to - the PC. - -picpar: This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem - is protocol compatible to par96, but uses only three low power ICs - and can therefore be fed from the parallel port and does not require - an additional power supply. Furthermore, it incorporates a carrier - detect circuitry. - -EPP: This is a high-speed modem adaptor that connects to an enhanced parallel port. - Its target audience is users working over a high speed hub (76.8kbit/s). - -eppfpga: This is a redesign of the EPP adaptor. - - - -All of the above modems only support half duplex communications. However, -the driver supports the KISS (see below) fullduplex command. It then simply -starts to send as soon as there's a packet to transmit and does not care -about DCD, i.e. it starts to send even if there's someone else on the channel. -This command is required by some implementations of the DAMA channel -access protocol. 
- - -The Interface of the drivers - -Unlike previous drivers, these drivers are no longer character devices, -but they are now true kernel network interfaces. Installation is therefore -simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available. -sethdlc from the ax25 utilities may be used to set driver states etc. -Users of userland AX.25 stacks may use the net2kiss utility (also available -in the ax25 utilities package) to convert packets of a network interface -to a KISS stream on a pseudo tty. There's also a patch available from -me for WAMPES which allows attaching a kernel network interface directly. - - -Configuring the driver - -Every time a driver is inserted into the kernel, it has to know which -modems it should access at which ports. This can be done with the setbaycom -utility. If you are only using one modem, you can also configure the -driver from the insmod command line (or by means of an option line in -/etc/modprobe.d/*.conf). - -Examples: - modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 - sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4 - -Both lines configure the first port to drive a ser12 modem at the first -serial port (COM1 under DOS). The * in the mode parameter instructs the driver to use -the software DCD algorithm (see below). - - insmod baycom_par mode="picpar" iobase=0x378 - sethdlc -i bcp0 -p mode "picpar" io 0x378 - -Both lines configure the first port to drive a picpar modem at the -first parallel port (LPT1 under DOS). (Note: picpar implies -hardware DCD, par96 implies software DCD). - -The channel access parameters can be set with sethdlc -a or kissparms. -Note that both utilities interpret the values slightly differently. - - -Hardware DCD versus Software DCD - -To avoid collisions on the air, the driver must know when the channel is -busy. This is the task of the DCD circuitry/software. 
The driver may either -utilise a software DCD algorithm (options=1) or use a DCD signal from -the hardware (options=0). - -ser12: if software DCD is utilised, the radio's squelch should always be - open. It is highly recommended to use the software DCD algorithm, - as it is much faster than most hardware squelch circuitry. The - disadvantage is a slightly higher load on the system. - -par96: the software DCD algorithm for this type of modem is rather poor. - The modem simply does not provide enough information to implement - a reasonable DCD algorithm in software. Therefore, if your radio - feeds the DCD input of the PAR96 modem, the use of the hardware - DCD circuitry is recommended. - -picpar: the picpar modem features a builtin DCD hardware, which is highly - recommended. - - - -Compatibility with the rest of the Linux kernel - -The serial driver and the baycom serial drivers compete -for the same hardware resources. Of course only one driver can access a given -interface at a time. The serial driver grabs all interfaces it can find at -startup time. Therefore the baycom drivers subsequently won't be able to -access a serial port. You might therefore find it necessary to release -a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where -# is the number of the interface. The baycom drivers do not reserve any -ports at startup, unless one is specified on the 'insmod' command line. Another -method to solve the problem is to compile all drivers as modules and -leave it to kmod to load the correct driver depending on the application. - -The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem -to arbitrate the ports between different client drivers. 
- -vy 73s de -Tom Sailer, sailer@ife.ee.ethz.ch -hb9jnx @ hb9w.ampr.org diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 6a5858b27cf6..fbf845fbaff7 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -43,6 +43,7 @@ Contents: arcnet atm ax25 + baycom .. only:: subproject and html diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index bf306fed04cc..fe409819b56d 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -127,7 +127,7 @@ config BAYCOM_SER_FDX your serial interface chip. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and - . + . To compile this driver as a module, choose M here: the module will be called baycom_ser_fdx. This is recommended. @@ -145,7 +145,7 @@ config BAYCOM_SER_HDX the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and - . + . To compile this driver as a module, choose M here: the module will be called baycom_ser_hdx. This is recommended. @@ -160,7 +160,7 @@ config BAYCOM_PAR par96 designs. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and the file - . + . To compile this driver as a module, choose M here: the module will be called baycom_par. This is recommended. @@ -175,7 +175,7 @@ config BAYCOM_EPP designs. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and the file - . + . To compile this driver as a module, choose M here: the module will be called baycom_epp. This is recommended. 
-- cgit From a362032eca22d03071c4613f6ca503be982bf375 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:24 +0200 Subject: docs: networking: convert bonding.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - mark tables as such; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/bonding.rst | 2890 ++++++++++++++++++++ Documentation/networking/bonding.txt | 2837 ------------------- .../networking/device_drivers/intel/e100.rst | 2 +- .../networking/device_drivers/intel/ixgb.rst | 2 +- Documentation/networking/index.rst | 1 + drivers/net/Kconfig | 2 +- 6 files changed, 2894 insertions(+), 2840 deletions(-) create mode 100644 Documentation/networking/bonding.rst delete mode 100644 Documentation/networking/bonding.txt diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst new file mode 100644 index 000000000000..dd49f95d28d3 --- /dev/null +++ b/Documentation/networking/bonding.rst @@ -0,0 +1,2890 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Linux Ethernet Bonding Driver HOWTO +=================================== + +Latest update: 27 April 2011 + +Initial release: Thomas Davis + +Corrections, HA extensions: 2000/10/03-15: + + - Willy Tarreau + - Constantine Gavrilov + - Chad N. Tindel + - Janice Girouard + - Jay Vosburgh + +Reorganized and updated Feb 2005 by Jay Vosburgh +Added Sysfs information: 2006/04/24 + + - Mitch Williams + +Introduction +============ + +The Linux bonding driver provides a method for aggregating +multiple network interfaces into a single logical "bonded" interface. 
+The behavior of the bonded interfaces depends upon the mode; generally +speaking, modes provide either hot standby or load balancing services. +Additionally, link integrity monitoring may be performed. + +The bonding driver originally came from Donald Becker's +beowulf patches for kernel 2.0. It has changed quite a bit since, and +the original tools from extreme-linux and beowulf sites will not work +with this version of the driver. + +For new versions of the driver, updated userspace tools, and +who to ask for help, please follow the links at the end of this file. + +.. Table of Contents + + 1. Bonding Driver Installation + + 2. Bonding Driver Options + + 3. Configuring Bonding Devices + 3.1 Configuration with Sysconfig Support + 3.1.1 Using DHCP with Sysconfig + 3.1.2 Configuring Multiple Bonds with Sysconfig + 3.2 Configuration with Initscripts Support + 3.2.1 Using DHCP with Initscripts + 3.2.2 Configuring Multiple Bonds with Initscripts + 3.3 Configuring Bonding Manually with Ifenslave + 3.3.1 Configuring Multiple Bonds Manually + 3.4 Configuring Bonding Manually via Sysfs + 3.5 Configuration with Interfaces Support + 3.6 Overriding Configuration for Special Cases + 3.7 Configuring LACP for 802.3ad mode in a more secure way + + 4. Querying Bonding Configuration + 4.1 Bonding Configuration + 4.2 Network Configuration + + 5. Switch Configuration + + 6. 802.1q VLAN Support + + 7. Link Monitoring + 7.1 ARP Monitor Operation + 7.2 Configuring Multiple ARP Targets + 7.3 MII Monitor Operation + + 8. Potential Trouble Sources + 8.1 Adventures in Routing + 8.2 Ethernet Device Renaming + 8.3 Painfully Slow Or No Failed Link Detection By Miimon + + 9. SNMP agents + + 10. Promiscuous mode + + 11. Configuring Bonding for High Availability + 11.1 High Availability in a Single Switch Topology + 11.2 High Availability in a Multiple Switch Topology + 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology + 11.2.2 HA Link Monitoring for Multiple Switch Topology + + 12. 
Configuring Bonding for Maximum Throughput + 12.1 Maximum Throughput in a Single Switch Topology + 12.1.1 MT Bonding Mode Selection for Single Switch Topology + 12.1.2 MT Link Monitoring for Single Switch Topology + 12.2 Maximum Throughput in a Multiple Switch Topology + 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology + 12.2.2 MT Link Monitoring for Multiple Switch Topology + + 13. Switch Behavior Issues + 13.1 Link Establishment and Failover Delays + 13.2 Duplicated Incoming Packets + + 14. Hardware Specific Considerations + 14.1 IBM BladeCenter + + 15. Frequently Asked Questions + + 16. Resources and Links + + +1. Bonding Driver Installation +============================== + +Most popular distro kernels ship with the bonding driver +already available as a module. If your distro does not, or you +have need to compile bonding from source (e.g., configuring and +installing a mainline kernel from kernel.org), you'll need to perform +the following steps: + +1.1 Configure and build the kernel with bonding +----------------------------------------------- + +The current version of the bonding driver is available in the +drivers/net/bonding subdirectory of the most recent kernel source +(which is available on http://kernel.org). Most users "rolling their +own" will want to use the most recent kernel from kernel.org. + +Configure kernel with "make menuconfig" (or "make xconfig" or +"make config"), then select "Bonding driver support" in the "Network +device support" section. It is recommended that you configure the +driver as module since it is currently the only way to pass parameters +to the driver or configure more than one bonding device. + +Build and install the new kernel and modules. + +1.2 Bonding Control Utility +--------------------------- + +It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. + +2. 
Bonding Driver Options +========================= + +Options for the bonding driver are supplied as parameters to the +bonding module at load time, or are specified via sysfs. + +Module options may be given as command line arguments to the +insmod or modprobe command, but are usually specified in either the +``/etc/modprobe.d/*.conf`` configuration files, or in a distro-specific +configuration file (some of which are detailed in the next section). + +Details on bonding support for sysfs is provided in the +"Configuring Bonding Manually via Sysfs" section, below. + +The available bonding driver parameters are listed below. If a +parameter is not specified the default value is used. When initially +configuring a bond, it is recommended "tail -f /var/log/messages" be +run in a separate window to watch for bonding driver error messages. + +It is critical that either the miimon or arp_interval and +arp_ip_target parameters be specified, otherwise serious network +degradation will occur during link failures. Very few devices do not +support at least miimon, so there is really no reason not to use it. + +Options with textual values will accept either the text name +or, for backwards compatibility, the option value. E.g., +"mode=802.3ad" and "mode=4" set the same mode. + +The parameters are as follows: + +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. 
+ + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + +ad_actor_sys_prio + + In an AD system, this specifies the system priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 65535 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_actor_system + + In an AD system, this specifies the mac-address for the actor in + protocol packet exchanges (LACPDUs). The value cannot be NULL or + multicast. It is preferred to have the local-admin bit set for this + mac but driver does not enforce it. If the value is not given then + system defaults to using the masters' mac address as actors' system + address. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_select + + Specifies the 802.3ad aggregation selection logic to use. The + possible values and their effects are: + + stable or 0 + + The active aggregator is chosen by largest aggregate + bandwidth. + + Reselection of the active aggregator occurs only when all + slaves of the active aggregator are down or the active + aggregator has no slaves. + + This is the default value. + + bandwidth or 1 + + The active aggregator is chosen by largest aggregate + bandwidth. Reselection occurs if: + + - A slave is added to or removed from the bond + + - Any slave's link state changes + + - Any slave's 802.3ad association state changes + + - The bond's administrative state changes to up + + count or 2 + + The active aggregator is chosen by the largest number of + ports (slaves). Reselection occurs as described under the + "bandwidth" setting, above. + + The bandwidth and count selection policies permit failover of + 802.3ad aggregations when partial failure of the active aggregator + occurs. 
This keeps the aggregator with the highest availability + (either in bandwidth or in number of ports) active at all times. + + This option was added in bonding version 3.4.0. + +ad_user_port_key + + In an AD system, the port-key has three parts as shown below - + + ===== ============ + Bits Use + ===== ============ + 00 Duplex + 01-05 Speed + 06-15 User-defined + ===== ============ + + This defines the upper 10 bits of the port key. The values can be + from 0 - 1023. If not given, the system defaults to 0. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +all_slaves_active + + Specifies that duplicate frames (received on inactive ports) should be + dropped (0) or delivered (1). + + Normally, bonding will drop duplicate frames (received on inactive + ports), which is desirable for most users. But there are some times + it is nice to allow duplicate frames to be delivered. + + The default value is 0 (drop duplicate frames received on inactive + ports). + +arp_interval + + Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + + If ARP monitoring is used in an etherchannel compatible mode + (modes 0 and 2), the switch should be configured in a mode + that evenly distributes packets across all links. If the + switch is configured to distribute the packets in an XOR + fashion, all replies from the ARP targets will be received on + the same link which could cause the other team members to + fail. ARP monitoring should not be used in conjunction with + miimon. A value of 0 disables ARP monitoring. 
The default + value is 0. + +arp_ip_target + + Specifies the IP addresses to use as ARP monitoring peers when + arp_interval is > 0. These are the targets of the ARP request + sent to determine the health of the link to the targets. + Specify these values in ddd.ddd.ddd.ddd format. Multiple IP + addresses must be separated by a comma. At least one IP + address must be given for ARP monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IP addresses. + +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. + + Possible values are: + + none or 0 + + No validation or filtering is performed. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. 
It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. + + This option was added in bonding version 3.1.0. + +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. 
+ This option affects only active-backup mode for slaves with + arp_validate enabled. + + Possible values are: + + any or 0 + + Consider the slave up when any of the arp_ip_targets + is reachable. + + all or 1 + + Consider the slave up only when all of the arp_ip_targets + are reachable. + +downdelay + + Specifies the time, in milliseconds, to wait before disabling + a slave after a link failure has been detected. This option + is only valid for the miimon link monitor. The downdelay + value should be a multiple of the miimon value; if not, it + will be rounded down to the nearest multiple. The default + value is 0. + +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address at enslavement (the traditional + behavior), or, when enabled, perform special handling of the + bond's MAC address in accordance with the selected policy. + + Possible values are: + + none or 0 + + This setting disables fail_over_mac, and causes + bonding to set all slaves of an active-backup bond to + the same MAC address at enslavement time. This is the + default. + + active or 1 + + The "active" fail_over_mac policy indicates that the + MAC address of the bond should always be the MAC + address of the currently active slave. The MAC + address of the slaves is not changed; instead, the MAC + address of the bond changes during a failover. + + This policy is useful for devices that cannot ever + alter their MAC address, or for devices that refuse + incoming broadcasts with their own source MAC (which + interferes with the ARP monitor). + + The down side of this policy is that every device on + the network must be updated via gratuitous ARP, + rather than just updating a switch or set of switches (which + often takes place for any traffic, not just ARP + traffic, if the switch snoops incoming traffic to + update its tables) for the traditional method. If the + gratuitous ARP is lost, communication may be + disrupted. 
+ + When this policy is used in conjunction with the mii + monitor, devices which assert link up prior to being + able to actually transmit and receive are particularly + susceptible to loss of the gratuitous ARP, and an + appropriate updelay setting may be required. + + follow or 2 + + The "follow" fail_over_mac policy causes the MAC + address of the bond to be selected normally (normally + the MAC address of the first slave added to the bond). + However, the second and subsequent slaves are not set + to this MAC address while they are in a backup role; a + slave is programmed with the bond's MAC address at + failover time (and the formerly active slave receives + the newly active slave's MAC address). + + This policy is useful for multiport devices that + either become confused or incur a performance penalty + when multiple ports are programmed with the same MAC + address. + + + The default policy is none, unless the first slave cannot + change its MAC address, in which case the active policy is + selected by default. + + This option may be modified via sysfs only when no slaves are + present in the bond. + + This option was added in bonding version 3.2.0. The "follow" + policy was added in bonding version 3.3.0. + +lacp_rate + + Option specifying the rate in which we'll ask our link partner + to transmit LACPDU packets in 802.3ad mode. Possible values + are: + + slow or 0 + Request partner to transmit LACPDUs every 30 seconds + + fast or 1 + Request partner to transmit LACPDUs every 1 second + + The default is slow. + +max_bonds + + Specifies the number of bonding devices to create for this + instance of the bonding driver. E.g., if max_bonds is 3, and + the bonding driver is not already loaded, then bond0, bond1 + and bond2 will be created. The default value is 1. Specifying + a value of 0 will load bonding, but will not create any devices. + +miimon + + Specifies the MII link monitoring frequency in milliseconds. 
+ This determines how often the link state of each slave is + inspected for link failures. A value of zero disables MII + link monitoring. A value of 100 is a good starting point. + The use_carrier option, below, affects how the link state is + determined. See the High Availability section for additional + information. The default value is 0. + +min_links + + Specifies the minimum number of links that must be active before + asserting carrier. It is similar to the Cisco EtherChannel min-links + feature. This allows setting the minimum number of member ports that + must be up (link-up state) before marking the bond device as up + (carrier on). This is useful for situations where higher level services + such as clustering want to ensure a minimum number of low bandwidth + links are active before switchover. This option only affect 802.3ad + mode. + + The default value is 0. This will cause carrier to be asserted (for + 802.3ad mode) whenever there is an active aggregator, regardless of the + number of available links in that aggregator. Note that, because an + aggregator cannot be active without at least one available link, + setting this option to 0 or to 1 has the exact same effect. + +mode + + Specifies one of the bonding policies. The default is + balance-rr (round robin). Possible values are: + + balance-rr or 0 + + Round-robin policy: Transmit packets in sequential + order from the first available slave through the + last. This mode provides load balancing and fault + tolerance. + + active-backup or 1 + + Active-backup policy: Only one slave in the bond is + active. A different slave becomes active if, and only + if, the active slave fails. The bond's MAC address is + externally visible on only one port (network adapter) + to avoid confusing the switch. + + In bonding version 2.6.2 or later, when a failover + occurs in active-backup mode, bonding will issue one + or more gratuitous ARPs on the newly active slave. 
+ One gratuitous ARP is issued for the bonding master + interface and each VLAN interfaces configured above + it, provided that the interface has at least one IP + address configured. Gratuitous ARPs issued for VLAN + interfaces are tagged with the appropriate VLAN id. + + This mode provides fault tolerance. The primary + option, documented below, affects the behavior of this + mode. + + balance-xor or 2 + + XOR policy: Transmit based on the selected transmit + hash policy. The default policy is a simple [(source + MAC address XOR'd with destination MAC address XOR + packet type ID) modulo slave count]. Alternate transmit + policies may be selected via the xmit_hash_policy option, + described below. + + This mode provides load balancing and fault tolerance. + + broadcast or 3 + + Broadcast policy: transmits everything on all slave + interfaces. This mode provides fault tolerance. + + 802.3ad or 4 + + IEEE 802.3ad Dynamic link aggregation. Creates + aggregation groups that share the same speed and + duplex settings. Utilizes all slaves in the active + aggregator according to the 802.3ad specification. + + Slave selection for outgoing traffic is done according + to the transmit hash policy, which may be changed from + the default simple XOR policy via the xmit_hash_policy + option, documented below. Note that not all transmit + policies may be 802.3ad compliant, particularly in + regards to the packet mis-ordering requirements of + section 43.2.4 of the 802.3ad standard. Differing + peer implementations will have varying tolerances for + noncompliance. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed and duplex of each slave. + + 2. A switch that supports IEEE 802.3ad Dynamic link + aggregation. + + Most switches will require some type of configuration + to enable 802.3ad mode. + + balance-tlb or 5 + + Adaptive transmit load balancing: channel bonding that + does not require any special switch support. 
+ + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. + + Prerequisite: + + Ethtool support in the base drivers for retrieving the + speed of each slave. + + balance-alb or 6 + + Adaptive load balancing: includes balance-tlb plus + receive load balancing (rlb) for IPV4 traffic, and + does not require any special switch support. The + receive load balancing is achieved by ARP negotiation. + The bonding driver intercepts the ARP Replies sent by + the local system on their way out and overwrites the + source hardware address with the unique hardware + address of one of the slaves in the bond such that + different peers use different hardware addresses for + the server. + + Receive traffic from connections created by the server + is also balanced. When the local system sends an ARP + Request the bonding driver copies and saves the peer's + IP information from the ARP packet. When the ARP + Reply arrives from the peer, its hardware address is + retrieved and the bonding driver initiates an ARP + reply to this peer assigning it to one of the slaves + in the bond. A problematic outcome of using ARP + negotiation for balancing is that each time that an + ARP request is broadcast it uses the hardware address + of the bond. Hence, peers learn the hardware address + of the bond and the balancing of receive traffic + collapses to the current slave. This is handled by + sending updates (ARP Replies) to all the peers with + their individually assigned hardware address such that + the traffic is redistributed. 
Receive traffic is also + redistributed when a new slave is added to the bond + and when an inactive slave is re-activated. The + receive load is distributed sequentially (round robin) + among the group of highest speed slaves in the bond. + + When a link is reconnected or a new slave joins the + bond the receive traffic is redistributed among all + active slaves in the bond by initiating ARP Replies + with the selected MAC address to each of the + clients. The updelay parameter (detailed below) must + be set to a value equal or greater than the switch's + forwarding delay so that the ARP Replies sent to the + peers will not be blocked by the switch. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed of each slave. + + 2. Base driver support for setting the hardware + address of a device while it is open. This is + required so that there will always be one slave in the + team using the bond hardware address (the + curr_active_slave) while having a unique hardware + address for each slave in the bond. If the + curr_active_slave fails its hardware address is + swapped with the new curr_active_slave that was + chosen. + +num_grat_arp, +num_unsol_na + + Specify the number of peer notifications (gratuitous ARPs and + unsolicited IPv6 Neighbor Advertisements) to be issued after a + failover event. As soon as the link is up on the new slave + (possibly immediately) a peer notification is sent on the + bonding device and each VLAN sub-device. This is repeated at + the rate specified by peer_notif_delay if the number is + greater than 1. + + The valid range is 0 - 255; the default value is 1. These options + affect only the active-backup mode. These options were added for + bonding versions 3.3.0 and 3.4.0 respectively. + + From Linux 3.0 and bonding version 3.7.1, these notifications + are generated by the ipv4 and ipv6 code and the numbers of + repetitions cannot be set independently. 
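The repetition rule for peer notifications described above can be sketched in a few lines of Python. This is only an illustration of the documented parameters, not the driver's code: the function name and the idea of expressing the schedule as millisecond offsets from link-up are assumptions made for the example, and the first notification is assumed to go out at offset 0 ("possibly immediately"). A peer_notif_delay of 0 is treated as "use the link monitor interval", as the peer_notif_delay entry states.

```python
def peer_notification_offsets(count, peer_notif_delay_ms, monitor_interval_ms):
    """Return assumed send offsets (ms after link-up) for peer notifications.

    count               -- num_grat_arp / num_unsol_na value (0 - 255)
    peer_notif_delay_ms -- peer_notif_delay; 0 means "match the monitor interval"
    monitor_interval_ms -- the active link monitor interval (miimon or arp_interval)
    """
    if not 0 <= count <= 255:
        raise ValueError("count must be in the range 0 - 255")

    # A delay of 0 falls back to the link monitor interval.
    delay = peer_notif_delay_ms if peer_notif_delay_ms > 0 else monitor_interval_ms

    # Hypothetical model: first notification at offset 0, then one per delay.
    return [i * delay for i in range(count)]
```

With count set to 0 the list is empty, which corresponds to no notifications being issued after a failover.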
+ +packets_per_slave + + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. + + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. + +peer_notif_delay + + Specify the delay, in milliseconds, between each peer + notification (gratuitous ARP and unsolicited IPv6 Neighbor + Advertisement) when they are issued after a failover event. + This delay should be a multiple of the link monitor interval + (arp_interval or miimon, whichever is active). The default + value is 0 which means to match the value of the link monitor + interval. + +primary + + A string (eth0, eth2, etc) specifying which slave is the + primary device. The specified device will always be the + active slave while it is available. Only when the primary is + off-line will alternate devices be used. This is useful when + one slave is preferred over another, e.g., when one slave has + higher throughput than another. + + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. + +primary_reselect + + Specifies the reselection policy for the primary slave. This + affects how the primary slave is chosen to become the active slave + when failure of the active slave or recovery of the primary slave + occurs. This option is designed to prevent flip-flopping between + the primary slave and other slaves. Possible values are: + + always or 0 (default) + + The primary slave becomes the active slave whenever it + comes back up. + + better or 1 + + The primary slave becomes the active slave when it comes + back up, if the speed and duplex of the primary slave is + better than the speed and duplex of the current active + slave. + + failure or 2 + + The primary slave becomes the active slave only if the + current active slave fails and the primary slave is up. 
+ + The primary_reselect setting is ignored in two cases: + + If no slaves are active, the first slave to recover is + made the active slave. + + When initially enslaved, the primary slave is always made + the active slave. + + Changing the primary_reselect policy via sysfs will cause an + immediate selection of the best active slave according to the new + policy. This may or may not result in a change of the active + slave, depending upon the circumstances. + + This option was added for bonding version 3.6.0. + +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + +updelay + + Specifies the time, in milliseconds, to wait before enabling a + slave after a link recovery has been detected. This option is + only valid for the miimon link monitor. The updelay value + should be a multiple of the miimon value; if not, it will be + rounded down to the nearest multiple. The default value is 0. + +use_carrier + + Specifies whether or not miimon should use MII or ETHTOOL + ioctls vs. netif_carrier_ok() to determine the link + status. The MII or ETHTOOL ioctls are less efficient and + utilize a deprecated calling sequence within the kernel. 
The + netif_carrier_ok() relies on the device driver to maintain its + state with netif_carrier_on/off; at this writing, most, but + not all, device drivers support this facility. + + If bonding insists that the link is up when it should not be, + it may be that your network device driver does not support + netif_carrier_on/off. The default state for netif_carrier is + "carrier on," so if a driver does not support netif_carrier, + it will appear as if the link is always up. In this case, + setting use_carrier to 0 will cause bonding to revert to the + MII / ETHTOOL ioctl method to determine the link state. + + A value of 1 enables the use of netif_carrier_ok(), a value of + 0 will use the deprecated MII / ETHTOOL ioctls. The default + value is 1. + +xmit_hash_policy + + Selects the transmit hash policy to use for slave selection in + balance-xor, 802.3ad, and tlb modes. Possible values are: + + layer2 + + Uses XOR of hardware MAC addresses and packet type ID + field to generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + slave number = hash modulo slave count + + This algorithm will place all traffic to a particular + network peer on the same slave. + + This algorithm is 802.3ad compliant. + + layer2+3 + + This policy uses a combination of layer2 and layer3 + protocol information to generate the hash. + + Uses XOR of hardware MAC addresses and IP addresses to + generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + This algorithm will place all traffic to a particular + network peer on the same slave. For non-IP traffic, + the formula is the same as for the layer2 transmit + hash policy. 
+ + This policy is intended to provide a more balanced + distribution of traffic than layer2 alone, especially + in environments where a layer3 gateway device is + required to reach most destinations. + + This algorithm is 802.3ad compliant. + + layer3+4 + + This policy uses upper layer protocol information, + when available, to generate the hash. This allows for + traffic to a particular network peer to span multiple + slaves, although a single connection will not span + multiple slaves. + + The formula for unfragmented TCP and UDP packets is + + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + For fragmented TCP or UDP packets and all other IPv4 and + IPv6 protocol traffic, the source and destination port + information is omitted. For non-IP traffic, the + formula is the same as for the layer2 transmit hash + policy. + + This algorithm is not fully 802.3ad compliant. A + single TCP or UDP conversation containing both + fragmented and unfragmented packets will see packets + striped across two interfaces. This may result in out + of order delivery. Most traffic types will not meet + this criteria, as TCP rarely fragments traffic, and + most UDP traffic is not involved in extended + conversations. Other implementations of 802.3ad may + or may not tolerate this noncompliance. + + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. 
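The layer2, layer2+3 and layer3+4 formulas quoted above can be tried out with a short Python sketch. It follows the formulas as written in this document and is not the kernel implementation: the helper names are hypothetical, MAC addresses are assumed to be taken as whole 48-bit integers, only IPv4 is handled, and the "source port, destination port (as in the header)" word is assumed to be packed as (source << 16) | destination. The kernel may fold these fields differently.

```python
import ipaddress


def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer (assumption: full 48 bits)."""
    return int(mac.replace(":", ""), 16)


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to a 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def fold(hash_value):
    """Apply the two RSHIFT/XOR folding steps from the formulas."""
    hash_value ^= hash_value >> 16
    hash_value ^= hash_value >> 8
    return hash_value


def layer2_hash(src_mac, dst_mac, packet_type):
    """hash = source MAC XOR destination MAC XOR packet type ID"""
    return mac_to_int(src_mac) ^ mac_to_int(dst_mac) ^ packet_type


def layer23_hash(src_mac, dst_mac, packet_type, src_ip, dst_ip):
    """layer2 hash, mixed with both IP addresses, then folded."""
    hash_value = layer2_hash(src_mac, dst_mac, packet_type)
    hash_value ^= ip_to_int(src_ip) ^ ip_to_int(dst_ip)
    return fold(hash_value)


def layer34_hash(src_port, dst_port, src_ip, dst_ip):
    """Port word (assumed packing), mixed with both IP addresses, then folded."""
    hash_value = (src_port << 16) | dst_port
    hash_value ^= ip_to_int(src_ip) ^ ip_to_int(dst_ip)
    return fold(hash_value)


def pick_slave(hash_value, slave_count):
    """slave number = hash modulo slave count"""
    return hash_value % slave_count
```

Under these assumptions, two TCP connections between the same pair of hosts that differ only in one port can map to different slaves with layer3+4, while layer2 keeps all traffic to a given peer on one slave.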
+ + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + The default value is layer2. This option was added in bonding + version 2.6.3. In earlier versions of bonding, this parameter + does not exist, and the layer2 policy is the only policy. The + layer2+3 value was added for bonding version 3.2.2. + +resend_igmp + + Specifies the number of IGMP membership reports to be issued after + a failover event. One membership report is issued immediately after + the failover; subsequent reports are sent at 200ms intervals. + + The valid range is 0 - 255; the default value is 1. A value of 0 + prevents the IGMP membership report from being issued in response + to the failover event. + + This option is useful for bonding modes balance-rr (0), active-backup + (1), balance-tlb (5) and balance-alb (6), in which a failover can + switch the IGMP traffic from one slave to another. Therefore a fresh + IGMP report must be issued to cause the switch to forward the incoming + IGMP traffic over the newly selected slave. + + This option was added for bonding version 3.7.0. + +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slave's peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This option + has an effect only in balance-tlb and balance-alb modes. + +3. Configuring Bonding Devices +============================== + +You can configure bonding using either your distro's network +initialization scripts, or manually using either iproute2 or the +sysfs interface. Distros generally use one of three packages for the +network initialization scripts: initscripts, sysconfig or interfaces.
+Recent versions of these packages have support for bonding, while older +versions do not. + +We will first describe the options for configuring bonding for +distros using versions of initscripts, sysconfig and interfaces with full +or partial support for bonding, then provide information on enabling +bonding without support from the network initialization scripts (i.e., +older versions of initscripts or sysconfig). + +If you're unsure whether your distro uses sysconfig, +initscripts or interfaces, or don't know if it's new enough, have no fear. +Determining this is fairly straightforward. + +First, look for a file called interfaces in the /etc/network directory. +If this file is present in your system, then your system uses interfaces. See +Configuration with Interfaces Support. + +Otherwise, issue the command:: + + $ rpm -qf /sbin/ifup + +It will respond with a line of text starting with either +"initscripts" or "sysconfig," followed by some numbers. This is the +package that provides your network initialization scripts. + +Next, to determine if your installation supports bonding, +issue the command:: + + $ grep ifenslave /sbin/ifup + +If this returns any matches, then your initscripts or +sysconfig has support for bonding. + +3.1 Configuration with Sysconfig Support +---------------------------------------- + +This section applies to distros using a version of sysconfig +with bonding support, for example, SuSE Linux Enterprise Server 9. + +SuSE SLES 9's networking configuration system does support +bonding, however, at this writing, the YaST system configuration +front end does not provide any means to work with bonding devices. +Bonding devices can be managed by hand, however, as follows. + +First, if they have not already been configured, configure the +slave devices. On SLES 9, this is most easily done by running the +yast2 sysconfig configuration utility. The goal is to create an +ifcfg-id file for each slave device.
The simplest way to accomplish +this is to configure the devices for DHCP (this is only to get the +ifcfg-id file created; see below for some issues with DHCP). The +name of the configuration file for each device will be of the form:: + + ifcfg-id-xx:xx:xx:xx:xx:xx + +Where the "xx" portion will be replaced with the digits from +the device's permanent MAC address. + +Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been +created, it is necessary to edit the configuration files for the slave +devices (the MAC addresses correspond to those of the slave devices). +Before editing, the file will contain multiple lines, and will look +something like this:: + + BOOTPROTO='dhcp' + STARTMODE='on' + USERCTL='no' + UNIQUE='XNzu.WeZGOGF+4wE' + _nm_name='bus-pci-0001:61:01.0' + +Change the BOOTPROTO and STARTMODE lines to the following:: + + BOOTPROTO='none' + STARTMODE='off' + +Do not alter the UNIQUE or _nm_name lines. Remove any other +lines (USERCTL, etc). + +Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, +it's time to create the configuration file for the bonding device +itself. This file is named ifcfg-bondX, where X is the number of the +bonding device to create, starting at 0. The first such file is +ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig +network configuration system will correctly start multiple instances +of bonding. + +The contents of the ifcfg-bondX file are as follows:: + + BOOTPROTO="static" + BROADCAST="10.0.2.255" + IPADDR="10.0.2.10" + NETMASK="255.255.0.0" + NETWORK="10.0.2.0" + REMOTE_IPADDR="" + STARTMODE="onboot" + BONDING_MASTER="yes" + BONDING_MODULE_OPTS="mode=active-backup miimon=100" + BONDING_SLAVE0="eth0" + BONDING_SLAVE1="bus-pci-0000:06:08.1" + +Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK +values with the appropriate values for your network. + +The STARTMODE specifies when the device is brought online.
+The possible values are: + + ======== ====================================================== + onboot The device is started at boot time. If you're not + sure, this is probably what you want. + + manual The device is started only when ifup is called + manually. Bonding devices may be configured this + way if you do not wish them to start automatically + at boot for some reason. + + hotplug The device is started by a hotplug event. This is not + a valid choice for a bonding device. + + off or The device configuration is ignored. + ignore + ======== ====================================================== + +The line BONDING_MASTER='yes' indicates that the device is a +bonding master device. The only useful value is "yes." + +The contents of BONDING_MODULE_OPTS are supplied to the +instance of the bonding module for this device. Specify the options +for the bonding mode, link monitoring, and so on here. Do not include +the max_bonds bonding parameter; this will confuse the configuration +system if you have multiple bonding devices. + +Finally, supply one BONDING_SLAVEn="slave device" for each +slave, where "n" is an increasing value, one for each slave. The +"slave device" is either an interface name, e.g., "eth0", or a device +specifier for the network device. The interface name is easier to +find, but the ethN names are subject to change at boot time if, e.g., +a device early in the sequence has failed. The device specifiers +(bus-pci-0000:06:08.1 in the example above) specify the physical +network device, and will not change unless the device's bus location +changes (for example, it is moved from one PCI slot to another). The +example above uses one of each type for demonstration purposes; most +configurations will choose one or the other for all slave devices. + +When all configuration files have been modified or created, +networking must be restarted for the configuration changes to take +effect.
This can be accomplished via the following:: + + # /etc/init.d/network restart + +Note that the network control script (/sbin/ifdown) will +remove the bonding module as part of the network shutdown processing, +so it is not necessary to remove the module by hand if, e.g., the +module parameters have changed. + +Also, at this writing, YaST/YaST2 will not manage bonding +devices (they do not show bonding interfaces on its list of network +devices). It is necessary to edit the configuration file by hand to +change the bonding configuration. + +Additional general options and details of the ifcfg file +format can be found in an example ifcfg template file:: + + /etc/sysconfig/network/ifcfg.template + +Note that the template does not document the various ``BONDING_*`` +settings described above, but does describe many of the other options. + +3.1.1 Using DHCP with Sysconfig +------------------------------- + +Under sysconfig, configuring a device with BOOTPROTO='dhcp' +will cause it to query DHCP for its IP address information. At this +writing, this does not function for bonding devices; the scripts +attempt to obtain the device address from DHCP prior to adding any of +the slave devices. Without active slaves, the DHCP requests are not +sent to the network. + +3.1.2 Configuring Multiple Bonds with Sysconfig +----------------------------------------------- + +The sysconfig network initialization system is capable of +handling multiple bonding devices. All that is necessary is for each +bonding instance to have an appropriately configured ifcfg-bondX file +(as described above). Do not specify the "max_bonds" parameter to any +instance of bonding, as this will confuse sysconfig. If you require +multiple bonding devices with identical parameters, create multiple +ifcfg-bondX files. + +Because the sysconfig scripts supply the bonding module +options in the ifcfg-bondX file, it is not necessary to add them to +the system ``/etc/modules.d/*.conf`` configuration files. 
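As a rough illustration of the multiple-bond case, the shell sketch below writes two ifcfg-bondX files with differing options into a scratch directory. The helper name ``write_bond_cfg``, the ``OUT_DIR`` variable, the addresses and the slave names are all assumptions made for demonstration; the fields themselves follow the ifcfg-bondX example earlier in this section, and on a real system the files would go wherever the distro's sysconfig scripts expect them.

```shell
#!/bin/sh
# Sketch only: generate ifcfg-bondX files in a scratch directory.
# OUT_DIR, the addresses and the slave names are illustrative assumptions.
OUT_DIR=${OUT_DIR:-$(mktemp -d)}

# write_bond_cfg INDEX IPADDR MODULE_OPTS SLAVE...
write_bond_cfg() {
    idx=$1; ip=$2; opts=$3
    shift 3
    cfg="$OUT_DIR/ifcfg-bond$idx"
    {
        echo "BOOTPROTO=\"static\""
        echo "IPADDR=\"$ip\""
        echo "NETMASK=\"255.255.255.0\""
        echo "STARTMODE=\"onboot\""
        echo "BONDING_MASTER=\"yes\""
        # max_bonds is deliberately absent: each file describes one bond.
        echo "BONDING_MODULE_OPTS=\"$opts\""
        n=0
        for slave in "$@"; do
            echo "BONDING_SLAVE$n=\"$slave\""
            n=$((n + 1))
        done
    } > "$cfg"
}

write_bond_cfg 0 10.0.2.10 "mode=active-backup miimon=100" eth0 eth1
write_bond_cfg 1 10.0.3.10 "mode=balance-alb miimon=50" eth2 eth3
```

Each generated file carries its own BONDING_MODULE_OPTS line, which is what lets the two bonds use different modes without any entry in the module configuration files.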
+ +3.2 Configuration with Initscripts Support +------------------------------------------ + +This section applies to distros using a recent version of +initscripts with bonding support, for example, Red Hat Enterprise Linux +version 3 or later, Fedora, etc. On these systems, the network +initialization scripts have knowledge of bonding, and can be configured to +control bonding devices. Note that older versions of the initscripts +package have lower levels of support for bonding; this will be noted where +applicable. + +These distros will not automatically load the network adapter +driver unless the ethX device is configured with an IP address. +Because of this constraint, users must manually configure a +network-script file for all physical adapters that will be members of +a bondX link. Network script files are located in the directory: + +/etc/sysconfig/network-scripts + +The file name must be prefixed with "ifcfg-eth" and suffixed +with the adapter's physical adapter number. For example, the script +for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. +Place the following text in the file:: + + DEVICE=eth0 + USERCTL=no + ONBOOT=yes + MASTER=bond0 + SLAVE=yes + BOOTPROTO=none + +The DEVICE= line will be different for every ethX device and +must correspond with the name of the file, i.e., ifcfg-eth1 must have +a device line of DEVICE=eth1. The setting of the MASTER= line will +also depend on the final bonding interface name chosen for your bond. +As with other network devices, these typically start at 0, and go up +one for each device, i.e., the first bonding instance is bond0, the +second is bond1, and so on. + +Next, create a bond network script. The file name for this +script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is +the number of the bond. For bond0 the file is named "ifcfg-bond0", +for bond1 it is named "ifcfg-bond1", and so on. 
Within that file, +place the following text:: + + DEVICE=bond0 + IPADDR=192.168.1.1 + NETMASK=255.255.255.0 + NETWORK=192.168.1.0 + BROADCAST=192.168.1.255 + ONBOOT=yes + BOOTPROTO=none + USERCTL=no + +Be sure to change the networking specific lines (IPADDR, +NETMASK, NETWORK and BROADCAST) to match your network configuration. + +For later versions of initscripts, such as that found with Fedora +7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, +and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 +file, e.g. a line of the format:: + + BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" + +will configure the bond with the specified options. The options +specified in BONDING_OPTS are identical to the bonding module parameters +except for the arp_ip_target field when using versions of initscripts older +than 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When +using older versions each target should be included as a separate option and +should be preceded by a '+' to indicate it should be added to the list of +queried targets, e.g.,:: + + arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 + +is the proper syntax to specify multiple targets. When specifying +options via BONDING_OPTS, it is not necessary to edit +``/etc/modprobe.d/*.conf``. + +For even older versions of initscripts that do not support +BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf (depending upon +your distro) to load the bonding module with your desired options when the +bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf +will load the bonding module, and select its options:: + + alias bond0 bonding + options bond0 mode=balance-alb miimon=100 + +Replace the sample parameters with the appropriate set of +options for your configuration. + +Finally run "/etc/rc.d/init.d/network restart" as root.
This +will restart the networking subsystem and your bond link should be now +up and running. + +3.2.1 Using DHCP with Initscripts +--------------------------------- + +Recent versions of initscripts (the versions supplied with Fedora +Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to +work) have support for assigning IP information to bonding devices via +DHCP. + +To configure bonding for DHCP, configure it as described +above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" +and add a line consisting of "TYPE=Bonding". Note that the TYPE value +is case sensitive. + +3.2.2 Configuring Multiple Bonds with Initscripts +------------------------------------------------- + +Initscripts packages that are included with Fedora 7 and Red Hat +Enterprise Linux 5 support multiple bonding interfaces by simply +specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the +number of the bond. This support requires sysfs support in the kernel, +and a bonding driver of version 3.0.0 or later. Other configurations may +not support this method for specifying multiple bonding interfaces; for +those instances, see the "Configuring Multiple Bonds Manually" section, +below. + +3.3 Configuring Bonding Manually with iproute2 +----------------------------------------------- + +This section applies to distros whose network initialization +scripts (the sysconfig or initscripts package) do not have specific +knowledge of bonding. One such distro is SuSE Linux Enterprise Server +version 8. + +The general method for these systems is to place the bonding +module parameters into a config file in /etc/modprobe.d/ (as +appropriate for the installed distro), then add modprobe and/or +`ip link` commands to the system's global init script. The name of +the global init script differs; for sysconfig, it is +/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. 
+ +For example, if you wanted to make a simple bond of two e100 +devices (presumed to be eth0 and eth1), and have it persist across +reboots, edit the appropriate file (/etc/init.d/boot.local or +/etc/rc.d/rc.local), and add the following:: + + modprobe bonding mode=balance-alb miimon=100 + modprobe e100 + ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up + ip link set eth0 master bond0 + ip link set eth1 master bond0 + +Replace the example bonding module parameters and bond0 +network configuration (IP address, netmask, etc) with the appropriate +values for your configuration. + +Unfortunately, this method will not provide support for the +ifup and ifdown scripts on the bond devices. To reload the bonding +configuration, it is necessary to run the initialization script, e.g.,:: + + # /etc/init.d/boot.local + +or:: + + # /etc/rc.d/rc.local + +It may be desirable in such a case to create a separate script +which only initializes the bonding configuration, then call that +separate script from within boot.local. This allows for bonding to be +enabled without re-running the entire global init script. + +To shut down the bonding devices, it is necessary to first +mark the bonding device itself as being down, then remove the +appropriate device driver modules. For our example above, you can do +the following:: + + # ifconfig bond0 down + # rmmod bonding + # rmmod e100 + +Again, for convenience, it may be desirable to create a script +with these commands. + + +3.3.1 Configuring Multiple Bonds Manually +----------------------------------------- + +This section contains information on configuring multiple +bonding devices with differing options for those systems whose network +initialization scripts lack support for configuring multiple bonds. + +If you require multiple bonding devices, but all with the same +options, you may wish to use the "max_bonds" module parameter, +documented above. 
+ +To create multiple bonding devices with differing options, it is +preferable to use bonding parameters exported by sysfs, documented in the +section below. + +For versions of bonding without sysfs support, the only means to +provide multiple instances of bonding with differing options is to load +the bonding driver multiple times. Note that current versions of the +sysconfig network initialization scripts handle this automatically; if +your distro uses these scripts, no special action is needed. See the +section Configuring Bonding Devices, above, if you're not sure about your +network initialization scripts. + +To load multiple instances of the module, it is necessary to +specify a different name for each instance (the module loading system +requires that every loaded module, even multiple instances of the same +module, have a unique name). This is accomplished by supplying multiple +sets of bonding options in ``/etc/modprobe.d/*.conf``, for example:: + + alias bond0 bonding + options bond0 -o bond0 mode=balance-rr miimon=100 + + alias bond1 bonding + options bond1 -o bond1 mode=balance-alb miimon=50 + +will load the bonding module two times. The first instance is +named "bond0" and creates the bond0 device in balance-rr mode with an +miimon of 100. The second instance is named "bond1" and creates the +bond1 device in balance-alb mode with an miimon of 50. + +In some circumstances (typically with older distributions), +the above does not work, and the second bonding instance never sees +its options. In that case, the second options line can be substituted +as follows:: + + install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 + +This may be repeated any number of times, specifying a new and +unique name in place of bond1 for each subsequent instance. + +It has been observed that some Red Hat supplied kernels are unable +to rename modules at load time (the "-o bond1" part). 
Attempts to pass +that option to modprobe will produce an "Operation not permitted" error. +This has been reported on some Fedora Core kernels, and has been seen on +RHEL 4 as well. On kernels exhibiting this problem, it will be impossible +to configure multiple bonds with differing parameters (as they are older +kernels, and also lack sysfs support). + +3.4 Configuring Bonding Manually via Sysfs +------------------------------------------ + +Starting with version 3.0.0, Channel Bonding may be configured +via the sysfs interface. This interface allows dynamic configuration +of all bonds in the system without unloading the module. It also +allows for adding and removing bonds at runtime. Ifenslave is no +longer required, though it is still supported. + +Use of the sysfs interface allows you to use multiple bonds +with different configurations without having to reload the module. +It also allows you to use multiple, differently configured bonds when +bonding is compiled into the kernel. + +You must have the sysfs filesystem mounted to configure +bonding this way. The examples in this document assume that you +are using the standard mount point for sysfs, e.g. /sys. If your +sysfs filesystem is mounted elsewhere, you will need to adjust the +example paths accordingly. + +Creating and Destroying Bonds +----------------------------- +To add a new bond foo:: + + # echo +foo > /sys/class/net/bonding_masters + +To remove an existing bond bar:: + + # echo -bar > /sys/class/net/bonding_masters + +To show all existing bonds:: + + # cat /sys/class/net/bonding_masters + +.. note:: + + due to 4K size limitation of sysfs files, this list may be + truncated if you have more than a few hundred bonds. This is unlikely + to occur under normal operating conditions. + +Adding and Removing Slaves +-------------------------- +Interfaces may be enslaved to a bond using the file +/sys/class/net//bonding/slaves. The semantics for this file +are the same as for the bonding_masters file. 
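Because both files are written with +name and -name entries, a script may want to check whether a name is already listed before writing. The helper below is a minimal sketch under the assumption that these files hold whitespace-separated names; the function name ``list_has`` is hypothetical, and the demonstration reads a scratch file rather than a real sysfs entry.

```shell
#!/bin/sh
# list_has FILE NAME: succeed if NAME appears as a whole entry in FILE.
# Assumes FILE holds whitespace-separated names (an assumption about the
# layout of bonding_masters and the per-bond slaves file).
list_has() {
    for entry in $(cat "$1" 2>/dev/null); do
        [ "$entry" = "$2" ] && return 0
    done
    return 1
}

# Demonstration against a scratch file standing in for the slaves file
# of a bond; no sysfs access or root privileges are needed.
demo=$(mktemp)
echo "eth0 eth1" > "$demo"

if list_has "$demo" eth0; then
    echo "eth0 is already enslaved"
fi
if ! list_has "$demo" eth2; then
    echo "eth2 is not enslaved yet"
fi
```

Comparing whole entries rather than using a substring match avoids treating eth1 as present when only eth10 is listed.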
+ +To enslave interface eth0 to bond bond0:: + + # ifconfig bond0 up + # echo +eth0 > /sys/class/net/bond0/bonding/slaves + +To free slave eth0 from bond bond0:: + + # echo -eth0 > /sys/class/net/bond0/bonding/slaves + +When an interface is enslaved to a bond, symlinks between the +two are created in the sysfs filesystem. In this case, you would get +/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and +/sys/class/net/eth0/master pointing to /sys/class/net/bond0. + +This means that you can tell quickly whether or not an +interface is enslaved by looking for the master symlink. Thus:: + + # echo -eth0 > /sys/class/net/eth0/master/bonding/slaves + +will free eth0 from whatever bond it is enslaved to, regardless of +the name of the bond interface. + +Changing a Bond's Configuration +------------------------------- +Each bond may be configured individually by manipulating the +files located in /sys/class/net//bonding + +The names of these files correspond directly with the command- +line parameters described elsewhere in this file, and, with the +exception of arp_ip_target, they accept the same values. To see the +current setting, simply cat the appropriate file. + +A few examples will be given here; for specific usage +guidelines for each parameter, see the appropriate section in this +document. + +To configure bond0 for balance-alb mode:: + + # ifconfig bond0 down + # echo 6 > /sys/class/net/bond0/bonding/mode + - or - + # echo balance-alb > /sys/class/net/bond0/bonding/mode + +.. note:: + + The bond interface must be down before the mode can be changed. + +To enable MII monitoring on bond0 with a 1 second interval:: + + # echo 1000 > /sys/class/net/bond0/bonding/miimon + +.. note:: + + If ARP monitoring is enabled, it will be disabled when MII + monitoring is enabled, and vice-versa. + +To add ARP targets:: + + # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target + +..
note:: + + up to 16 target addresses may be specified. + +To remove an ARP target:: + + # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + +To configure the interval between learning packet transmits:: + + # echo 12 > /sys/class/net/bond0/bonding/lp_interval + +.. note:: + + the lp_interval is the number of seconds between instances where + the bonding driver sends learning packets to each slave's peer switch. The + default interval is 1 second. + +Example Configuration +--------------------- +We begin with the same example that is shown in section 3.3, +executed with sysfs, and without using ifenslave. + +To make a simple bond of two e100 devices (presumed to be eth0 +and eth1), and have it persist across reboots, edit the appropriate +file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the +following:: + + modprobe bonding + modprobe e100 + echo balance-alb > /sys/class/net/bond0/bonding/mode + ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up + echo 100 > /sys/class/net/bond0/bonding/miimon + echo +eth0 > /sys/class/net/bond0/bonding/slaves + echo +eth1 > /sys/class/net/bond0/bonding/slaves + +To add a second bond, with two e1000 interfaces in +active-backup mode, using ARP monitoring, add the following lines to +your init script:: + + modprobe e1000 + echo +bond1 > /sys/class/net/bonding_masters + echo active-backup > /sys/class/net/bond1/bonding/mode + ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up + echo +192.168.2.100 > /sys/class/net/bond1/bonding/arp_ip_target + echo 2000 > /sys/class/net/bond1/bonding/arp_interval + echo +eth2 > /sys/class/net/bond1/bonding/slaves + echo +eth3 > /sys/class/net/bond1/bonding/slaves + +3.5 Configuration with Interfaces Support +----------------------------------------- + +This section applies to distros which use /etc/network/interfaces file +to describe network interface configuration, most notably Debian and its +derivatives.
+ +The ifup and ifdown commands on Debian don't support bonding out of +the box. The ifenslave-2.6 package should be installed to provide bonding +support. Once installed, this package will provide ``bond-*`` options +to be used in /etc/network/interfaces. + +Note that the ifenslave-2.6 package will load the bonding module and use +the ifenslave command when appropriate. + +Example Configurations +---------------------- + +In /etc/network/interfaces, the following stanza will configure bond0, in +active-backup mode, with eth0 and eth1 as slaves:: + + auto bond0 + iface bond0 inet dhcp + bond-slaves eth0 eth1 + bond-mode active-backup + bond-miimon 100 + bond-primary eth0 eth1 + +If the above configuration doesn't work, you might have a system using +upstart for system startup. This is most notably true for recent +Ubuntu versions. The following stanza in /etc/network/interfaces will +produce the same result on those systems:: + + auto bond0 + iface bond0 inet dhcp + bond-slaves none + bond-mode active-backup + bond-miimon 100 + + auto eth0 + iface eth0 inet manual + bond-master bond0 + bond-primary eth0 eth1 + + auto eth1 + iface eth1 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +For a full list of ``bond-*`` supported options in /etc/network/interfaces and +some more advanced examples tailored to your particular distro, see the files in +/usr/share/doc/ifenslave-2.6. + +3.6 Overriding Configuration for Special Cases +---------------------------------------------- + +When using the bonding driver, the physical port which transmits a frame is +typically selected by the bonding driver, and is not relevant to the user or +system administrator. The output port is simply selected using the policies of +the selected bonding mode. On occasion however, it is helpful to direct certain +classes of traffic to certain physical interfaces on output to implement +slightly more complex policies.
For example, to reach a web server over a +bonded interface in which eth0 connects to a private network, while eth1 +connects via a public network, it may be desirable to bias the bond to send said +traffic over eth0 first, using eth1 only as a fall back, while all other traffic +can safely be sent over either interface. Such configurations may be achieved +using the traffic control utilities inherent in Linux. + +By default the bonding driver is multiqueue aware and 16 queues are created +when the driver initializes (see Documentation/networking/multiqueue.txt +for details). If more or fewer queues are desired, the module parameter +tx_queues can be used to change this value. There is no sysfs parameter +available as the allocation is done at module init time. + +The output of the file /proc/net/bonding/bondX has changed so the output Queue +ID is now printed for each slave:: + + Bonding Mode: fault-tolerance (active-backup) + Primary Slave: None + Currently Active Slave: eth0 + MII Status: up + MII Polling Interval (ms): 0 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth0 + MII Status: up + Link Failure Count: 0 + Permanent HW addr: 00:1a:a0:12:8f:cb + Slave queue ID: 0 + + Slave Interface: eth1 + MII Status: up + Link Failure Count: 0 + Permanent HW addr: 00:1a:a0:12:8f:cc + Slave queue ID: 2 + +The queue_id for a slave can be set using the command:: + + # echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id + +Any interface that needs a queue_id set should set it with multiple calls +like the one above until proper priorities are set for all interfaces. On +distributions that allow configuration via initscripts, multiple 'queue_id' +arguments can be added to BONDING_OPTS to set all needed slave queues. + +These queue IDs can be used in conjunction with the tc utility to configure +a multiqueue qdisc and filters to bias certain traffic to transmit on certain +slave devices.
For instance, say we wanted, in the above configuration, to +force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output +device. The following commands would accomplish this:: + + # tc qdisc add dev bond0 handle 1 root multiq + + # tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip \ + dst 192.168.1.100 action skbedit queue_mapping 2 + +These commands tell the kernel to attach a multiqueue queue discipline to the +bond0 interface and filter traffic enqueued to it, such that packets with a dst +ip of 192.168.1.100 have their output queue mapping value overwritten to 2. +This value is then passed into the driver, causing the normal output path +selection policy to be overridden, selecting instead qid 2, which maps to eth1. + +Note that qid values begin at 1. Qid 0 is reserved to indicate to the driver +that normal output policy selection should take place. One benefit to simply +leaving the qid for a slave at 0 is the multiqueue awareness in the bonding +driver that is now present. This awareness allows tc filters to be placed on +slave devices as well as bond devices and the bonding driver will simply act as +a pass-through for selecting output queues on the slave device rather than +output port selection. + +This feature first appeared in bonding driver version 3.7.0 and support for +output slave selection was limited to round-robin and active-backup modes. + +3.7 Configuring LACP for 802.3ad mode in a more secure way +---------------------------------------------------------- + +When using 802.3ad bonding mode, the Actor (host) and Partner (switch) +exchange LACPDUs. These LACPDUs cannot be sniffed, because they are +destined to link local mac addresses (which switches/bridges are not +supposed to forward). However, most of the values are easily predictable +or are simply the machine's MAC address (which is trivially known to all +other hosts in the same L2).
This implies that other machines in the L2 +domain can spoof LACPDU packets from other hosts to the switch and potentially +cause mayhem by joining (from the point of view of the switch) another +machine's aggregate, thus receiving a portion of that host's incoming +traffic and / or spoofing traffic from that machine themselves (potentially +even successfully terminating some portion of flows). Though this is not +a likely scenario, one could avoid this possibility by simply configuring a +few bonding parameters: + + (a) ad_actor_system : You can set a random mac-address that can be used for + these LACPDU exchanges. The value can not be either NULL or Multicast. + Also it's preferable to set the local-admin bit. The following shell code + generates a random mac-address as described above:: + + # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ + $(( (RANDOM & 0xFE) | 0x02 )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF ))) + # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system + + (b) ad_actor_sys_prio : Randomize the system priority. The default value + is 65535, but the system can take any value from 1 - 65535. The following shell + code generates a random priority and sets it:: + + # sys_prio=$(( 1 + RANDOM + RANDOM )) + # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio + + (c) ad_user_port_key : Use the user portion of the port-key. The default + keeps this empty. These are the upper 10 bits of the port-key and value + ranges from 0 - 1023. The following shell code generates these 10 bits and + sets them:: + + # usr_port_key=$(( RANDOM & 0x3FF )) + # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key + + +4 Querying Bonding Configuration +================================= + +4.1 Bonding Configuration +------------------------- + +Each bonding device has a read-only file residing in the +/proc/net/bonding directory.
The file contents include information +about the bonding configuration, options and state of each slave. + +For example, the contents of /proc/net/bonding/bond0 after the +driver is loaded with parameters of mode=0 and miimon=1000 is +generally as follows:: + + Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) + Bonding Mode: load balancing (round-robin) + Currently Active Slave: eth0 + MII Status: up + MII Polling Interval (ms): 1000 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth1 + MII Status: up + Link Failure Count: 1 + + Slave Interface: eth0 + MII Status: up + Link Failure Count: 1 + +The precise format and contents will change depending upon the +bonding configuration, state, and version of the bonding driver. + +4.2 Network configuration +------------------------- + +The network configuration can be inspected using the ifconfig +command. Bonding devices will have the MASTER flag set; Bonding slave +devices will have the SLAVE flag set. The ifconfig output does not +contain information on which slaves are associated with which masters. + +In the example below, the bond0 interface is the master +(MASTER) while eth0 and eth1 are slaves (SLAVE). 
Notice that all slaves of +bond0 have the same MAC address (HWaddr) as bond0 for all modes except +TLB and ALB, which require a unique MAC address for each slave:: + + # /sbin/ifconfig + bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 + UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 + RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 + TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:0 + + eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:10 Base address:0x1080 + + eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:9 Base address:0x1400 + +5. Switch Configuration +======================= + +For this section, "switch" refers to whatever system the +bonded devices are directly connected to (i.e., where the other end of +the cable plugs into). This may be an actual dedicated switch device, +or it may be another regular system (e.g., another computer running +Linux). + +The active-backup, balance-tlb and balance-alb modes do not +require any specific configuration of the switch. + +The 802.3ad mode requires that the switch have the appropriate +ports configured as an 802.3ad aggregation. The precise method used +to configure this varies from switch to switch, but, for example, a +Cisco 3550 series switch requires that the appropriate ports first be +grouped together in a single etherchannel instance, then that +etherchannel is set to mode "lacp" to enable 802.3ad (instead of +standard EtherChannel).
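The mode-to-switch requirements described in this section can be summarized in a small lookup. The following is only a sketch: switch_setup_for_mode is a hypothetical helper, the bond name bond0 is an assumption, and the wording of each result is a paraphrase of this section rather than any switch vendor's terminology.

```shell
# Hypothetical helper: map a bonding mode name to the switch-side setup
# that this section describes for it.
switch_setup_for_mode() {
    case "$1" in
        active-backup|balance-tlb|balance-alb)
            echo "none" ;;
        802.3ad)
            echo "802.3ad (LACP) aggregation" ;;
        balance-rr|balance-xor|broadcast)
            echo "static port group (etherchannel / trunk group)" ;;
        *)
            echo "unknown mode" ;;
    esac
}

switch_setup_for_mode 802.3ad

# If a bond0 device exists (assumed name), look up its current mode.
# The sysfs mode file holds the mode name followed by its number.
MODE_FILE=/sys/class/net/bond0/bonding/mode
if [ -r "$MODE_FILE" ]; then
    read -r mode rest < "$MODE_FILE"
    echo "bond0 ($mode): $(switch_setup_for_mode "$mode")"
fi
```

On a host without bond0 the script only prints the result for the 802.3ad example.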
+ +The balance-rr, balance-xor and broadcast modes generally +require that the switch have the appropriate ports grouped together. +The nomenclature for such a group differs between switches, it may be +called an "etherchannel" (as in the Cisco example, above), a "trunk +group" or some other similar variation. For these modes, each switch +will also have its own configuration options for the switch's transmit +policy to the bond. Typical choices include XOR of either the MAC or +IP addresses. The transmit policy of the two peers does not need to +match. For these three modes, the bonding mode really selects a +transmit policy for an EtherChannel group; all three will interoperate +with another EtherChannel group. + + +6. 802.1q VLAN Support +====================== + +It is possible to configure VLAN devices over a bond interface +using the 8021q driver. However, only packets coming from the 8021q +driver and passing through bonding will be tagged by default. Self +generated packets, for example, bonding's learning packets or ARP +packets generated by either ALB mode or the ARP monitor mechanism, are +tagged internally by bonding itself. As a result, bonding must +"learn" the VLAN IDs configured above it, and use those IDs to tag +self generated packets. + +For reasons of simplicity, and to support the use of adapters +that can do VLAN hardware acceleration offloading, the bonding +interface declares itself as fully hardware offloading capable, it gets +the add_vid/kill_vid notifications to gather the necessary +information, and it propagates those actions to the slaves. In case +of mixed adapter types, hardware accelerated tagged packets that +should go through an adapter that is not offloading capable are +"un-accelerated" by the bonding driver so the VLAN tag sits in the +regular location. + +VLAN interfaces *must* be added on top of a bonding interface +only after enslaving at least one slave. 
The bonding interface has a +hardware address of 00:00:00:00:00:00 until the first slave is added. +If the VLAN interface is created prior to the first enslavement, it +would pick up the all-zeroes hardware address. Once the first slave +is attached to the bond, the bond device itself will pick up the +slave's hardware address, which is then available for the VLAN device. + +Also, be aware that a similar problem can occur if all slaves +are released from a bond that still has one or more VLAN interfaces on +top of it. When a new slave is added, the bonding interface will +obtain its hardware address from the first slave, which might not +match the hardware address of the VLAN interfaces (which was +ultimately copied from an earlier slave). + +There are two methods to ensure that the VLAN device operates +with the correct hardware address if all slaves are removed from a +bond interface: + +1. Remove all VLAN interfaces then recreate them + +2. Set the bonding interface's hardware address so that it +matches the hardware address of the VLAN interfaces. + +Note that changing a VLAN interface's HW address would set the +underlying device -- i.e. the bonding interface -- to promiscuous +mode, which might not be what you want. + + +7. Link Monitoring +================== + +The bonding driver at present supports two schemes for +monitoring a slave device's link state: the ARP monitor and the MII +monitor. + +At the present time, due to implementation restrictions in the +bonding driver itself, it is not possible to enable both ARP and MII +monitoring simultaneously. + +7.1 ARP Monitor Operation +------------------------- + +The ARP monitor operates as its name suggests: it sends ARP +queries to one or more designated peer systems on the network, and +uses the response as an indication that the link is operating. This +gives some assurance that traffic is actually flowing to and from one +or more peers on the local network.
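As a minimal sketch of enabling the ARP monitor at run time, the snippet below writes to the bonding sysfs files arp_ip_target and arp_interval. The bond name bond0, the peer addresses and the 1000 ms interval are assumptions chosen for illustration, and arp_target_tokens is a hypothetical helper, not part of any existing tool.

```shell
# Hypothetical helper: print one sysfs token per peer address.
# A leading "+" asks the bonding driver to add that ARP target.
arp_target_tokens() {
    for ip in "$@"; do
        echo "+$ip"
    done
}

BOND=/sys/class/net/bond0/bonding   # assumed bond name

# Only touch sysfs if the bond exists and is writable (requires root).
if [ -w "$BOND/arp_ip_target" ]; then
    arp_target_tokens 192.168.0.1 192.168.0.3 | while read -r tok; do
        echo "$tok" > "$BOND/arp_ip_target"
    done
    echo 1000 > "$BOND/arp_interval"   # probe interval in ms (illustrative)
fi
```

Keeping the token generation separate from the sysfs writes lets the target list be checked on a machine that has no bond configured.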
+ +The ARP monitor relies on the device driver itself to verify +that traffic is flowing. In particular, the driver must keep up to +date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX +flag must also update netdev_queue->trans_start. If they do not, then the +ARP monitor will immediately fail any slaves using that driver, and +those slaves will stay down. If networking monitoring (tcpdump, etc) +shows the ARP requests and replies on the network, then it may be that +your device driver is not updating last_rx and trans_start. + +7.2 Configuring Multiple ARP Targets +------------------------------------ + +While ARP monitoring can be done with just one target, it can +be useful in a High Availability setup to have several targets to +monitor. In the case of just one target, the target itself may go +down or have a problem making it unresponsive to ARP requests. Having +an additional target (or several) increases the reliability of the ARP +monitoring. + +Multiple ARP targets must be separated by commas as follows:: + + # example options for ARP monitoring with three targets + alias bond0 bonding + options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 + +For just a single target the options would resemble:: + + # example options for ARP monitoring with one target + alias bond0 bonding + options bond0 arp_interval=60 arp_ip_target=192.168.0.100 + + +7.3 MII Monitor Operation +------------------------- + +The MII monitor monitors only the carrier state of the local +network interface. It accomplishes this in one of three ways: by +depending upon the device driver to maintain its carrier state, by +querying the device's MII registers, or by making an ethtool query to +the device. + +If the use_carrier module parameter is 1 (the default value), +then the MII monitor will rely on the driver for carrier state +information (via the netif_carrier subsystem). 
As explained in the +use_carrier parameter information, above, if the MII monitor fails to +detect carrier loss on the device (e.g., when the cable is physically +disconnected), it may be that the driver does not support +netif_carrier. + +If use_carrier is 0, then the MII monitor will first query the +device's MII registers (via ioctl) and check the link state. If that +request fails (not just that it returns carrier down), then the MII +monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain +the same information. If both methods fail (i.e., the driver either +does not support or had some error in processing both the MII register +and ethtool requests), then the MII monitor will assume the link is +up. + +8. Potential Sources of Trouble +=============================== + +8.1 Adventures in Routing +------------------------- + +When bonding is configured, it is important that the slave +devices not have routes that supersede routes of the master (or, +generally, not have routes at all). For example, suppose the bonding +device bond0 has two slaves, eth0 and eth1, and the routing table is +as follows:: + + Kernel IP routing table + Destination Gateway Genmask Flags MSS Window irtt Iface + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 + 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo + +This routing configuration will likely still update the +receive/transmit times in the driver (needed by the ARP monitor), but +may bypass the bonding driver (because outgoing traffic to, in this +case, another host on network 10 would use eth0 or eth1 before bond0). + +The ARP monitor (and ARP itself) may become confused by this +configuration, because ARP requests (generated by the ARP monitor) +will be sent on one interface (bond0), but the corresponding reply +will arrive on a different interface (eth0).
This reply looks to ARP +as an unsolicited ARP reply (because ARP matches replies on an +interface basis), and is discarded. The MII monitor is not affected +by the state of the routing table. + +The solution here is simply to ensure that slaves do not have +routes of their own, and if for some reason they must, those routes do +not supersede routes of their master. This should generally be the +case, but unusual configurations or errant manual or automatic static +route additions may cause trouble. + +8.2 Ethernet Device Renaming +---------------------------- + +On systems with network configuration scripts that do not +associate physical devices directly with network interface names (so +that the same physical device always has the same "ethX" name), it may +be necessary to add some special logic to config files in +/etc/modprobe.d/. + +For example, given a modules.conf containing the following:: + + alias bond0 bonding + options bond0 mode=some-mode miimon=50 + alias eth0 tg3 + alias eth1 tg3 + alias eth2 e1000 + alias eth3 e1000 + +If neither eth0 nor eth1 is a slave to bond0, then when the +bond0 interface comes up, the devices may end up reordered. This +happens because bonding is loaded first, then its slave devices' +drivers are loaded next. Since no other drivers have been loaded, +when the e1000 driver loads, it will receive eth0 and eth1 for its +devices, but the bonding configuration tries to enslave eth2 and eth3 +(which may later be assigned to the tg3 devices). + +Adding the following:: + + add above bonding e1000 tg3 + +causes modprobe to load e1000 then tg3, in that order, when +bonding is loaded. This command is fully documented in the +modules.conf manual page. + +On systems utilizing modprobe an equivalent problem can occur. +In this case, the following can be added to config files in +/etc/modprobe.d/ as:: + + softdep bonding pre: tg3 e1000 + +This will load tg3 and e1000 modules before loading the bonding one.
+Full documentation on this can be found in the modprobe.d and modprobe +manual pages. + +8.3. Painfully Slow Or No Failed Link Detection By Miimon +--------------------------------------------------------- + +By default, bonding enables the use_carrier option, which +instructs bonding to trust the driver to maintain carrier state. + +As discussed in the options section, above, some drivers do +not support the netif_carrier_on/_off link state tracking system. +With use_carrier enabled, bonding will always see these links as up, +regardless of their actual state. + +Additionally, other drivers do support netif_carrier, but do +not maintain it in real time, e.g., only polling the link state at +some fixed interval. In this case, miimon will detect failures, but +only after some long period of time has expired. If it appears that +miimon is very slow in detecting link failures, try specifying +use_carrier=0 to see if that improves the failure detection time. If +it does, then it may be that the driver checks the carrier state at a +fixed interval, but does not cache the MII register values (so the +use_carrier=0 method of querying the registers directly works). If +use_carrier=0 does not improve the failover, then the driver may cache +the registers, or the problem may be elsewhere. + +Also, remember that miimon only checks for the device's +carrier state. It has no way to determine the state of devices on or +beyond other ports of a switch, or if a switch is refusing to pass +traffic while still maintaining carrier on. + +9. SNMP agents +=============== + +If running SNMP agents, the bonding driver should be loaded +before any network drivers participating in a bond. This requirement +is due to the interface index (ipAdEntIfIndex) being associated to +the first interface found with a given IP address. That is, there is +only one ipAdEntIfIndex for each IP address. 
For example, if eth0 and +eth1 are slaves of bond0 and the driver for eth0 is loaded before the +bonding driver, the interface for the IP address will be associated +with the eth0 interface. This configuration is shown below; the IP +address 192.168.1.1 has an interface index of 2 which indexes to eth0 +in the ifDescr table (ifDescr.2). + +:: + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = eth0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth1 + interfaces.ifTable.ifEntry.ifDescr.4 = eth2 + interfaces.ifTable.ifEntry.ifDescr.5 = eth3 + interfaces.ifTable.ifEntry.ifDescr.6 = bond0 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + +This problem is avoided by loading the bonding driver before +any network drivers participating in a bond. Below is an example of +loading the bonding driver first; the IP address 192.168.1.1 is +correctly associated with ifDescr.2:: + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = bond0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth0 + interfaces.ifTable.ifEntry.ifDescr.4 = eth1 + interfaces.ifTable.ifEntry.ifDescr.5 = eth2 + interfaces.ifTable.ifEntry.ifDescr.6 = eth3 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + +While some distributions may not report the interface name in +ifDescr, the association between the IP address and IfIndex remains +and SNMP functions such as Interface_Scan_Next will report that +association. + +10.
Promiscuous mode +==================== + +When running network monitoring tools, e.g., tcpdump, it is +common to enable promiscuous mode on the device, so that all traffic +is seen (instead of seeing only traffic destined for the local host). +The bonding driver handles promiscuous mode changes to the bonding +master device (e.g., bond0), and propagates the setting to the slave +devices. + +For the balance-rr, balance-xor, broadcast, and 802.3ad modes, +the promiscuous mode setting is propagated to all slaves. + +For the active-backup, balance-tlb and balance-alb modes, the +promiscuous mode setting is propagated only to the active slave. + +For balance-tlb mode, the active slave is the slave currently +receiving inbound traffic. + +For balance-alb mode, the active slave is the slave used as a +"primary." This slave is used for mode-specific control traffic, for +sending to peers that are unassigned or if the load is unbalanced. + +For the active-backup, balance-tlb and balance-alb modes, when +the active slave changes (e.g., due to a link failure), the +promiscuous setting will be propagated to the new active slave. + +11. Configuring Bonding for High Availability +============================================= + +High Availability refers to configurations that provide +maximum network availability by having redundant or backup devices, +links or switches between the host and the rest of the world. The +goal is to provide the maximum availability of network connectivity +(i.e., the network always works), even though other configurations +could provide higher throughput. + +11.1 High Availability in a Single Switch Topology +-------------------------------------------------- + +If two hosts (or a host and a single switch) are directly +connected via multiple physical links, then there is no availability +penalty to optimizing for maximum bandwidth. In this case, there is +only one switch (or peer), so if it fails, there is no alternative +access to fail over to. 
Additionally, the bonding load balance modes +support link monitoring of their members, so if individual links fail, +the load will be rebalanced across the remaining devices. + +See Section 12, "Configuring Bonding for Maximum Throughput" +for information on configuring bonding with one peer device. + +11.2 High Availability in a Multiple Switch Topology +---------------------------------------------------- + +With multiple switches, the configuration of bonding and the +network changes dramatically. In multiple switch topologies, there is +a trade off between network availability and usable bandwidth. + +Below is a sample network, configured to maximize the +availability of the network:: + + | | + |port3 port3| + +-----+----+ +-----+----+ + | |port2 ISL port2| | + | switch A +--------------------------+ switch B | + | | | | + +-----+----+ +-----++---+ + |port1 port1| + | +-------+ | + +-------------+ host1 +---------------+ + eth0 +-------+ eth1 + +In this configuration, there is a link between the two +switches (ISL, or inter switch link), and multiple ports connecting to +the outside world ("port3" on each switch). There is no technical +reason that this could not be extended to a third switch. + +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + +In a topology such as the example above, the active-backup and +broadcast modes are the only useful bonding modes when optimizing for +availability; the other modes require all links to terminate on the +same peer for them to behave rationally. + +active-backup: + This is generally the preferred mode, particularly if + the switches have an ISL and play together well. If the + network configuration is such that one switch is specifically + a backup switch (e.g., has lower capacity, higher cost, etc), + then the primary option can be used to ensure that the + preferred link is always used when it is available.
+ +broadcast: + This mode is really a special purpose mode, and is suitable + only for very specific needs. One example is when the two + switches are not connected (no ISL), and the networks beyond + them are totally independent. In this case, if it is + necessary for some specific one-way traffic to reach both + independent networks, then the broadcast mode may be suitable. + +11.2.2 HA Link Monitoring Selection for Multiple Switch Topology +---------------------------------------------------------------- + +The choice of link monitoring ultimately depends upon your +switch. If the switch can reliably fail ports in response to other +failures, then either the MII or ARP monitors should work. For +example, in the topology above, if the "port3" link fails at the remote +end, the MII monitor has no direct means to detect this. The ARP +monitor could be configured with a target at the remote end of port3, +thus detecting that failure without switch support. + +In general, however, in a multiple switch topology, the ARP +monitor can provide a higher level of reliability in detecting end to +end connectivity failures (which may be caused by the failure of any +individual component to pass traffic for any reason). Additionally, +the ARP monitor should be configured with multiple targets (at least +one for each switch in the network). This will ensure that, +regardless of which switch is active, the ARP monitor has a suitable +target to query. + +Note, also, that many switches now support a functionality +generally referred to as "trunk failover." This is a feature of the +switch that causes the link state of a particular switch port to be set +down (or up) when the state of another switch port goes down (or up). +Its purpose is to propagate link failures from logically "exterior" ports +to the logically "interior" ports that bonding is able to monitor via +miimon.
Availability and configuration for trunk failover varies by +switch, but this can be a viable alternative to the ARP monitor when using +suitable switches. + +12. Configuring Bonding for Maximum Throughput +============================================== + +12.1 Maximizing Throughput in a Single Switch Topology +------------------------------------------------------ + +In a single switch configuration, the best method to maximize +throughput depends upon the application and network environment. The +various load balancing modes each have strengths and weaknesses in +different environments, as detailed below. + +For this discussion, we will break down the topologies into +two categories. Depending upon the destination of most traffic, we +categorize them into either "gatewayed" or "local" configurations. + +In a gatewayed configuration, the "switch" is acting primarily +as a router, and the majority of traffic passes through this router to +other networks. An example would be the following:: + + + +----------+ +----------+ + | |eth0 port1| | to other networks + | Host A +---------------------+ router +-------------------> + | +---------------------+ | Hosts B and C are out + | |eth1 port2| | here somewhere + +----------+ +----------+ + +The router may be a dedicated router device, or another host +acting as a gateway. For our discussion, the important point is that +the majority of traffic from Host A will pass through the router to +some other network before reaching its final destination. + +In a gatewayed network configuration, although Host A may +communicate with many other systems, all of its traffic will be sent +and received via one other peer on the local network, the router. + +Note that the case of two systems connected directly via +multiple physical links is, for purposes of configuring bonding, the +same as a gatewayed configuration. 
In that case, it happens that all +traffic is destined for the "gateway" itself, not some other network +beyond the gateway. + +In a local configuration, the "switch" is acting primarily as +a switch, and the majority of traffic passes through this switch to +reach other stations on the same network. An example would be the +following:: + + +----------+ +----------+ +--------+ + | |eth0 port1| +-------+ Host B | + | Host A +------------+ switch |port3 +--------+ + | +------------+ | +--------+ + | |eth1 port2| +------------------+ Host C | + +----------+ +----------+port4 +--------+ + + +Again, the switch may be a dedicated switch device, or another +host acting as a gateway. For our discussion, the important point is +that the majority of traffic from Host A is destined for other hosts +on the same local network (Hosts B and C in the above example). + +In summary, in a gatewayed configuration, traffic to and from +the bonded device will be to the same MAC level peer on the network +(the gateway itself, i.e., the router), regardless of its final +destination. In a local configuration, traffic flows directly to and +from the final destinations, thus, each destination (Host B, Host C) +will be addressed directly by their individual MAC addresses. + +This distinction between a gatewayed and a local network +configuration is important because many of the load balancing modes +available use the MAC addresses of the local network source and +destination to make load balancing decisions. The behavior of each +mode is described below. + + +12.1.1 MT Bonding Mode Selection for Single Switch Topology +----------------------------------------------------------- + +This configuration is the easiest to set up and to understand, +although you will have to decide which bonding mode best suits your +needs. 
The trade offs for each mode are detailed below: + +balance-rr: + This mode is the only mode that will permit a single + TCP/IP connection to stripe traffic across multiple + interfaces. It is therefore the only mode that will allow a + single TCP/IP stream to utilize more than one interface's + worth of throughput. This comes at a cost, however: the + striping generally results in peer systems receiving packets out + of order, causing TCP/IP's congestion control system to kick + in, often by retransmitting segments. + + It is possible to adjust TCP/IP's congestion limits by + altering the net.ipv4.tcp_reordering sysctl parameter. The + usual default value is 3. But keep in mind TCP stack is able + to automatically increase this when it detects reorders. + + Note that the fraction of packets that will be delivered out of + order is highly variable, and is unlikely to be zero. The level + of reordering depends upon a variety of factors, including the + networking interfaces, the switch, and the topology of the + configuration. Speaking in general terms, higher speed network + cards produce more reordering (due to factors such as packet + coalescing), and a "many to many" topology will reorder at a + higher rate than a "many slow to one fast" configuration. + + Many switches do not support any modes that stripe traffic + (instead choosing a port based upon IP or MAC level addresses); + for those devices, traffic for a particular connection flowing + through the switch to a balance-rr bond will not utilize greater + than one interface's worth of bandwidth. + + If you are utilizing protocols other than TCP/IP, UDP for + example, and your application can tolerate out of order + delivery, then this mode can allow for single stream datagram + performance that scales near linearly as interfaces are added + to the bond. + + This mode requires the switch to have the appropriate ports + configured for "etherchannel" or "trunking." 
+ +active-backup: + There is not much advantage in this network topology to + the active-backup mode, as the inactive backup devices are all + connected to the same peer as the primary. In this case, a + load balancing mode (with link monitoring) will provide the + same level of network availability, but with increased + available bandwidth. On the plus side, active-backup mode + does not require any configuration of the switch, so it may + have value if the hardware available does not support any of + the load balance modes. + +balance-xor: + This mode will limit traffic such that packets destined + for specific peers will always be sent over the same + interface. Since the destination is determined by the MAC + addresses involved, this mode works best in a "local" network + configuration (as described above), with destinations all on + the same local network. This mode is likely to be suboptimal + if all your traffic is passed through a single router (i.e., a + "gatewayed" network configuration, as described above). + + As with balance-rr, the switch ports need to be configured for + "etherchannel" or "trunking." + +broadcast: + Like active-backup, there is not much advantage to this + mode in this type of network topology. + +802.3ad: + This mode can be a good choice for this type of network + topology. The 802.3ad mode is an IEEE standard, so all peers + that implement 802.3ad should interoperate well. The 802.3ad + protocol includes automatic configuration of the aggregates, + so minimal manual configuration of the switch is needed + (typically only to designate that some set of devices is + available for 802.3ad). The 802.3ad standard also mandates + that frames be delivered in order (within certain limits), so + in general single connections will not see misordering of + packets. The 802.3ad mode does have some drawbacks: the + standard mandates that all devices in the aggregate operate at + the same speed and duplex. 
Also, as with all bonding load + balance modes other than balance-rr, no single connection will + be able to utilize more than a single interface's worth of + bandwidth. + + Additionally, the Linux bonding 802.3ad implementation + distributes traffic by peer (using an XOR of MAC addresses + and packet type ID), so in a "gatewayed" configuration, all + outgoing traffic will generally use the same device. Incoming + traffic may also end up on a single device, but that is + dependent upon the balancing policy of the peer's 802.3ad + implementation. In a "local" configuration, traffic will be + distributed across the devices in the bond. + + Finally, the 802.3ad mode mandates the use of the MII monitor; + therefore, the ARP monitor is not available in this mode. + +balance-tlb: + The balance-tlb mode balances outgoing traffic by peer. + Since the balancing is done according to MAC address, in a + "gatewayed" configuration (as described above), this mode will + send all traffic across a single device. However, in a + "local" network configuration, this mode balances multiple + local network peers across devices in a vaguely intelligent + manner (not a simple XOR as in balance-xor or 802.3ad mode), + so that mathematically unlucky MAC addresses (i.e., ones that + XOR to the same value) will not all "bunch up" on a single + interface. + + Unlike 802.3ad, interfaces may be of differing speeds, and no + special switch configuration is required. On the down side, + in this mode all incoming traffic arrives over a single + interface, this mode requires certain ethtool support in the + network device driver of the slave interfaces, and the ARP + monitor is not available. + +balance-alb: + This mode is everything that balance-tlb is, and more. + It has all of the features (and restrictions) of balance-tlb, + and will also balance incoming traffic from local network + peers (as described in the Bonding Module Options section, + above).
+ + The only additional down side to this mode is that the network + device driver must support changing the hardware address while + the device is open. + +12.1.2 MT Link Monitoring for Single Switch Topology +---------------------------------------------------- + +The choice of link monitoring may largely depend upon which +mode you choose to use. The more advanced load balancing modes do not +support the use of the ARP monitor, and are thus restricted to using +the MII monitor (which does not provide as high a level of end to end +assurance as the ARP monitor). + +12.2 Maximum Throughput in a Multiple Switch Topology +----------------------------------------------------- + +Multiple switches may be utilized to optimize for throughput +when they are configured in parallel as part of an isolated network +between two or more systems, for example:: + + +-----------+ + | Host A | + +-+---+---+-+ + | | | + +--------+ | +---------+ + | | | + +------+---+ +-----+----+ +-----+----+ + | Switch A | | Switch B | | Switch C | + +------+---+ +-----+----+ +-----+----+ + | | | + +--------+ | +---------+ + | | | + +-+---+---+-+ + | Host B | + +-----------+ + +In this configuration, the switches are isolated from one +another. One reason to employ a topology such as this is for an +isolated network with many hosts (a cluster configured for high +performance, for example), using multiple smaller switches can be more +cost effective than a single larger switch, e.g., on a network with 24 +hosts, three 24 port switches can be significantly less expensive than +a single 72 port switch. + +If access beyond the network is required, an individual host +can be equipped with an additional network device connected to an +external network; this host then additionally acts as a gateway. 
+ +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + +In actual practice, the bonding mode typically employed in +configurations of this type is balance-rr. Historically, in this +network configuration, the usual caveats about out of order packet +delivery are mitigated by the use of network adapters that do not do +any kind of packet coalescing (via the use of NAPI, or because the +device itself does not generate interrupts until some number of +packets has arrived). When employed in this fashion, the balance-rr +mode allows individual connections between two hosts to effectively +utilize greater than one interface's bandwidth. + +12.2.2 MT Link Monitoring for Multiple Switch Topology +------------------------------------------------------ + +Again, in actual practice, the MII monitor is most often used +in this configuration, as performance is given preference over +availability. The ARP monitor will function in this topology, but its +advantages over the MII monitor are mitigated by the volume of probes +needed as the number of systems involved grows (remember that each +host in the network is configured with bonding). + +13. Switch Behavior Issues +========================== + +13.1 Link Establishment and Failover Delays +------------------------------------------- + +Some switches exhibit undesirable behavior with regard to the +timing of link up and down reporting by the switch. + +First, when a link comes up, some switches may indicate that +the link is up (carrier available), but not pass traffic over the +interface for some period of time. This delay is typically due to +some type of autonegotiation or routing protocol, but may also occur +during switch initialization (e.g., during recovery after a switch +failure). If you find this to be a problem, specify an appropriate +value to the updelay bonding module option to delay the use of the +relevant interface(s). 
+ +Second, some switches may "bounce" the link state one or more +times while a link is changing state. This occurs most commonly while +the switch is initializing. Again, an appropriate updelay value may +help. + +Note that when a bonding interface has no active links, the +driver will immediately reuse the first link that goes up, even if the +updelay parameter has been specified (the updelay is ignored in this +case). If there are slave interfaces waiting for the updelay timeout +to expire, the interface that first went into that state will be +immediately reused. This reduces down time of the network if the +value of updelay has been overestimated, and since this occurs only in +cases with no connectivity, there is no additional penalty for +ignoring the updelay. + +In addition to the concerns about switch timings, if your +switches take a long time to go into backup mode, it may be desirable +to not activate a backup interface immediately after a link goes down. +Failover may be delayed via the downdelay bonding module option. + +13.2 Duplicated Incoming Packets +-------------------------------- + +NOTE: Starting with version 3.0.2, the bonding driver has logic to +suppress duplicate packets, which should largely eliminate this problem. +The following description is kept for reference. + +It is not uncommon to observe a short burst of duplicated +traffic when the bonding device is first used, or after it has been +idle for some period of time. This is most easily observed by issuing +a "ping" to some other host on the network, and noticing that the +output from ping flags duplicates (typically one per slave). + +For example, on a bond in active-backup mode with five slaves +all connected to one switch, the output may appear as follows:: + + # ping -n 10.0.4.2 + PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms + 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms + 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms + +This is not due to an error in the bonding driver, rather, it +is a side effect of how many switches update their MAC forwarding +tables. Initially, the switch does not associate the MAC address in +the packet with a particular switch port, and so it may send the +traffic to all ports until its MAC forwarding table is updated. Since +the interfaces attached to the bond may occupy multiple ports on a +single switch, when the switch (temporarily) floods the traffic to all +ports, the bond device receives multiple copies of the same packet +(one per slave device). + +The duplicated packet behavior is switch dependent, some +switches exhibit this, and some do not. On switches that display this +behavior, it can be induced by clearing the MAC forwarding table (on +most Cisco switches, the privileged command "clear mac address-table +dynamic" will accomplish this). + +14. Hardware Specific Considerations +==================================== + +This section contains additional information for configuring +bonding on specific hardware platforms, or for interfacing bonding +with particular switches or other devices. + +14.1 IBM BladeCenter +-------------------- + +This applies to the JS20 and similar systems. + +On the JS20 blades, the bonding driver supports only +balance-rr, active-backup, balance-tlb and balance-alb modes. This is +largely due to the network topology inside the BladeCenter, detailed +below. + +JS20 network adapter information +-------------------------------- + +All JS20s come with two Broadcom Gigabit Ethernet ports +integrated on the planar (that's "motherboard" in IBM-speak). 
In the +BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to +I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. +An add-on Broadcom daughter card can be installed on a JS20 to provide +two more Gigabit Ethernet ports. These ports, eth2 and eth3, are +wired to I/O Modules 3 and 4, respectively. + +Each I/O Module may contain either a switch or a passthrough +module (which allows ports to be directly connected to an external +switch). Some bonding modes require a specific BladeCenter internal +network topology in order to function; these are detailed below. + +Additional BladeCenter-specific networking information can be +found in two IBM Redbooks (www.ibm.com/redbooks): + +- "IBM eServer BladeCenter Networking Options" +- "IBM eServer BladeCenter Layer 2-7 Network Switching" + +BladeCenter networking configuration +------------------------------------ + +Because a BladeCenter can be configured in a very large number +of ways, this discussion will be confined to describing basic +configurations. + +Normally, Ethernet Switch Modules (ESMs) are used in I/O +modules 1 and 2. In this configuration, the eth0 and eth1 ports of a +JS20 will be connected to different internal switches (in the +respective I/O modules). + +A passthrough module (OPM or CPM, optical or copper, +passthrough module) connects the I/O module directly to an external +switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 +interfaces of a JS20 can be redirected to the outside world and +connected to a common external switch. + +Depending upon the mix of ESMs and PMs, the network will +appear to bonding as either a single switch topology (all PMs) or as a +multiple switch topology (one or more ESMs, zero or more PMs). It is +also possible to connect ESMs together, resulting in a configuration +much like the example in "High Availability in a Multiple Switch +Topology," above. 
+ +Requirements for specific modes +------------------------------- + +The balance-rr mode requires the use of passthrough modules +for devices in the bond, all connected to a common external switch. +That switch must be configured for "etherchannel" or "trunking" on the +appropriate ports, as is usual for balance-rr. + +The balance-alb and balance-tlb modes will function with +either switch modules or passthrough modules (or a mix). The only +specific requirement for these modes is that all network interfaces +must be able to reach all destinations for traffic sent over the +bonding device (i.e., the network must converge at some point outside +the BladeCenter). + +The active-backup mode has no additional requirements. + +Link monitoring issues +---------------------- + +When an Ethernet Switch Module is in place, only the ARP +monitor will reliably detect link loss to an external switch. This is +nothing unusual, but examination of the BladeCenter cabinet would +suggest that the "external" network ports are the ethernet ports for +the system, when in fact there is a switch between these "external" +ports and the devices on the JS20 system itself. The MII monitor is +only able to detect link failures between the ESM and the JS20 system. + +When a passthrough module is in place, the MII monitor does +detect failures to the "external" port, which is then directly +connected to the JS20 system. + +Other concerns +-------------- + +The Serial Over LAN (SoL) link is established over the primary +ethernet (eth0) only, therefore, any loss of link to eth0 will result +in losing your SoL connection. It will not fail over with other +network traffic, as the SoL system is beyond the control of the +bonding driver. + +It may be desirable to disable spanning tree on the switch +(either the internal Ethernet Switch Module, or an external switch) to +avoid fail-over delay issues when using bonding. + + +15. Frequently Asked Questions +============================== + +1.
Is it SMP safe? +------------------- + +Yes. The old 2.0.xx channel bonding patch was not SMP safe. +The new driver was designed to be SMP safe from the start. + +2. What type of cards will work with it? +----------------------------------------- + +Any Ethernet type cards (you can even mix cards - an Intel +EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, +devices need not be of the same speed. + +Starting with version 3.2.1, bonding also supports Infiniband +slaves in active-backup mode. + +3. How many bonding devices can I have? +---------------------------------------- + +There is no limit. + +4. How many slaves can a bonding device have? +---------------------------------------------- + +This is limited only by the number of network interfaces Linux +supports and/or the number of network cards you can place in your +system. + +5. What happens when a slave link dies? +---------------------------------------- + +If link monitoring is enabled, then the failing device will be +disabled. The active-backup mode will fail over to a backup link, and +other modes will ignore the failed link. The link will continue to be +monitored, and should it recover, it will rejoin the bond (in whatever +manner is appropriate for the mode). See the sections on High +Availability and the documentation for each mode for additional +information. + +Link monitoring can be enabled via either the miimon or +arp_interval parameters (described in the module parameters section, +above). In general, miimon monitors the carrier state as sensed by +the underlying network device, and the arp monitor (arp_interval) +monitors connectivity to another host on the local network. + +If no link monitoring is configured, the bonding driver will +be unable to detect link failures, and will assume that all links are +always available. This will likely result in lost packets, and a +resulting degradation of performance.
The precise performance loss +depends upon the bonding mode and network configuration. + +6. Can bonding be used for High Availability? +---------------------------------------------- + +Yes. See the section on High Availability for details. + +7. Which switches/systems does it work with? +--------------------------------------------- + +The full answer to this depends upon the desired mode. + +In the basic balance modes (balance-rr and balance-xor), it +works with any system that supports etherchannel (also called +trunking). Most managed switches currently available have such +support, and many unmanaged switches as well. + +The advanced balance modes (balance-tlb and balance-alb) do +not have special switch requirements, but do need device drivers that +support specific features (described in the appropriate section under +module parameters, above). + +In 802.3ad mode, it works with systems that support IEEE +802.3ad Dynamic Link Aggregation. Most managed and many unmanaged +switches currently available support 802.3ad. + +The active-backup mode should work with any Layer-II switch. + +8. Where does a bonding device get its MAC address from? +--------------------------------------------------------- + +When using slave devices that have fixed MAC addresses, or when +the fail_over_mac option is enabled, the bonding device's MAC address is +the MAC address of the active slave. + +For other configurations, if not explicitly configured (with +ifconfig or ip link), the MAC address of the bonding device is taken from +its first slave device. This MAC address is then passed to all following +slaves and remains persistent (even if the first slave is removed) until +the bonding device is brought down or reconfigured. 
+ +If you wish to change the MAC address, you can set it with +ifconfig or ip link:: + + # ifconfig bond0 hw ether 00:11:22:33:44:55 + + # ip link set bond0 address 66:77:88:99:aa:bb + +The MAC address can be also changed by bringing down/up the +device and then changing its slaves (or their order):: + + # ifconfig bond0 down ; modprobe -r bonding + # ifconfig bond0 .... up + # ifenslave bond0 eth... + +This method will automatically take the address from the next +slave that is added. + +To restore your slaves' MAC addresses, you need to detach them +from the bond (``ifenslave -d bond0 eth0``). The bonding driver will +then restore the MAC addresses that the slaves had before they were +enslaved. + +16. Resources and Links +======================= + +The latest version of the bonding driver can be found in the latest +version of the linux kernel, found on http://kernel.org + +The latest version of this document can be found in the latest kernel +source (named Documentation/networking/bonding.rst). + +Discussions regarding the usage of the bonding driver take place on the +bonding-devel mailing list, hosted at sourceforge.net. If you have questions or +problems, post them to the list. The list address is: + +bonding-devel@lists.sourceforge.net + +The administrative interface (to subscribe or unsubscribe) can +be found at: + +https://lists.sourceforge.net/lists/listinfo/bonding-devel + +Discussions regarding the development of the bonding driver take place +on the main Linux network mailing list, hosted at vger.kernel.org. The list +address is: + +netdev@vger.kernel.org + +The administrative interface (to subscribe or unsubscribe) can +be found at: + +http://vger.kernel.org/vger-lists.html#netdev + +Donald Becker's Ethernet Drivers and diag programs may be found at : + + - http://web.archive.org/web/%2E/http://www.scyld.com/network/ + +You will also find a lot of information regarding Ethernet, NWay, MII, +etc. at www.scyld.com. 
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt deleted file mode 100644 index e3abfbd32f71..000000000000 --- a/Documentation/networking/bonding.txt +++ /dev/null @@ -1,2837 +0,0 @@ - - Linux Ethernet Bonding Driver HOWTO - - Latest update: 27 April 2011 - -Initial release : Thomas Davis -Corrections, HA extensions : 2000/10/03-15 : - - Willy Tarreau - - Constantine Gavrilov - - Chad N. Tindel - - Janice Girouard - - Jay Vosburgh - -Reorganized and updated Feb 2005 by Jay Vosburgh -Added Sysfs information: 2006/04/24 - - Mitch Williams - -Introduction -============ - - The Linux bonding driver provides a method for aggregating -multiple network interfaces into a single logical "bonded" interface. -The behavior of the bonded interfaces depends upon the mode; generally -speaking, modes provide either hot standby or load balancing services. -Additionally, link integrity monitoring may be performed. - - The bonding driver originally came from Donald Becker's -beowulf patches for kernel 2.0. It has changed quite a bit since, and -the original tools from extreme-linux and beowulf sites will not work -with this version of the driver. - - For new versions of the driver, updated userspace tools, and -who to ask for help, please follow the links at the end of this file. - -Table of Contents -================= - -1. Bonding Driver Installation - -2. Bonding Driver Options - -3. 
Configuring Bonding Devices -3.1 Configuration with Sysconfig Support -3.1.1 Using DHCP with Sysconfig -3.1.2 Configuring Multiple Bonds with Sysconfig -3.2 Configuration with Initscripts Support -3.2.1 Using DHCP with Initscripts -3.2.2 Configuring Multiple Bonds with Initscripts -3.3 Configuring Bonding Manually with Ifenslave -3.3.1 Configuring Multiple Bonds Manually -3.4 Configuring Bonding Manually via Sysfs -3.5 Configuration with Interfaces Support -3.6 Overriding Configuration for Special Cases -3.7 Configuring LACP for 802.3ad mode in a more secure way - -4. Querying Bonding Configuration -4.1 Bonding Configuration -4.2 Network Configuration - -5. Switch Configuration - -6. 802.1q VLAN Support - -7. Link Monitoring -7.1 ARP Monitor Operation -7.2 Configuring Multiple ARP Targets -7.3 MII Monitor Operation - -8. Potential Trouble Sources -8.1 Adventures in Routing -8.2 Ethernet Device Renaming -8.3 Painfully Slow Or No Failed Link Detection By Miimon - -9. SNMP agents - -10. Promiscuous mode - -11. Configuring Bonding for High Availability -11.1 High Availability in a Single Switch Topology -11.2 High Availability in a Multiple Switch Topology -11.2.1 HA Bonding Mode Selection for Multiple Switch Topology -11.2.2 HA Link Monitoring for Multiple Switch Topology - -12. Configuring Bonding for Maximum Throughput -12.1 Maximum Throughput in a Single Switch Topology -12.1.1 MT Bonding Mode Selection for Single Switch Topology -12.1.2 MT Link Monitoring for Single Switch Topology -12.2 Maximum Throughput in a Multiple Switch Topology -12.2.1 MT Bonding Mode Selection for Multiple Switch Topology -12.2.2 MT Link Monitoring for Multiple Switch Topology - -13. Switch Behavior Issues -13.1 Link Establishment and Failover Delays -13.2 Duplicated Incoming Packets - -14. Hardware Specific Considerations -14.1 IBM BladeCenter - -15. Frequently Asked Questions - -16. Resources and Links - - -1. 
Bonding Driver Installation -============================== - - Most popular distro kernels ship with the bonding driver -already available as a module. If your distro does not, or you -have need to compile bonding from source (e.g., configuring and -installing a mainline kernel from kernel.org), you'll need to perform -the following steps: - -1.1 Configure and build the kernel with bonding ------------------------------------------------ - - The current version of the bonding driver is available in the -drivers/net/bonding subdirectory of the most recent kernel source -(which is available on http://kernel.org). Most users "rolling their -own" will want to use the most recent kernel from kernel.org. - - Configure kernel with "make menuconfig" (or "make xconfig" or -"make config"), then select "Bonding driver support" in the "Network -device support" section. It is recommended that you configure the -driver as module since it is currently the only way to pass parameters -to the driver or configure more than one bonding device. - - Build and install the new kernel and modules. - -1.2 Bonding Control Utility -------------------------------------- - - It is recommended to configure bonding via iproute2 (netlink) -or sysfs, the old ifenslave control utility is obsolete. - -2. Bonding Driver Options -========================= - - Options for the bonding driver are supplied as parameters to the -bonding module at load time, or are specified via sysfs. - - Module options may be given as command line arguments to the -insmod or modprobe command, but are usually specified in either the -/etc/modprobe.d/*.conf configuration files, or in a distro-specific -configuration file (some of which are detailed in the next section). - - Details on bonding support for sysfs is provided in the -"Configuring Bonding Manually via Sysfs" section, below. - - The available bonding driver parameters are listed below. If a -parameter is not specified the default value is used. 
When initially -configuring a bond, it is recommended "tail -f /var/log/messages" be -run in a separate window to watch for bonding driver error messages. - - It is critical that either the miimon or arp_interval and -arp_ip_target parameters be specified, otherwise serious network -degradation will occur during link failures. Very few devices do not -support at least miimon, so there is really no reason not to use it. - - Options with textual values will accept either the text name -or, for backwards compatibility, the option value. E.g., -"mode=802.3ad" and "mode=4" set the same mode. - - The parameters are as follows: - -active_slave - - Specifies the new active slave for modes that support it - (active-backup, balance-alb and balance-tlb). Possible values - are the name of any currently enslaved interface, or an empty - string. If a name is given, the slave and its link must be up in order - to be selected as the new active slave. If an empty string is - specified, the current active slave is cleared, and a new active - slave is selected automatically. - - Note that this is only available through the sysfs interface. No module - parameter by this name exists. - - The normal value of this option is the name of the currently - active slave, or the empty string if there is no active slave or - the current mode does not use an active slave. - -ad_actor_sys_prio - - In an AD system, this specifies the system priority. The allowed range - is 1 - 65535. If the value is not specified, it takes 65535 as the - default value. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -ad_actor_system - - In an AD system, this specifies the mac-address for the actor in - protocol packet exchanges (LACPDUs). The value cannot be NULL or - multicast. It is preferred to have the local-admin bit set for this - mac but driver does not enforce it. 
If the value is not given then - system defaults to using the masters' mac address as actors' system - address. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -ad_select - - Specifies the 802.3ad aggregation selection logic to use. The - possible values and their effects are: - - stable or 0 - - The active aggregator is chosen by largest aggregate - bandwidth. - - Reselection of the active aggregator occurs only when all - slaves of the active aggregator are down or the active - aggregator has no slaves. - - This is the default value. - - bandwidth or 1 - - The active aggregator is chosen by largest aggregate - bandwidth. Reselection occurs if: - - - A slave is added to or removed from the bond - - - Any slave's link state changes - - - Any slave's 802.3ad association state changes - - - The bond's administrative state changes to up - - count or 2 - - The active aggregator is chosen by the largest number of - ports (slaves). Reselection occurs as described under the - "bandwidth" setting, above. - - The bandwidth and count selection policies permit failover of - 802.3ad aggregations when partial failure of the active aggregator - occurs. This keeps the aggregator with the highest availability - (either in bandwidth or in number of ports) active at all times. - - This option was added in bonding version 3.4.0. - -ad_user_port_key - - In an AD system, the port-key has three parts as shown below - - - Bits Use - 00 Duplex - 01-05 Speed - 06-15 User-defined - - This defines the upper 10 bits of the port key. The values can be - from 0 - 1023. If not given, the system defaults to 0. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -all_slaves_active - - Specifies that duplicate frames (received on inactive ports) should be - dropped (0) or delivered (1). - - Normally, bonding will drop duplicate frames (received on inactive - ports), which is desirable for most users. 
But there are some times - it is nice to allow duplicate frames to be delivered. - - The default value is 0 (drop duplicate frames received on inactive - ports). - -arp_interval - - Specifies the ARP link monitoring frequency in milliseconds. - - The ARP monitor works by periodically checking the slave - devices to determine whether they have sent or received - traffic recently (the precise criteria depends upon the - bonding mode, and the state of the slave). Regular traffic is - generated via ARP probes issued for the addresses specified by - the arp_ip_target option. - - This behavior can be modified by the arp_validate option, - below. - - If ARP monitoring is used in an etherchannel compatible mode - (modes 0 and 2), the switch should be configured in a mode - that evenly distributes packets across all links. If the - switch is configured to distribute the packets in an XOR - fashion, all replies from the ARP targets will be received on - the same link which could cause the other team members to - fail. ARP monitoring should not be used in conjunction with - miimon. A value of 0 disables ARP monitoring. The default - value is 0. - -arp_ip_target - - Specifies the IP addresses to use as ARP monitoring peers when - arp_interval is > 0. These are the targets of the ARP request - sent to determine the health of the link to the targets. - Specify these values in ddd.ddd.ddd.ddd format. Multiple IP - addresses must be separated by a comma. At least one IP - address must be given for ARP monitoring to function. The - maximum number of targets that can be specified is 16. The - default value is no IP addresses. - -arp_validate - - Specifies whether or not ARP probes and replies should be - validated in any mode that supports arp monitoring, or whether - non-ARP traffic should be filtered (disregarded) for link - monitoring purposes. - - Possible values are: - - none or 0 - - No validation or filtering is performed. 
- - active or 1 - - Validation is performed only for the active slave. - - backup or 2 - - Validation is performed only for backup slaves. - - all or 3 - - Validation is performed for all slaves. - - filter or 4 - - Filtering is applied to all slaves. No validation is - performed. - - filter_active or 5 - - Filtering is applied to all slaves, validation is performed - only for the active slave. - - filter_backup or 6 - - Filtering is applied to all slaves, validation is performed - only for backup slaves. - - Validation: - - Enabling validation causes the ARP monitor to examine the incoming - ARP requests and replies, and only consider a slave to be up if it - is receiving the appropriate ARP traffic. - - For an active slave, the validation checks ARP replies to confirm - that they were generated by an arp_ip_target. Since backup slaves - do not typically receive these replies, the validation performed - for backup slaves is on the broadcast ARP request sent out via the - active slave. It is possible that some switch or network - configurations may result in situations wherein the backup slaves - do not receive the ARP requests; in such a situation, validation - of backup slaves must be disabled. - - The validation of ARP requests on backup slaves is mainly helping - bonding to decide which slaves are more likely to work in case of - the active slave failure, it doesn't really guarantee that the - backup slave will work if it's selected as the next active slave. - - Validation is useful in network configurations in which multiple - bonding hosts are concurrently issuing ARPs to one or more targets - beyond a common switch. Should the link between the switch and - target fail (but not the switch itself), the probe traffic - generated by the multiple bonding instances will fool the standard - ARP monitor into considering the links as still up. 
Use of - validation can resolve this, as the ARP monitor will only consider - ARP requests and replies associated with its own instance of - bonding. - - Filtering: - - Enabling filtering causes the ARP monitor to only use incoming ARP - packets for link availability purposes. Arriving packets that are - not ARPs are delivered normally, but do not count when determining - if a slave is available. - - Filtering operates by only considering the reception of ARP - packets (any ARP packet, regardless of source or destination) when - determining if a slave has received traffic for link availability - purposes. - - Filtering is useful in network configurations in which significant - levels of third party broadcast traffic would fool the standard - ARP monitor into considering the links as still up. Use of - filtering can resolve this, as only ARP traffic is considered for - link availability purposes. - - This option was added in bonding version 3.1.0. - -arp_all_targets - - Specifies the quantity of arp_ip_targets that must be reachable - in order for the ARP monitor to consider a slave as being up. - This option affects only active-backup mode for slaves with - arp_validation enabled. - - Possible values are: - - any or 0 - - consider the slave up only when any of the arp_ip_targets - is reachable - - all or 1 - - consider the slave up only when all of the arp_ip_targets - are reachable - -downdelay - - Specifies the time, in milliseconds, to wait before disabling - a slave after a link failure has been detected. This option - is only valid for the miimon link monitor. The downdelay - value should be a multiple of the miimon value; if not, it - will be rounded down to the nearest multiple. The default - value is 0. 
- -fail_over_mac - - Specifies whether active-backup mode should set all slaves to - the same MAC address at enslavement (the traditional - behavior), or, when enabled, perform special handling of the - bond's MAC address in accordance with the selected policy. - - Possible values are: - - none or 0 - - This setting disables fail_over_mac, and causes - bonding to set all slaves of an active-backup bond to - the same MAC address at enslavement time. This is the - default. - - active or 1 - - The "active" fail_over_mac policy indicates that the - MAC address of the bond should always be the MAC - address of the currently active slave. The MAC - address of the slaves is not changed; instead, the MAC - address of the bond changes during a failover. - - This policy is useful for devices that cannot ever - alter their MAC address, or for devices that refuse - incoming broadcasts with their own source MAC (which - interferes with the ARP monitor). - - The down side of this policy is that every device on - the network must be updated via gratuitous ARP, - vs. just updating a switch or set of switches (which - often takes place for any traffic, not just ARP - traffic, if the switch snoops incoming traffic to - update its tables) for the traditional method. If the - gratuitous ARP is lost, communication may be - disrupted. - - When this policy is used in conjunction with the mii - monitor, devices which assert link up prior to being - able to actually transmit and receive are particularly - susceptible to loss of the gratuitous ARP, and an - appropriate updelay setting may be required. - - follow or 2 - - The "follow" fail_over_mac policy causes the MAC - address of the bond to be selected normally (normally - the MAC address of the first slave added to the bond). 
- However, the second and subsequent slaves are not set - to this MAC address while they are in a backup role; a - slave is programmed with the bond's MAC address at - failover time (and the formerly active slave receives - the newly active slave's MAC address). - - This policy is useful for multiport devices that - either become confused or incur a performance penalty - when multiple ports are programmed with the same MAC - address. - - - The default policy is none, unless the first slave cannot - change its MAC address, in which case the active policy is - selected by default. - - This option may be modified via sysfs only when no slaves are - present in the bond. - - This option was added in bonding version 3.2.0. The "follow" - policy was added in bonding version 3.3.0. - -lacp_rate - - Option specifying the rate in which we'll ask our link partner - to transmit LACPDU packets in 802.3ad mode. Possible values - are: - - slow or 0 - Request partner to transmit LACPDUs every 30 seconds - - fast or 1 - Request partner to transmit LACPDUs every 1 second - - The default is slow. - -max_bonds - - Specifies the number of bonding devices to create for this - instance of the bonding driver. E.g., if max_bonds is 3, and - the bonding driver is not already loaded, then bond0, bond1 - and bond2 will be created. The default value is 1. Specifying - a value of 0 will load bonding, but will not create any devices. - -miimon - - Specifies the MII link monitoring frequency in milliseconds. - This determines how often the link state of each slave is - inspected for link failures. A value of zero disables MII - link monitoring. A value of 100 is a good starting point. - The use_carrier option, below, affects how the link state is - determined. See the High Availability section for additional - information. The default value is 0. - -min_links - - Specifies the minimum number of links that must be active before - asserting carrier. 
It is similar to the Cisco EtherChannel min-links - feature. This allows setting the minimum number of member ports that - must be up (link-up state) before marking the bond device as up - (carrier on). This is useful for situations where higher level services - such as clustering want to ensure a minimum number of low bandwidth - links are active before switchover. This option only affects 802.3ad - mode. - - The default value is 0. This will cause carrier to be asserted (for - 802.3ad mode) whenever there is an active aggregator, regardless of the - number of available links in that aggregator. Note that, because an - aggregator cannot be active without at least one available link, - setting this option to 0 or to 1 has the exact same effect. - -mode - - Specifies one of the bonding policies. The default is - balance-rr (round robin). Possible values are: - - balance-rr or 0 - - Round-robin policy: Transmit packets in sequential - order from the first available slave through the - last. This mode provides load balancing and fault - tolerance. - - active-backup or 1 - - Active-backup policy: Only one slave in the bond is - active. A different slave becomes active if, and only - if, the active slave fails. The bond's MAC address is - externally visible on only one port (network adapter) - to avoid confusing the switch. - - In bonding version 2.6.2 or later, when a failover - occurs in active-backup mode, bonding will issue one - or more gratuitous ARPs on the newly active slave. - One gratuitous ARP is issued for the bonding master - interface and each VLAN interface configured above - it, provided that the interface has at least one IP - address configured. Gratuitous ARPs issued for VLAN - interfaces are tagged with the appropriate VLAN id. - - This mode provides fault tolerance. The primary - option, documented below, affects the behavior of this - mode. - - balance-xor or 2 - - XOR policy: Transmit based on the selected transmit - hash policy.
The default policy is a simple [(source - MAC address XOR'd with destination MAC address XOR - packet type ID) modulo slave count]. Alternate transmit - policies may be selected via the xmit_hash_policy option, - described below. - - This mode provides load balancing and fault tolerance. - - broadcast or 3 - - Broadcast policy: transmits everything on all slave - interfaces. This mode provides fault tolerance. - - 802.3ad or 4 - - IEEE 802.3ad Dynamic link aggregation. Creates - aggregation groups that share the same speed and - duplex settings. Utilizes all slaves in the active - aggregator according to the 802.3ad specification. - - Slave selection for outgoing traffic is done according - to the transmit hash policy, which may be changed from - the default simple XOR policy via the xmit_hash_policy - option, documented below. Note that not all transmit - policies may be 802.3ad compliant, particularly in - regards to the packet mis-ordering requirements of - section 43.2.4 of the 802.3ad standard. Differing - peer implementations will have varying tolerances for - noncompliance. - - Prerequisites: - - 1. Ethtool support in the base drivers for retrieving - the speed and duplex of each slave. - - 2. A switch that supports IEEE 802.3ad Dynamic link - aggregation. - - Most switches will require some type of configuration - to enable 802.3ad mode. - - balance-tlb or 5 - - Adaptive transmit load balancing: channel bonding that - does not require any special switch support. - - In tlb_dynamic_lb=1 mode; the outgoing traffic is - distributed according to the current load (computed - relative to the speed) on each slave. - - In tlb_dynamic_lb=0 mode; the load balancing based on - current load is disabled and the load is distributed - only using the hash distribution. - - Incoming traffic is received by the current slave. - If the receiving slave fails, another slave takes over - the MAC address of the failed receiving slave. 
- - Prerequisite: - - Ethtool support in the base drivers for retrieving the - speed of each slave. - - balance-alb or 6 - - Adaptive load balancing: includes balance-tlb plus - receive load balancing (rlb) for IPV4 traffic, and - does not require any special switch support. The - receive load balancing is achieved by ARP negotiation. - The bonding driver intercepts the ARP Replies sent by - the local system on their way out and overwrites the - source hardware address with the unique hardware - address of one of the slaves in the bond such that - different peers use different hardware addresses for - the server. - - Receive traffic from connections created by the server - is also balanced. When the local system sends an ARP - Request the bonding driver copies and saves the peer's - IP information from the ARP packet. When the ARP - Reply arrives from the peer, its hardware address is - retrieved and the bonding driver initiates an ARP - reply to this peer assigning it to one of the slaves - in the bond. A problematic outcome of using ARP - negotiation for balancing is that each time that an - ARP request is broadcast it uses the hardware address - of the bond. Hence, peers learn the hardware address - of the bond and the balancing of receive traffic - collapses to the current slave. This is handled by - sending updates (ARP Replies) to all the peers with - their individually assigned hardware address such that - the traffic is redistributed. Receive traffic is also - redistributed when a new slave is added to the bond - and when an inactive slave is re-activated. The - receive load is distributed sequentially (round robin) - among the group of highest speed slaves in the bond. - - When a link is reconnected or a new slave joins the - bond the receive traffic is redistributed among all - active slaves in the bond by initiating ARP Replies - with the selected MAC address to each of the - clients. 
The updelay parameter (detailed below) must - be set to a value equal or greater than the switch's - forwarding delay so that the ARP Replies sent to the - peers will not be blocked by the switch. - - Prerequisites: - - 1. Ethtool support in the base drivers for retrieving - the speed of each slave. - - 2. Base driver support for setting the hardware - address of a device while it is open. This is - required so that there will always be one slave in the - team using the bond hardware address (the - curr_active_slave) while having a unique hardware - address for each slave in the bond. If the - curr_active_slave fails its hardware address is - swapped with the new curr_active_slave that was - chosen. - -num_grat_arp -num_unsol_na - - Specify the number of peer notifications (gratuitous ARPs and - unsolicited IPv6 Neighbor Advertisements) to be issued after a - failover event. As soon as the link is up on the new slave - (possibly immediately) a peer notification is sent on the - bonding device and each VLAN sub-device. This is repeated at - the rate specified by peer_notif_delay if the number is - greater than 1. - - The valid range is 0 - 255; the default value is 1. These options - affect only the active-backup mode. These options were added for - bonding versions 3.3.0 and 3.4.0 respectively. - - From Linux 3.0 and bonding version 3.7.1, these notifications - are generated by the ipv4 and ipv6 code and the numbers of - repetitions cannot be set independently. - -packets_per_slave - - Specify the number of packets to transmit through a slave before - moving to the next one. When set to 0 then a slave is chosen at - random. - - The valid range is 0 - 65535; the default value is 1. This option - has effect only in balance-rr mode. - -peer_notif_delay - - Specify the delay, in milliseconds, between each peer - notification (gratuitous ARP and unsolicited IPv6 Neighbor - Advertisement) when they are issued after a failover event. 
- This delay should be a multiple of the link monitor interval - (arp_interval or miimon, whichever is active). The default - value is 0 which means to match the value of the link monitor - interval. - -primary - - A string (eth0, eth2, etc) specifying which slave is the - primary device. The specified device will always be the - active slave while it is available. Only when the primary is - off-line will alternate devices be used. This is useful when - one slave is preferred over another, e.g., when one slave has - higher throughput than another. - - The primary option is only valid for active-backup(1), - balance-tlb (5) and balance-alb (6) mode. - -primary_reselect - - Specifies the reselection policy for the primary slave. This - affects how the primary slave is chosen to become the active slave - when failure of the active slave or recovery of the primary slave - occurs. This option is designed to prevent flip-flopping between - the primary slave and other slaves. Possible values are: - - always or 0 (default) - - The primary slave becomes the active slave whenever it - comes back up. - - better or 1 - - The primary slave becomes the active slave when it comes - back up, if the speed and duplex of the primary slave is - better than the speed and duplex of the current active - slave. - - failure or 2 - - The primary slave becomes the active slave only if the - current active slave fails and the primary slave is up. - - The primary_reselect setting is ignored in two cases: - - If no slaves are active, the first slave to recover is - made the active slave. - - When initially enslaved, the primary slave is always made - the active slave. - - Changing the primary_reselect policy via sysfs will cause an - immediate selection of the best active slave according to the new - policy. This may or may not result in a change of the active - slave, depending upon the circumstances. - - This option was added for bonding version 3.6.0. 
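A minimal sketch of changing the primary_reselect policy at run time through sysfs, assuming a bond named bond0 and the usual /sys/class/net/<bond>/bonding/ attribute directory. The helper set_bond_option and the SYSFS_ROOT variable are hypothetical conveniences for this example (the variable only makes the path overridable for a dry run); writing to the real sysfs file requires root.

```shell
#!/bin/sh
# Root of the sysfs tree; overridable so the sketch can be tried
# against a scratch directory (an assumption made for this example).
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Hypothetical helper: write a value to a bonding option of a bond.
# $1 = bond device, $2 = option name, $3 = value
set_bond_option() {
    echo "$3" > "$SYSFS_ROOT/class/net/$1/bonding/$2"
}

# Example (as root): let the primary take over only after the
# current active slave fails.
# set_bond_option bond0 primary_reselect failure
```

As noted above, such a write triggers an immediate reselection, so the active slave may or may not change as a result.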
- -tlb_dynamic_lb - - Specifies if dynamic shuffling of flows is enabled in tlb - mode. The value has no effect on any other modes. - - The default behavior of tlb mode is to shuffle active flows across - slaves based on the load in that interval. This gives nice lb - characteristics but can cause packet reordering. If re-ordering is - a concern use this variable to disable flow shuffling and rely on - load balancing provided solely by the hash distribution. - xmit-hash-policy can be used to select the appropriate hashing for - the setup. - - The sysfs entry can be used to change the setting per bond device - and the initial value is derived from the module parameter. The - sysfs entry is allowed to be changed only if the bond device is - down. - - The default value is "1" that enables flow shuffling while value "0" - disables it. This option was added in bonding driver 3.7.1 - - -updelay - - Specifies the time, in milliseconds, to wait before enabling a - slave after a link recovery has been detected. This option is - only valid for the miimon link monitor. The updelay value - should be a multiple of the miimon value; if not, it will be - rounded down to the nearest multiple. The default value is 0. - -use_carrier - - Specifies whether or not miimon should use MII or ETHTOOL - ioctls vs. netif_carrier_ok() to determine the link - status. The MII or ETHTOOL ioctls are less efficient and - utilize a deprecated calling sequence within the kernel. The - netif_carrier_ok() relies on the device driver to maintain its - state with netif_carrier_on/off; at this writing, most, but - not all, device drivers support this facility. - - If bonding insists that the link is up when it should not be, - it may be that your network device driver does not support - netif_carrier_on/off. The default state for netif_carrier is - "carrier on," so if a driver does not support netif_carrier, - it will appear as if the link is always up. 
In this case, - setting use_carrier to 0 will cause bonding to revert to the - MII / ETHTOOL ioctl method to determine the link state. - - A value of 1 enables the use of netif_carrier_ok(), a value of - 0 will use the deprecated MII / ETHTOOL ioctls. The default - value is 1. - -xmit_hash_policy - - Selects the transmit hash policy to use for slave selection in - balance-xor, 802.3ad, and tlb modes. Possible values are: - - layer2 - - Uses XOR of hardware MAC addresses and packet type ID - field to generate the hash. The formula is - - hash = source MAC XOR destination MAC XOR packet type ID - slave number = hash modulo slave count - - This algorithm will place all traffic to a particular - network peer on the same slave. - - This algorithm is 802.3ad compliant. - - layer2+3 - - This policy uses a combination of layer2 and layer3 - protocol information to generate the hash. - - Uses XOR of hardware MAC addresses and IP addresses to - generate the hash. The formula is - - hash = source MAC XOR destination MAC XOR packet type ID - hash = hash XOR source IP XOR destination IP - hash = hash XOR (hash RSHIFT 16) - hash = hash XOR (hash RSHIFT 8) - And then hash is reduced modulo slave count. - - If the protocol is IPv6 then the source and destination - addresses are first hashed using ipv6_addr_hash. - - This algorithm will place all traffic to a particular - network peer on the same slave. For non-IP traffic, - the formula is the same as for the layer2 transmit - hash policy. - - This policy is intended to provide a more balanced - distribution of traffic than layer2 alone, especially - in environments where a layer3 gateway device is - required to reach most destinations. - - This algorithm is 802.3ad compliant. - - layer3+4 - - This policy uses upper layer protocol information, - when available, to generate the hash. This allows for - traffic to a particular network peer to span multiple - slaves, although a single connection will not span - multiple slaves. 
- - The formula for unfragmented TCP and UDP packets is - - hash = source port, destination port (as in the header) - hash = hash XOR source IP XOR destination IP - hash = hash XOR (hash RSHIFT 16) - hash = hash XOR (hash RSHIFT 8) - And then hash is reduced modulo slave count. - - If the protocol is IPv6 then the source and destination - addresses are first hashed using ipv6_addr_hash. - - For fragmented TCP or UDP packets and all other IPv4 and - IPv6 protocol traffic, the source and destination port - information is omitted. For non-IP traffic, the - formula is the same as for the layer2 transmit hash - policy. - - This algorithm is not fully 802.3ad compliant. A - single TCP or UDP conversation containing both - fragmented and unfragmented packets will see packets - striped across two interfaces. This may result in out - of order delivery. Most traffic types will not meet - this criteria, as TCP rarely fragments traffic, and - most UDP traffic is not involved in extended - conversations. Other implementations of 802.3ad may - or may not tolerate this noncompliance. - - encap2+3 - - This policy uses the same formula as layer2+3 but it - relies on skb_flow_dissect to obtain the header fields - which might result in the use of inner headers if an - encapsulation protocol is used. For example this will - improve the performance for tunnel users because the - packets will be distributed according to the encapsulated - flows. - - encap3+4 - - This policy uses the same formula as layer3+4 but it - relies on skb_flow_dissect to obtain the header fields - which might result in the use of inner headers if an - encapsulation protocol is used. For example this will - improve the performance for tunnel users because the - packets will be distributed according to the encapsulated - flows. - - The default value is layer2. This option was added in bonding - version 2.6.3. In earlier versions of bonding, this parameter - does not exist, and the layer2 policy is the only policy. 
The - layer2+3 value was added for bonding version 3.2.2. - -resend_igmp - - Specifies the number of IGMP membership reports to be issued after - a failover event. One membership report is issued immediately after - the failover; subsequent reports are sent at 200ms intervals. - - The valid range is 0 - 255; the default value is 1. A value of 0 - prevents the IGMP membership report from being issued in response - to the failover event. - - This option is useful for bonding modes balance-rr (0), active-backup - (1), balance-tlb (5) and balance-alb (6), in which a failover can - switch the IGMP traffic from one slave to another. Therefore a fresh - IGMP report must be issued to cause the switch to forward the incoming - IGMP traffic over the newly selected slave. - - This option was added for bonding version 3.7.0. - -lp_interval - - Specifies the number of seconds between instances where the bonding - driver sends learning packets to each slave's peer switch. - - The valid range is 1 - 0x7fffffff; the default value is 1. This option - has effect only in balance-tlb and balance-alb modes. - -3. Configuring Bonding Devices -============================== - - You can configure bonding using either your distro's network -initialization scripts, or manually using either iproute2 or the -sysfs interface. Distros generally use one of three packages for the -network initialization scripts: initscripts, sysconfig or interfaces. -Recent versions of these packages have support for bonding, while older -versions do not. - - We will first describe the options for configuring bonding for -distros using versions of initscripts, sysconfig and interfaces with full -or partial support for bonding, then provide information on enabling -bonding without support from the network initialization scripts (i.e., -older versions of initscripts or sysconfig). - - If you're unsure whether your distro uses sysconfig, -initscripts or interfaces, or don't know if it's new enough, have no fear.
-Determining this is fairly straightforward. - - First, look for a file called interfaces in the /etc/network directory. -If this file is present in your system, then your system uses interfaces. See -Configuration with Interfaces Support. - - Otherwise, issue the command: - -$ rpm -qf /sbin/ifup - - It will respond with a line of text starting with either -"initscripts" or "sysconfig," followed by some numbers. This is the -package that provides your network initialization scripts. - - Next, to determine if your installation supports bonding, -issue the command: - -$ grep ifenslave /sbin/ifup - - If this returns any matches, then your initscripts or -sysconfig has support for bonding. - -3.1 Configuration with Sysconfig Support ----------------------------------------- - - This section applies to distros using a version of sysconfig -with bonding support, for example, SuSE Linux Enterprise Server 9. - - SuSE SLES 9's networking configuration system does support -bonding, however, at this writing, the YaST system configuration -front end does not provide any means to work with bonding devices. -Bonding devices can be managed by hand, however, as follows. - - First, if they have not already been configured, configure the -slave devices. On SLES 9, this is most easily done by running the -yast2 sysconfig configuration utility. The goal is to create an -ifcfg-id file for each slave device. The simplest way to accomplish -this is to configure the devices for DHCP (this is only to get the -ifcfg-id file created; see below for some issues with DHCP). The -name of the configuration file for each device will be of the form: - -ifcfg-id-xx:xx:xx:xx:xx:xx - - Where the "xx" portion will be replaced with the digits from -the device's permanent MAC address. - - Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been -created, it is necessary to edit the configuration files for the slave -devices (the MAC addresses correspond to those of the slave devices).
-Before editing, the file will contain multiple lines, and will look -something like this: - -BOOTPROTO='dhcp' -STARTMODE='on' -USERCTL='no' -UNIQUE='XNzu.WeZGOGF+4wE' -_nm_name='bus-pci-0001:61:01.0' - - Change the BOOTPROTO and STARTMODE lines to the following: - -BOOTPROTO='none' -STARTMODE='off' - - Do not alter the UNIQUE or _nm_name lines. Remove any other -lines (USERCTL, etc). - - Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, -it's time to create the configuration file for the bonding device -itself. This file is named ifcfg-bondX, where X is the number of the -bonding device to create, starting at 0. The first such file is -ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig -network configuration system will correctly start multiple instances -of bonding. - - The contents of the ifcfg-bondX file is as follows: - -BOOTPROTO="static" -BROADCAST="10.0.2.255" -IPADDR="10.0.2.10" -NETMASK="255.255.0.0" -NETWORK="10.0.2.0" -REMOTE_IPADDR="" -STARTMODE="onboot" -BONDING_MASTER="yes" -BONDING_MODULE_OPTS="mode=active-backup miimon=100" -BONDING_SLAVE0="eth0" -BONDING_SLAVE1="bus-pci-0000:06:08.1" - - Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK -values with the appropriate values for your network. - - The STARTMODE specifies when the device is brought online. -The possible values are: - - onboot: The device is started at boot time. If you're not - sure, this is probably what you want. - - manual: The device is started only when ifup is called - manually. Bonding devices may be configured this - way if you do not wish them to start automatically - at boot for some reason. - - hotplug: The device is started by a hotplug event. This is not - a valid choice for a bonding device. - - off or ignore: The device configuration is ignored. - - The line BONDING_MASTER='yes' indicates that the device is a -bonding master device. The only useful value is "yes." 
- - The contents of BONDING_MODULE_OPTS are supplied to the -instance of the bonding module for this device. Specify the options -for the bonding mode, link monitoring, and so on here. Do not include -the max_bonds bonding parameter; this will confuse the configuration -system if you have multiple bonding devices. - - Finally, supply one BONDING_SLAVEn="slave device" for each -slave, where "n" is an increasing value, one for each slave. The -"slave device" is either an interface name, e.g., "eth0", or a device -specifier for the network device. The interface name is easier to -find, but the ethN names are subject to change at boot time if, e.g., -a device early in the sequence has failed. The device specifiers -(bus-pci-0000:06:08.1 in the example above) specify the physical -network device, and will not change unless the device's bus location -changes (for example, it is moved from one PCI slot to another). The -example above uses one of each type for demonstration purposes; most -configurations will choose one or the other for all slave devices. - - When all configuration files have been modified or created, -networking must be restarted for the configuration changes to take -effect. This can be accomplished via the following: - -# /etc/init.d/network restart - - Note that the network control script (/sbin/ifdown) will -remove the bonding module as part of the network shutdown processing, -so it is not necessary to remove the module by hand if, e.g., the -module parameters have changed. - - Also, at this writing, YaST/YaST2 will not manage bonding -devices (it does not show bonding interfaces in its list of network -devices). It is necessary to edit the configuration file by hand to -change the bonding configuration.
- - Additional general options and details of the ifcfg file -format can be found in an example ifcfg template file: - -/etc/sysconfig/network/ifcfg.template - - Note that the template does not document the various BONDING_ -settings described above, but does describe many of the other options. - -3.1.1 Using DHCP with Sysconfig -------------------------------- - - Under sysconfig, configuring a device with BOOTPROTO='dhcp' -will cause it to query DHCP for its IP address information. At this -writing, this does not function for bonding devices; the scripts -attempt to obtain the device address from DHCP prior to adding any of -the slave devices. Without active slaves, the DHCP requests are not -sent to the network. - -3.1.2 Configuring Multiple Bonds with Sysconfig ------------------------------------------------ - - The sysconfig network initialization system is capable of -handling multiple bonding devices. All that is necessary is for each -bonding instance to have an appropriately configured ifcfg-bondX file -(as described above). Do not specify the "max_bonds" parameter to any -instance of bonding, as this will confuse sysconfig. If you require -multiple bonding devices with identical parameters, create multiple -ifcfg-bondX files. - - Because the sysconfig scripts supply the bonding module -options in the ifcfg-bondX file, it is not necessary to add them to -the system /etc/modules.d/*.conf configuration files. - -3.2 Configuration with Initscripts Support ------------------------------------------- - - This section applies to distros using a recent version of -initscripts with bonding support, for example, Red Hat Enterprise Linux -version 3 or later, Fedora, etc. On these systems, the network -initialization scripts have knowledge of bonding, and can be configured to -control bonding devices. Note that older versions of the initscripts -package have lower levels of support for bonding; this will be noted where -applicable. 
- - These distros will not automatically load the network adapter -driver unless the ethX device is configured with an IP address. -Because of this constraint, users must manually configure a -network-script file for all physical adapters that will be members of -a bondX link. Network script files are located in the directory: - -/etc/sysconfig/network-scripts - - The file name must be prefixed with "ifcfg-eth" and suffixed -with the adapter's physical adapter number. For example, the script -for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. -Place the following text in the file: - -DEVICE=eth0 -USERCTL=no -ONBOOT=yes -MASTER=bond0 -SLAVE=yes -BOOTPROTO=none - - The DEVICE= line will be different for every ethX device and -must correspond with the name of the file, i.e., ifcfg-eth1 must have -a device line of DEVICE=eth1. The setting of the MASTER= line will -also depend on the final bonding interface name chosen for your bond. -As with other network devices, these typically start at 0, and go up -one for each device, i.e., the first bonding instance is bond0, the -second is bond1, and so on. - - Next, create a bond network script. The file name for this -script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is -the number of the bond. For bond0 the file is named "ifcfg-bond0", -for bond1 it is named "ifcfg-bond1", and so on. Within that file, -place the following text: - -DEVICE=bond0 -IPADDR=192.168.1.1 -NETMASK=255.255.255.0 -NETWORK=192.168.1.0 -BROADCAST=192.168.1.255 -ONBOOT=yes -BOOTPROTO=none -USERCTL=no - - Be sure to change the networking specific lines (IPADDR, -NETMASK, NETWORK and BROADCAST) to match your network configuration. - - For later versions of initscripts, such as that found with Fedora -7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, -and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 -file, e.g. 
a line of the format: - -BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" - - will configure the bond with the specified options. The options -specified in BONDING_OPTS are identical to the bonding module parameters -except for the arp_ip_target field when using versions of initscripts older -than 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When -using older versions each target should be included as a separate option and -should be preceded by a '+' to indicate it should be added to the list of -queried targets, e.g., - - arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 - - is the proper syntax to specify multiple targets. When specifying -options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf. - - For even older versions of initscripts that do not support -BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf (depending upon -your distro) to load the bonding module with your desired options when the -bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf -will load the bonding module, and select its options: - -alias bond0 bonding -options bond0 mode=balance-alb miimon=100 - - Replace the sample parameters with the appropriate set of -options for your configuration. - - Finally run "/etc/rc.d/init.d/network restart" as root. This -will restart the networking subsystem and your bond link should now be -up and running. - -3.2.1 Using DHCP with Initscripts ---------------------------------- - - Recent versions of initscripts (the versions supplied with Fedora -Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to -work) have support for assigning IP information to bonding devices via -DHCP. - - To configure bonding for DHCP, configure it as described -above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" -and add a line consisting of "TYPE=Bonding". Note that the TYPE value -is case sensitive.
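Putting those two changes together, an ifcfg-bond0 for DHCP might look like the following sketch. It is derived from the static ifcfg-bond0 example earlier in this section; the static address lines (IPADDR, NETMASK, NETWORK and BROADCAST) are left out here on the assumption that DHCP supplies them, which the text above does not state explicitly. Treat it as a starting point rather than a verified configuration.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch for DHCP)
DEVICE=bond0
ONBOOT=yes
# Changed from BOOTPROTO=none in the static example.
BOOTPROTO=dhcp
USERCTL=no
# The TYPE value is case sensitive.
TYPE=Bonding
```

The ifcfg-ethX files for the slave devices are assumed to stay exactly as in the static case.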
- -3.2.2 Configuring Multiple Bonds with Initscripts -------------------------------------------------- - - Initscripts packages that are included with Fedora 7 and Red Hat -Enterprise Linux 5 support multiple bonding interfaces by simply -specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the -number of the bond. This support requires sysfs support in the kernel, -and a bonding driver of version 3.0.0 or later. Other configurations may -not support this method for specifying multiple bonding interfaces; for -those instances, see the "Configuring Multiple Bonds Manually" section, -below. - -3.3 Configuring Bonding Manually with iproute2 ------------------------------------------------ - - This section applies to distros whose network initialization -scripts (the sysconfig or initscripts package) do not have specific -knowledge of bonding. One such distro is SuSE Linux Enterprise Server -version 8. - - The general method for these systems is to place the bonding -module parameters into a config file in /etc/modprobe.d/ (as -appropriate for the installed distro), then add modprobe and/or -`ip link` commands to the system's global init script. The name of -the global init script differs; for sysconfig, it is -/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. - - For example, if you wanted to make a simple bond of two e100 -devices (presumed to be eth0 and eth1), and have it persist across -reboots, edit the appropriate file (/etc/init.d/boot.local or -/etc/rc.d/rc.local), and add the following: - -modprobe bonding mode=balance-alb miimon=100 -modprobe e100 -ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -ip link set eth0 master bond0 -ip link set eth1 master bond0 - - Replace the example bonding module parameters and bond0 -network configuration (IP address, netmask, etc) with the appropriate -values for your configuration. 
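The example above brings bond0 up with ifconfig. Where only iproute2 is
installed, the same steps can be expressed with ip commands. The sketch
below is a hypothetical helper, bond_rc_lines, that only prints the lines to
paste into the init script; it executes nothing itself. The /24 prefix
length corresponds to the 255.255.255.0 netmask used above. The helper name
and its argument order are assumptions of this example.

```shell
# bond_rc_lines BOND ADDR/PREFIX DRIVER SLAVE...
# Hypothetical helper: prints init-script lines, executes nothing.
bond_rc_lines() {
    bond=$1
    addr=$2
    driver=$3
    shift 3
    printf 'modprobe bonding mode=balance-alb miimon=100\n'
    printf 'modprobe %s\n' "$driver"
    # iproute2 equivalents of "ifconfig BOND ADDR netmask MASK up"
    printf 'ip addr add %s dev %s\n' "$addr" "$bond"
    printf 'ip link set %s up\n' "$bond"
    # One enslave command per remaining argument
    for slave in "$@"; do
        printf 'ip link set %s master %s\n' "$slave" "$bond"
    done
}

bond_rc_lines bond0 192.168.1.1/24 e100 eth0 eth1
```

Printing the commands rather than running them keeps the sketch safe to try
on a machine without root access or bonding hardware.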
- - Unfortunately, this method will not provide support for the -ifup and ifdown scripts on the bond devices. To reload the bonding -configuration, it is necessary to run the initialization script, e.g., - -# /etc/init.d/boot.local - - or - -# /etc/rc.d/rc.local - - It may be desirable in such a case to create a separate script -which only initializes the bonding configuration, then call that -separate script from within boot.local. This allows for bonding to be -enabled without re-running the entire global init script. - - To shut down the bonding devices, it is necessary to first -mark the bonding device itself as being down, then remove the -appropriate device driver modules. For our example above, you can do -the following: - -# ifconfig bond0 down -# rmmod bonding -# rmmod e100 - - Again, for convenience, it may be desirable to create a script -with these commands. - - -3.3.1 Configuring Multiple Bonds Manually ------------------------------------------ - - This section contains information on configuring multiple -bonding devices with differing options for those systems whose network -initialization scripts lack support for configuring multiple bonds. - - If you require multiple bonding devices, but all with the same -options, you may wish to use the "max_bonds" module parameter, -documented above. - - To create multiple bonding devices with differing options, it is -preferable to use bonding parameters exported by sysfs, documented in the -section below. - - For versions of bonding without sysfs support, the only means to -provide multiple instances of bonding with differing options is to load -the bonding driver multiple times. Note that current versions of the -sysconfig network initialization scripts handle this automatically; if -your distro uses these scripts, no special action is needed. See the -section Configuring Bonding Devices, above, if you're not sure about your -network initialization scripts. 
- - To load multiple instances of the module, it is necessary to -specify a different name for each instance (the module loading system -requires that every loaded module, even multiple instances of the same -module, have a unique name). This is accomplished by supplying multiple -sets of bonding options in /etc/modprobe.d/*.conf, for example: - -alias bond0 bonding -options bond0 -o bond0 mode=balance-rr miimon=100 - -alias bond1 bonding -options bond1 -o bond1 mode=balance-alb miimon=50 - - will load the bonding module two times. The first instance is -named "bond0" and creates the bond0 device in balance-rr mode with an -miimon of 100. The second instance is named "bond1" and creates the -bond1 device in balance-alb mode with an miimon of 50. - - In some circumstances (typically with older distributions), -the above does not work, and the second bonding instance never sees -its options. In that case, the second options line can be substituted -as follows: - -install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ - mode=balance-alb miimon=50 - - This may be repeated any number of times, specifying a new and -unique name in place of bond1 for each subsequent instance. - - It has been observed that some Red Hat supplied kernels are unable -to rename modules at load time (the "-o bond1" part). Attempts to pass -that option to modprobe will produce an "Operation not permitted" error. -This has been reported on some Fedora Core kernels, and has been seen on -RHEL 4 as well. On kernels exhibiting this problem, it will be impossible -to configure multiple bonds with differing parameters (as they are older -kernels, and also lack sysfs support). - -3.4 Configuring Bonding Manually via Sysfs ------------------------------------------- - - Starting with version 3.0.0, Channel Bonding may be configured -via the sysfs interface. This interface allows dynamic configuration -of all bonds in the system without unloading the module. 
It also -allows for adding and removing bonds at runtime. Ifenslave is no -longer required, though it is still supported. - - Use of the sysfs interface allows you to use multiple bonds -with different configurations without having to reload the module. -It also allows you to use multiple, differently configured bonds when -bonding is compiled into the kernel. - - You must have the sysfs filesystem mounted to configure -bonding this way. The examples in this document assume that you -are using the standard mount point for sysfs, e.g. /sys. If your -sysfs filesystem is mounted elsewhere, you will need to adjust the -example paths accordingly. - -Creating and Destroying Bonds ------------------------------ -To add a new bond foo: -# echo +foo > /sys/class/net/bonding_masters - -To remove an existing bond bar: -# echo -bar > /sys/class/net/bonding_masters - -To show all existing bonds: -# cat /sys/class/net/bonding_masters - -NOTE: due to 4K size limitation of sysfs files, this list may be -truncated if you have more than a few hundred bonds. This is unlikely -to occur under normal operating conditions. - -Adding and Removing Slaves --------------------------- - Interfaces may be enslaved to a bond using the file -/sys/class/net//bonding/slaves. The semantics for this file -are the same as for the bonding_masters file. - -To enslave interface eth0 to bond bond0: -# ifconfig bond0 up -# echo +eth0 > /sys/class/net/bond0/bonding/slaves - -To free slave eth0 from bond bond0: -# echo -eth0 > /sys/class/net/bond0/bonding/slaves - - When an interface is enslaved to a bond, symlinks between the -two are created in the sysfs filesystem. In this case, you would get -/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and -/sys/class/net/eth0/master pointing to /sys/class/net/bond0. - - This means that you can tell quickly whether or not an -interface is enslaved by looking for the master symlink. 
Thus: -# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves -will free eth0 from whatever bond it is enslaved to, regardless of -the name of the bond interface. - -Changing a Bond's Configuration -------------------------------- - Each bond may be configured individually by manipulating the -files located in /sys/class/net//bonding - - The names of these files correspond directly with the -command-line parameters described elsewhere in this file, and, with the -exception of arp_ip_target, they accept the same values. To see the -current setting, simply cat the appropriate file. - - A few examples will be given here; for specific usage -guidelines for each parameter, see the appropriate section in this -document. - -To configure bond0 for balance-alb mode: -# ifconfig bond0 down -# echo 6 > /sys/class/net/bond0/bonding/mode - - or - -# echo balance-alb > /sys/class/net/bond0/bonding/mode - NOTE: The bond interface must be down before the mode can be -changed. - -To enable MII monitoring on bond0 with a 1 second interval: -# echo 1000 > /sys/class/net/bond0/bonding/miimon - NOTE: If ARP monitoring is enabled, it will be disabled when MII -monitoring is enabled, and vice-versa. - -To add ARP targets: -# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target -# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target - NOTE: up to 16 target addresses may be specified. - -To remove an ARP target: -# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target - -To configure the interval between learning packet transmits: -# echo 12 > /sys/class/net/bond0/bonding/lp_interval - NOTE: the lp_interval is the number of seconds between instances where -the bonding driver sends learning packets to each slave's peer switch. The -default interval is 1 second. - -Example Configuration ---------------------- - We begin with the same example that is shown in section 3.3, -executed with sysfs, and without using ifenslave.
- - To make a simple bond of two e100 devices (presumed to be eth0 -and eth1), and have it persist across reboots, edit the appropriate -file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the -following: - -modprobe bonding -modprobe e100 -echo balance-alb > /sys/class/net/bond0/bonding/mode -ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -echo 100 > /sys/class/net/bond0/bonding/miimon -echo +eth0 > /sys/class/net/bond0/bonding/slaves -echo +eth1 > /sys/class/net/bond0/bonding/slaves - - To add a second bond, with two e1000 interfaces in -active-backup mode, using ARP monitoring, add the following lines to -your init script: - -modprobe e1000 -echo +bond1 > /sys/class/net/bonding_masters -echo active-backup > /sys/class/net/bond1/bonding/mode -ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up -echo +192.168.2.100 > /sys/class/net/bond1/bonding/arp_ip_target -echo 2000 > /sys/class/net/bond1/bonding/arp_interval -echo +eth2 > /sys/class/net/bond1/bonding/slaves -echo +eth3 > /sys/class/net/bond1/bonding/slaves - -3.5 Configuration with Interfaces Support ------------------------------------------ - - This section applies to distros which use the /etc/network/interfaces file -to describe network interface configuration, most notably Debian and its -derivatives. - - The ifup and ifdown commands on Debian don't support bonding out of -the box. The ifenslave-2.6 package should be installed to provide bonding -support. Once installed, this package will provide bond-* options to be used -in /etc/network/interfaces. - - Note that the ifenslave-2.6 package will load the bonding module and use -the ifenslave command when appropriate. - -Example Configurations ----------------------- - -In /etc/network/interfaces, the following stanza will configure bond0, in -active-backup mode, with eth0 and eth1 as slaves.
- -auto bond0 -iface bond0 inet dhcp - bond-slaves eth0 eth1 - bond-mode active-backup - bond-miimon 100 - bond-primary eth0 eth1 - -If the above configuration doesn't work, you might have a system using -upstart for system startup. This is most notably true for recent -Ubuntu versions. The following stanza in /etc/network/interfaces will -produce the same result on those systems. - -auto bond0 -iface bond0 inet dhcp - bond-slaves none - bond-mode active-backup - bond-miimon 100 - -auto eth0 -iface eth0 inet manual - bond-master bond0 - bond-primary eth0 eth1 - -auto eth1 -iface eth1 inet manual - bond-master bond0 - bond-primary eth0 eth1 - -For a full list of supported bond-* options in /etc/network/interfaces and some -more advanced examples tailored to your particular distro, see the files in -/usr/share/doc/ifenslave-2.6. - -3.6 Overriding Configuration for Special Cases ----------------------------------------------- - -When using the bonding driver, the physical port which transmits a frame is -typically selected by the bonding driver, and is not relevant to the user or -system administrator. The output port is simply selected using the policies of -the selected bonding mode. On occasion, however, it is helpful to direct certain -classes of traffic to certain physical interfaces on output to implement -slightly more complex policies. For example, to reach a web server over a -bonded interface in which eth0 connects to a private network, while eth1 -connects via a public network, it may be desirable to bias the bond to send said -traffic over eth0 first, using eth1 only as a fallback, while all other traffic -can safely be sent over either interface. Such configurations may be achieved -using the traffic control utilities available in Linux. - -By default the bonding driver is multiqueue aware and 16 queues are created -when the driver initializes (see Documentation/networking/multiqueue.txt -for details).
If more or fewer queues are desired, the module parameter -tx_queues can be used to change this value. There is no sysfs parameter -available as the allocation is done at module init time. - -The output of the file /proc/net/bonding/bondX has changed so the output Queue -ID is now printed for each slave: - -Bonding Mode: fault-tolerance (active-backup) -Primary Slave: None -Currently Active Slave: eth0 -MII Status: up -MII Polling Interval (ms): 0 -Up Delay (ms): 0 -Down Delay (ms): 0 - -Slave Interface: eth0 -MII Status: up -Link Failure Count: 0 -Permanent HW addr: 00:1a:a0:12:8f:cb -Slave queue ID: 0 - -Slave Interface: eth1 -MII Status: up -Link Failure Count: 0 -Permanent HW addr: 00:1a:a0:12:8f:cc -Slave queue ID: 2 - -The queue_id for a slave can be set using the command: - -# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id - -Any interface that needs a queue_id set should set it with multiple calls -like the one above until proper priorities are set for all interfaces. On -distributions that allow configuration via initscripts, multiple 'queue_id' -arguments can be added to BONDING_OPTS to set all needed slave queues. - -These queue IDs can be used in conjunction with the tc utility to configure -a multiqueue qdisc and filters to bias certain traffic to transmit on certain -slave devices. For instance, say we wanted, in the above configuration, to -force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output -device. The following commands would accomplish this: - -# tc qdisc add dev bond0 handle 1 root multiq - -# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \ - 192.168.1.100 action skbedit queue_mapping 2 - -These commands tell the kernel to attach a multiqueue queue discipline to the -bond0 interface and filter traffic enqueued to it, such that packets with a dst -ip of 192.168.1.100 have their output queue mapping value overwritten to 2.
-This value is then passed into the driver, causing the normal output path -selection policy to be overridden, selecting instead qid 2, which maps to eth1. - -Note that qid values begin at 1. Qid 0 is reserved to indicate to the driver -that normal output policy selection should take place. One benefit to simply -leaving the qid for a slave at 0 is the multiqueue awareness in the bonding -driver that is now present. This awareness allows tc filters to be placed on -slave devices as well as bond devices and the bonding driver will simply act as -a pass-through for selecting output queues on the slave device rather than -output port selection. - -This feature first appeared in bonding driver version 3.7.0 and support for -output slave selection was limited to round-robin and active-backup modes. - -3.7 Configuring LACP for 802.3ad mode in a more secure way ----------------------------------------------------------- - -When using 802.3ad bonding mode, the Actor (host) and Partner (switch) -exchange LACPDUs. These LACPDUs cannot be sniffed, because they are -destined to link-local MAC addresses (which switches/bridges are not -supposed to forward). However, most of the values are easily predictable -or are simply the machine's MAC address (which is trivially known to all -other hosts in the same L2). This implies that other machines in the L2 -domain can spoof LACPDU packets from other hosts to the switch and potentially -cause mayhem by joining (from the point of view of the switch) another -machine's aggregate, thus receiving a portion of that host's incoming -traffic and/or spoofing traffic from that machine themselves (potentially -even successfully terminating some portion of flows). Though this is not -a likely scenario, one could avoid this possibility by simply configuring -a few bonding parameters: - - (a) ad_actor_system : You can set a random mac-address that can be used for - these LACPDU exchanges. The value cannot be either NULL or Multicast.
- Also it's preferable to set the local-admin bit. Following shell code - generates a random mac-address as described above. - - # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ - $(( (RANDOM & 0xFE) | 0x02 )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF ))) - # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system - - (b) ad_actor_sys_prio : Randomize the system priority. The default value - is 65535, but system can take the value from 1 - 65535. Following shell - code generates random priority and sets it. - - # sys_prio=$(( 1 + RANDOM + RANDOM )) - # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio - - (c) ad_user_port_key : Use the user portion of the port-key. The default - keeps this empty. These are the upper 10 bits of the port-key and value - ranges from 0 - 1023. Following shell code generates these 10 bits and - sets it. - - # usr_port_key=$(( RANDOM & 0x3FF )) - # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key - - -4 Querying Bonding Configuration -================================= - -4.1 Bonding Configuration -------------------------- - - Each bonding device has a read-only file residing in the -/proc/net/bonding directory. The file contents include information -about the bonding configuration, options and state of each slave. 
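Because the file is plain text, per-slave state can be pulled out with
standard tools. The following is a minimal sketch, assuming the
"Slave Interface:" and "MII Status:" labels that appear in the sample
outputs in this document; the exact format depends on the driver version, so
treat those field names as assumptions. bond_slave_status is a hypothetical
helper name.

```shell
# bond_slave_status FILE
# Hypothetical helper: prints "<slave> <MII status>" for each slave
# listed in a /proc/net/bonding/bondX style file.
bond_slave_status() {
    awk -F': *' '
        # Remember the slave named by the most recent "Slave Interface" line.
        /^[ \t]*Slave Interface:/ { slave = $2; next }
        # The first "MII Status" after it belongs to that slave; the
        # bond-level "MII Status" line is skipped since no slave is set yet.
        /^[ \t]*MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "$1"
}
```

For example, "bond_slave_status /proc/net/bonding/bond0" would list each
slave of bond0 together with its link state.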
- - For example, the contents of /proc/net/bonding/bond0 after the -driver is loaded with parameters of mode=0 and miimon=1000 is -generally as follows: - - Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) - Bonding Mode: load balancing (round-robin) - Currently Active Slave: eth0 - MII Status: up - MII Polling Interval (ms): 1000 - Up Delay (ms): 0 - Down Delay (ms): 0 - - Slave Interface: eth1 - MII Status: up - Link Failure Count: 1 - - Slave Interface: eth0 - MII Status: up - Link Failure Count: 1 - - The precise format and contents will change depending upon the -bonding configuration, state, and version of the bonding driver. - -4.2 Network configuration -------------------------- - - The network configuration can be inspected using the ifconfig -command. Bonding devices will have the MASTER flag set; Bonding slave -devices will have the SLAVE flag set. The ifconfig output does not -contain information on which slaves are associated with which masters. - - In the example below, the bond0 interface is the master -(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of -bond0 have the same MAC address (HWaddr) as bond0 for all modes except -TLB and ALB that require a unique MAC address for each slave. 
- -# /sbin/ifconfig -bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 - UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 - RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 - TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 - collisions:0 txqueuelen:0 - -eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 - RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 - TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 - collisions:0 txqueuelen:100 - Interrupt:10 Base address:0x1080 - -eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 - RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 - TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 - collisions:0 txqueuelen:100 - Interrupt:9 Base address:0x1400 - -5. Switch Configuration -======================= - - For this section, "switch" refers to whatever system the -bonded devices are directly connected to (i.e., where the other end of -the cable plugs into). This may be an actual dedicated switch device, -or it may be another regular system (e.g., another computer running -Linux), - - The active-backup, balance-tlb and balance-alb modes do not -require any specific configuration of the switch. - - The 802.3ad mode requires that the switch have the appropriate -ports configured as an 802.3ad aggregation. The precise method used -to configure this varies from switch to switch, but, for example, a -Cisco 3550 series switch requires that the appropriate ports first be -grouped together in a single etherchannel instance, then that -etherchannel is set to mode "lacp" to enable 802.3ad (instead of -standard EtherChannel). - - The balance-rr, balance-xor and broadcast modes generally -require that the switch have the appropriate ports grouped together. 
-The nomenclature for such a group differs between switches, it may be -called an "etherchannel" (as in the Cisco example, above), a "trunk -group" or some other similar variation. For these modes, each switch -will also have its own configuration options for the switch's transmit -policy to the bond. Typical choices include XOR of either the MAC or -IP addresses. The transmit policy of the two peers does not need to -match. For these three modes, the bonding mode really selects a -transmit policy for an EtherChannel group; all three will interoperate -with another EtherChannel group. - - -6. 802.1q VLAN Support -====================== - - It is possible to configure VLAN devices over a bond interface -using the 8021q driver. However, only packets coming from the 8021q -driver and passing through bonding will be tagged by default. Self -generated packets, for example, bonding's learning packets or ARP -packets generated by either ALB mode or the ARP monitor mechanism, are -tagged internally by bonding itself. As a result, bonding must -"learn" the VLAN IDs configured above it, and use those IDs to tag -self generated packets. - - For reasons of simplicity, and to support the use of adapters -that can do VLAN hardware acceleration offloading, the bonding -interface declares itself as fully hardware offloading capable, it gets -the add_vid/kill_vid notifications to gather the necessary -information, and it propagates those actions to the slaves. In case -of mixed adapter types, hardware accelerated tagged packets that -should go through an adapter that is not offloading capable are -"un-accelerated" by the bonding driver so the VLAN tag sits in the -regular location. - - VLAN interfaces *must* be added on top of a bonding interface -only after enslaving at least one slave. The bonding interface has a -hardware address of 00:00:00:00:00:00 until the first slave is added. 
-If the VLAN interface is created prior to the first enslavement, it -would pick up the all-zeroes hardware address. Once the first slave -is attached to the bond, the bond device itself will pick up the -slave's hardware address, which is then available for the VLAN device. - - Also, be aware that a similar problem can occur if all slaves -are released from a bond that still has one or more VLAN interfaces on -top of it. When a new slave is added, the bonding interface will -obtain its hardware address from the first slave, which might not -match the hardware address of the VLAN interfaces (which was -ultimately copied from an earlier slave). - - There are two methods to insure that the VLAN device operates -with the correct hardware address if all slaves are removed from a -bond interface: - - 1. Remove all VLAN interfaces then recreate them - - 2. Set the bonding interface's hardware address so that it -matches the hardware address of the VLAN interfaces. - - Note that changing a VLAN interface's HW address would set the -underlying device -- i.e. the bonding interface -- to promiscuous -mode, which might not be what you want. - - -7. Link Monitoring -================== - - The bonding driver at present supports two schemes for -monitoring a slave device's link state: the ARP monitor and the MII -monitor. - - At the present time, due to implementation restrictions in the -bonding driver itself, it is not possible to enable both ARP and MII -monitoring simultaneously. - -7.1 ARP Monitor Operation -------------------------- - - The ARP monitor operates as its name suggests: it sends ARP -queries to one or more designated peer systems on the network, and -uses the response as an indication that the link is operating. This -gives some assurance that traffic is actually flowing to and from one -or more peers on the local network. - - The ARP monitor relies on the device driver itself to verify -that traffic is flowing. 
In particular, the driver must keep up to -date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX -flag must also update netdev_queue->trans_start. If they do not, then the -ARP monitor will immediately fail any slaves using that driver, and -those slaves will stay down. If networking monitoring (tcpdump, etc) -shows the ARP requests and replies on the network, then it may be that -your device driver is not updating last_rx and trans_start. - -7.2 Configuring Multiple ARP Targets ------------------------------------- - - While ARP monitoring can be done with just one target, it can -be useful in a High Availability setup to have several targets to -monitor. In the case of just one target, the target itself may go -down or have a problem making it unresponsive to ARP requests. Having -an additional target (or several) increases the reliability of the ARP -monitoring. - - Multiple ARP targets must be separated by commas as follows: - -# example options for ARP monitoring with three targets -alias bond0 bonding -options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 - - For just a single target the options would resemble: - -# example options for ARP monitoring with one target -alias bond0 bonding -options bond0 arp_interval=60 arp_ip_target=192.168.0.100 - - -7.3 MII Monitor Operation -------------------------- - - The MII monitor monitors only the carrier state of the local -network interface. It accomplishes this in one of three ways: by -depending upon the device driver to maintain its carrier state, by -querying the device's MII registers, or by making an ethtool query to -the device. - - If the use_carrier module parameter is 1 (the default value), -then the MII monitor will rely on the driver for carrier state -information (via the netif_carrier subsystem). 
As explained in the -use_carrier parameter information, above, if the MII monitor fails to -detect carrier loss on the device (e.g., when the cable is physically -disconnected), it may be that the driver does not support -netif_carrier. - - If use_carrier is 0, then the MII monitor will first query the -device's MII registers (via ioctl) and check the link state. If that -request fails (not just that it returns carrier down), then the MII -monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain -the same information. If both methods fail (i.e., the driver either -does not support or had some error in processing both the MII register -and ethtool requests), then the MII monitor will assume the link is -up. - -8. Potential Sources of Trouble -=============================== - -8.1 Adventures in Routing -------------------------- - - When bonding is configured, it is important that the slave -devices not have routes that supersede routes of the master (or, -generally, not have routes at all). For example, suppose the bonding -device bond0 has two slaves, eth0 and eth1, and the routing table is -as follows: - -Kernel IP routing table -Destination Gateway Genmask Flags MSS Window irtt Iface -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 -127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo - - This routing configuration will likely still update the -receive/transmit times in the driver (needed by the ARP monitor), but -may bypass the bonding driver (because outgoing traffic to, in this -case, another host on network 10 would use eth0 or eth1 before bond0). - - The ARP monitor (and ARP itself) may become confused by this -configuration, because ARP requests (generated by the ARP monitor) -will be sent on one interface (bond0), but the corresponding reply -will arrive on a different interface (eth0).
This reply looks to ARP -like an unsolicited ARP reply (because ARP matches replies on an -interface basis), and is discarded. The MII monitor is not affected -by the state of the routing table. - - The solution here is simply to ensure that slaves do not have -routes of their own, and if for some reason they must, those routes do -not supersede routes of their master. This should generally be the -case, but unusual configurations or errant manual or automatic static -route additions may cause trouble. - -8.2 Ethernet Device Renaming ----------------------------- - - On systems with network configuration scripts that do not -associate physical devices directly with network interface names (so -that the same physical device always has the same "ethX" name), it may -be necessary to add some special logic to config files in -/etc/modprobe.d/. - - For example, given a modules.conf containing the following: - -alias bond0 bonding -options bond0 mode=some-mode miimon=50 -alias eth0 tg3 -alias eth1 tg3 -alias eth2 e1000 -alias eth3 e1000 - - If neither eth0 nor eth1 is a slave to bond0, then when the -bond0 interface comes up, the devices may end up reordered. This -happens because bonding is loaded first, then its slave devices' -drivers are loaded next. Since no other drivers have been loaded, -when the e1000 driver loads, it will receive eth0 and eth1 for its -devices, but the bonding configuration tries to enslave eth2 and eth3 -(which may later be assigned to the tg3 devices). - - Adding the following: - -add above bonding e1000 tg3 - - causes modprobe to load e1000 then tg3, in that order, when -bonding is loaded. This command is fully documented in the -modules.conf manual page. - - On systems utilizing modprobe an equivalent problem can occur. -In this case, the following can be added to config files in -/etc/modprobe.d/ as: - -softdep bonding pre: tg3 e1000 - - This will load the tg3 and e1000 modules before loading the bonding module.
-Full documentation on this can be found in the modprobe.d and modprobe -manual pages. - -8.3. Painfully Slow Or No Failed Link Detection By Miimon ---------------------------------------------------------- - - By default, bonding enables the use_carrier option, which -instructs bonding to trust the driver to maintain carrier state. - - As discussed in the options section, above, some drivers do -not support the netif_carrier_on/_off link state tracking system. -With use_carrier enabled, bonding will always see these links as up, -regardless of their actual state. - - Additionally, other drivers do support netif_carrier, but do -not maintain it in real time, e.g., only polling the link state at -some fixed interval. In this case, miimon will detect failures, but -only after some long period of time has expired. If it appears that -miimon is very slow in detecting link failures, try specifying -use_carrier=0 to see if that improves the failure detection time. If -it does, then it may be that the driver checks the carrier state at a -fixed interval, but does not cache the MII register values (so the -use_carrier=0 method of querying the registers directly works). If -use_carrier=0 does not improve the failover, then the driver may cache -the registers, or the problem may be elsewhere. - - Also, remember that miimon only checks for the device's -carrier state. It has no way to determine the state of devices on or -beyond other ports of a switch, or if a switch is refusing to pass -traffic while still maintaining carrier on. - -9. SNMP agents -=============== - - If running SNMP agents, the bonding driver should be loaded -before any network drivers participating in a bond. This requirement -is due to the interface index (ipAdEntIfIndex) being associated to -the first interface found with a given IP address. That is, there is -only one ipAdEntIfIndex for each IP address. 
For example, if eth0 and -eth1 are slaves of bond0 and the driver for eth0 is loaded before the -bonding driver, the interface for the IP address will be associated -with the eth0 interface. This configuration is shown below, the IP -address 192.168.1.1 has an interface index of 2 which indexes to eth0 -in the ifDescr table (ifDescr.2). - - interfaces.ifTable.ifEntry.ifDescr.1 = lo - interfaces.ifTable.ifEntry.ifDescr.2 = eth0 - interfaces.ifTable.ifEntry.ifDescr.3 = eth1 - interfaces.ifTable.ifEntry.ifDescr.4 = eth2 - interfaces.ifTable.ifEntry.ifDescr.5 = eth3 - interfaces.ifTable.ifEntry.ifDescr.6 = bond0 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 - - This problem is avoided by loading the bonding driver before -any network drivers participating in a bond. Below is an example of -loading the bonding driver first, the IP address 192.168.1.1 is -correctly associated with ifDescr.2. - - interfaces.ifTable.ifEntry.ifDescr.1 = lo - interfaces.ifTable.ifEntry.ifDescr.2 = bond0 - interfaces.ifTable.ifEntry.ifDescr.3 = eth0 - interfaces.ifTable.ifEntry.ifDescr.4 = eth1 - interfaces.ifTable.ifEntry.ifDescr.5 = eth2 - interfaces.ifTable.ifEntry.ifDescr.6 = eth3 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 - - While some distributions may not report the interface name in -ifDescr, the association between the IP address and IfIndex remains -and SNMP functions such as Interface_Scan_Next will report that -association. - -10. 
Promiscuous mode -==================== - - When running network monitoring tools, e.g., tcpdump, it is -common to enable promiscuous mode on the device, so that all traffic -is seen (instead of seeing only traffic destined for the local host). -The bonding driver handles promiscuous mode changes to the bonding -master device (e.g., bond0), and propagates the setting to the slave -devices. - - For the balance-rr, balance-xor, broadcast, and 802.3ad modes, -the promiscuous mode setting is propagated to all slaves. - - For the active-backup, balance-tlb and balance-alb modes, the -promiscuous mode setting is propagated only to the active slave. - - For balance-tlb mode, the active slave is the slave currently -receiving inbound traffic. - - For balance-alb mode, the active slave is the slave used as a -"primary." This slave is used for mode-specific control traffic, for -sending to peers that are unassigned or if the load is unbalanced. - - For the active-backup, balance-tlb and balance-alb modes, when -the active slave changes (e.g., due to a link failure), the -promiscuous setting will be propagated to the new active slave. - -11. Configuring Bonding for High Availability -============================================= - - High Availability refers to configurations that provide -maximum network availability by having redundant or backup devices, -links or switches between the host and the rest of the world. The -goal is to provide the maximum availability of network connectivity -(i.e., the network always works), even though other configurations -could provide higher throughput. - -11.1 High Availability in a Single Switch Topology --------------------------------------------------- - - If two hosts (or a host and a single switch) are directly -connected via multiple physical links, then there is no availability -penalty to optimizing for maximum bandwidth. 
In this case, there is -only one switch (or peer), so if it fails, there is no alternative -access to fail over to. Additionally, the bonding load balance modes -support link monitoring of their members, so if individual links fail, -the load will be rebalanced across the remaining devices. - - See Section 12, "Configuring Bonding for Maximum Throughput" -for information on configuring bonding with one peer device. - -11.2 High Availability in a Multiple Switch Topology ----------------------------------------------------- - - With multiple switches, the configuration of bonding and the -network changes dramatically. In multiple switch topologies, there is -a trade off between network availability and usable bandwidth. - - Below is a sample network, configured to maximize the -availability of the network: - - | | - |port3 port3| - +-----+----+ +-----+----+ - | |port2 ISL port2| | - | switch A +--------------------------+ switch B | - | | | | - +-----+----+ +-----++---+ - |port1 port1| - | +-------+ | - +-------------+ host1 +---------------+ - eth0 +-------+ eth1 - - In this configuration, there is a link between the two -switches (ISL, or inter switch link), and multiple ports connecting to -the outside world ("port3" on each switch). There is no technical -reason that this could not be extended to a third switch. - -11.2.1 HA Bonding Mode Selection for Multiple Switch Topology -------------------------------------------------------------- - - In a topology such as the example above, the active-backup and -broadcast modes are the only useful bonding modes when optimizing for -availability; the other modes require all links to terminate on the -same peer for them to behave rationally. - -active-backup: This is generally the preferred mode, particularly if - the switches have an ISL and play together well. 
If the - network configuration is such that one switch is specifically - a backup switch (e.g., has lower capacity, higher cost, etc), - then the primary option can be used to insure that the - preferred link is always used when it is available. - -broadcast: This mode is really a special purpose mode, and is suitable - only for very specific needs. For example, if the two - switches are not connected (no ISL), and the networks beyond - them are totally independent. In this case, if it is - necessary for some specific one-way traffic to reach both - independent networks, then the broadcast mode may be suitable. - -11.2.2 HA Link Monitoring Selection for Multiple Switch Topology ----------------------------------------------------------------- - - The choice of link monitoring ultimately depends upon your -switch. If the switch can reliably fail ports in response to other -failures, then either the MII or ARP monitors should work. For -example, in the above example, if the "port3" link fails at the remote -end, the MII monitor has no direct means to detect this. The ARP -monitor could be configured with a target at the remote end of port3, -thus detecting that failure without switch support. - - In general, however, in a multiple switch topology, the ARP -monitor can provide a higher level of reliability in detecting end to -end connectivity failures (which may be caused by the failure of any -individual component to pass traffic for any reason). Additionally, -the ARP monitor should be configured with multiple targets (at least -one for each switch in the network). This will insure that, -regardless of which switch is active, the ARP monitor has a suitable -target to query. - - Note, also, that of late many switches now support a functionality -generally referred to as "trunk failover." This is a feature of the -switch that causes the link state of a particular switch port to be set -down (or up) when the state of another switch port goes down (or up). 
-Its purpose is to propagate link failures from logically "exterior" ports -to the logically "interior" ports that bonding is able to monitor via -miimon. Availability and configuration for trunk failover varies by -switch, but this can be a viable alternative to the ARP monitor when using -suitable switches. - -12. Configuring Bonding for Maximum Throughput -============================================== - -12.1 Maximizing Throughput in a Single Switch Topology ------------------------------------------------------- - - In a single switch configuration, the best method to maximize -throughput depends upon the application and network environment. The -various load balancing modes each have strengths and weaknesses in -different environments, as detailed below. - - For this discussion, we will break down the topologies into -two categories. Depending upon the destination of most traffic, we -categorize them into either "gatewayed" or "local" configurations. - - In a gatewayed configuration, the "switch" is acting primarily -as a router, and the majority of traffic passes through this router to -other networks. An example would be the following: - - - +----------+ +----------+ - | |eth0 port1| | to other networks - | Host A +---------------------+ router +-------------------> - | +---------------------+ | Hosts B and C are out - | |eth1 port2| | here somewhere - +----------+ +----------+ - - The router may be a dedicated router device, or another host -acting as a gateway. For our discussion, the important point is that -the majority of traffic from Host A will pass through the router to -some other network before reaching its final destination. - - In a gatewayed network configuration, although Host A may -communicate with many other systems, all of its traffic will be sent -and received via one other peer on the local network, the router. 
- - Note that the case of two systems connected directly via -multiple physical links is, for purposes of configuring bonding, the -same as a gatewayed configuration. In that case, it happens that all -traffic is destined for the "gateway" itself, not some other network -beyond the gateway. - - In a local configuration, the "switch" is acting primarily as -a switch, and the majority of traffic passes through this switch to -reach other stations on the same network. An example would be the -following: - - +----------+ +----------+ +--------+ - | |eth0 port1| +-------+ Host B | - | Host A +------------+ switch |port3 +--------+ - | +------------+ | +--------+ - | |eth1 port2| +------------------+ Host C | - +----------+ +----------+port4 +--------+ - - - Again, the switch may be a dedicated switch device, or another -host acting as a gateway. For our discussion, the important point is -that the majority of traffic from Host A is destined for other hosts -on the same local network (Hosts B and C in the above example). - - In summary, in a gatewayed configuration, traffic to and from -the bonded device will be to the same MAC level peer on the network -(the gateway itself, i.e., the router), regardless of its final -destination. In a local configuration, traffic flows directly to and -from the final destinations, thus, each destination (Host B, Host C) -will be addressed directly by their individual MAC addresses. - - This distinction between a gatewayed and a local network -configuration is important because many of the load balancing modes -available use the MAC addresses of the local network source and -destination to make load balancing decisions. The behavior of each -mode is described below. - - -12.1.1 MT Bonding Mode Selection for Single Switch Topology ------------------------------------------------------------ - - This configuration is the easiest to set up and to understand, -although you will have to decide which bonding mode best suits your -needs. 
The trade offs for each mode are detailed below: - -balance-rr: This mode is the only mode that will permit a single - TCP/IP connection to stripe traffic across multiple - interfaces. It is therefore the only mode that will allow a - single TCP/IP stream to utilize more than one interface's - worth of throughput. This comes at a cost, however: the - striping generally results in peer systems receiving packets out - of order, causing TCP/IP's congestion control system to kick - in, often by retransmitting segments. - - It is possible to adjust TCP/IP's congestion limits by - altering the net.ipv4.tcp_reordering sysctl parameter. The - usual default value is 3. But keep in mind TCP stack is able - to automatically increase this when it detects reorders. - - Note that the fraction of packets that will be delivered out of - order is highly variable, and is unlikely to be zero. The level - of reordering depends upon a variety of factors, including the - networking interfaces, the switch, and the topology of the - configuration. Speaking in general terms, higher speed network - cards produce more reordering (due to factors such as packet - coalescing), and a "many to many" topology will reorder at a - higher rate than a "many slow to one fast" configuration. - - Many switches do not support any modes that stripe traffic - (instead choosing a port based upon IP or MAC level addresses); - for those devices, traffic for a particular connection flowing - through the switch to a balance-rr bond will not utilize greater - than one interface's worth of bandwidth. - - If you are utilizing protocols other than TCP/IP, UDP for - example, and your application can tolerate out of order - delivery, then this mode can allow for single stream datagram - performance that scales near linearly as interfaces are added - to the bond. - - This mode requires the switch to have the appropriate ports - configured for "etherchannel" or "trunking." 
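A minimal sketch of the net.ipv4.tcp_reordering adjustment described for balance-rr above. The variable names are illustrative, and the value 10 in the commented-out command is an arbitrary assumption, not a recommendation. The read is done through procfs, so it needs no privileges; the write requires root and is therefore left as a comment.

```shell
# Read the current reordering threshold from procfs, if it is exposed.
reorder_file=/proc/sys/net/ipv4/tcp_reordering
if [ -r "$reorder_file" ]; then
    current=$(cat "$reorder_file")
else
    current=unknown
fi
echo "net.ipv4.tcp_reordering is currently: $current"

# To raise the threshold on a running system (requires root; 10 is only
# an example value):
#   sysctl -w net.ipv4.tcp_reordering=10
```

Because the TCP stack can raise this value on its own when it detects reordering, a manual change is mainly worth trying when retransmissions remain high on a balance-rr bond.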
- -active-backup: There is not much advantage in this network topology to - the active-backup mode, as the inactive backup devices are all - connected to the same peer as the primary. In this case, a - load balancing mode (with link monitoring) will provide the - same level of network availability, but with increased - available bandwidth. On the plus side, active-backup mode - does not require any configuration of the switch, so it may - have value if the hardware available does not support any of - the load balance modes. - -balance-xor: This mode will limit traffic such that packets destined - for specific peers will always be sent over the same - interface. Since the destination is determined by the MAC - addresses involved, this mode works best in a "local" network - configuration (as described above), with destinations all on - the same local network. This mode is likely to be suboptimal - if all your traffic is passed through a single router (i.e., a - "gatewayed" network configuration, as described above). - - As with balance-rr, the switch ports need to be configured for - "etherchannel" or "trunking." - -broadcast: Like active-backup, there is not much advantage to this - mode in this type of network topology. - -802.3ad: This mode can be a good choice for this type of network - topology. The 802.3ad mode is an IEEE standard, so all peers - that implement 802.3ad should interoperate well. The 802.3ad - protocol includes automatic configuration of the aggregates, - so minimal manual configuration of the switch is needed - (typically only to designate that some set of devices is - available for 802.3ad). The 802.3ad standard also mandates - that frames be delivered in order (within certain limits), so - in general single connections will not see misordering of - packets. The 802.3ad mode does have some drawbacks: the - standard mandates that all devices in the aggregate operate at - the same speed and duplex. 
Also, as with all bonding load - balance modes other than balance-rr, no single connection will - be able to utilize more than a single interface's worth of - bandwidth. - - Additionally, the linux bonding 802.3ad implementation - distributes traffic by peer (using an XOR of MAC addresses - and packet type ID), so in a "gatewayed" configuration, all - outgoing traffic will generally use the same device. Incoming - traffic may also end up on a single device, but that is - dependent upon the balancing policy of the peer's 802.3ad - implementation. In a "local" configuration, traffic will be - distributed across the devices in the bond. - - Finally, the 802.3ad mode mandates the use of the MII monitor, - therefore, the ARP monitor is not available in this mode. - -balance-tlb: The balance-tlb mode balances outgoing traffic by peer. - Since the balancing is done according to MAC address, in a - "gatewayed" configuration (as described above), this mode will - send all traffic across a single device. However, in a - "local" network configuration, this mode balances multiple - local network peers across devices in a vaguely intelligent - manner (not a simple XOR as in balance-xor or 802.3ad mode), - so that mathematically unlucky MAC addresses (i.e., ones that - XOR to the same value) will not all "bunch up" on a single - interface. - - Unlike 802.3ad, interfaces may be of differing speeds, and no - special switch configuration is required. On the down side, - in this mode all incoming traffic arrives over a single - interface, this mode requires certain ethtool support in the - network device driver of the slave interfaces, and the ARP - monitor is not available. - -balance-alb: This mode is everything that balance-tlb is, and more. - It has all of the features (and restrictions) of balance-tlb, - and will also balance incoming traffic from local network - peers (as described in the Bonding Module Options section, - above). 
- - The only additional down side to this mode is that the network - device driver must support changing the hardware address while - the device is open. - -12.1.2 MT Link Monitoring for Single Switch Topology ----------------------------------------------------- - - The choice of link monitoring may largely depend upon which -mode you choose to use. The more advanced load balancing modes do not -support the use of the ARP monitor, and are thus restricted to using -the MII monitor (which does not provide as high a level of end to end -assurance as the ARP monitor). - -12.2 Maximum Throughput in a Multiple Switch Topology ------------------------------------------------------ - - Multiple switches may be utilized to optimize for throughput -when they are configured in parallel as part of an isolated network -between two or more systems, for example: - - +-----------+ - | Host A | - +-+---+---+-+ - | | | - +--------+ | +---------+ - | | | - +------+---+ +-----+----+ +-----+----+ - | Switch A | | Switch B | | Switch C | - +------+---+ +-----+----+ +-----+----+ - | | | - +--------+ | +---------+ - | | | - +-+---+---+-+ - | Host B | - +-----------+ - - In this configuration, the switches are isolated from one -another. One reason to employ a topology such as this is for an -isolated network with many hosts (a cluster configured for high -performance, for example), using multiple smaller switches can be more -cost effective than a single larger switch, e.g., on a network with 24 -hosts, three 24 port switches can be significantly less expensive than -a single 72 port switch. - - If access beyond the network is required, an individual host -can be equipped with an additional network device connected to an -external network; this host then additionally acts as a gateway. 
- -12.2.1 MT Bonding Mode Selection for Multiple Switch Topology -------------------------------------------------------------- - - In actual practice, the bonding mode typically employed in -configurations of this type is balance-rr. Historically, in this -network configuration, the usual caveats about out of order packet -delivery are mitigated by the use of network adapters that do not do -any kind of packet coalescing (via the use of NAPI, or because the -device itself does not generate interrupts until some number of -packets has arrived). When employed in this fashion, the balance-rr -mode allows individual connections between two hosts to effectively -utilize greater than one interface's bandwidth. - -12.2.2 MT Link Monitoring for Multiple Switch Topology ------------------------------------------------------- - - Again, in actual practice, the MII monitor is most often used -in this configuration, as performance is given preference over -availability. The ARP monitor will function in this topology, but its -advantages over the MII monitor are mitigated by the volume of probes -needed as the number of systems involved grows (remember that each -host in the network is configured with bonding). - -13. Switch Behavior Issues -========================== - -13.1 Link Establishment and Failover Delays -------------------------------------------- - - Some switches exhibit undesirable behavior with regard to the -timing of link up and down reporting by the switch. - - First, when a link comes up, some switches may indicate that -the link is up (carrier available), but not pass traffic over the -interface for some period of time. This delay is typically due to -some type of autonegotiation or routing protocol, but may also occur -during switch initialization (e.g., during recovery after a switch -failure). If you find this to be a problem, specify an appropriate -value to the updelay bonding module option to delay the use of the -relevant interface(s). 
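A minimal sketch of choosing an updelay value, assuming the miimon monitor is in use. According to the module options section of this document, updelay should be a multiple of miimon and is otherwise rounded down to the nearest multiple; the snippet reproduces that rounding so the value that takes effect is visible. The numbers 100 and 250 are assumptions chosen for illustration.

```shell
miimon=100    # MII link check interval in ms (assumed value)
updelay=250   # requested delay before using a recovered link, in ms (assumed)

# Round the requested delay down to a multiple of miimon.
effective=$(( updelay / miimon * miimon ))

# Print an options line of the kind placed in a file under /etc/modprobe.d/.
echo "options bonding miimon=$miimon updelay=$effective"
```

With these inputs the printed line carries updelay=200, meaning a recovered link is held back for two MII polling intervals before it is used again.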
- - Second, some switches may "bounce" the link state one or more -times while a link is changing state. This occurs most commonly while -the switch is initializing. Again, an appropriate updelay value may -help. - - Note that when a bonding interface has no active links, the -driver will immediately reuse the first link that goes up, even if the -updelay parameter has been specified (the updelay is ignored in this -case). If there are slave interfaces waiting for the updelay timeout -to expire, the interface that first went into that state will be -immediately reused. This reduces down time of the network if the -value of updelay has been overestimated, and since this occurs only in -cases with no connectivity, there is no additional penalty for -ignoring the updelay. - - In addition to the concerns about switch timings, if your -switches take a long time to go into backup mode, it may be desirable -to not activate a backup interface immediately after a link goes down. -Failover may be delayed via the downdelay bonding module option. - -13.2 Duplicated Incoming Packets --------------------------------- - - NOTE: Starting with version 3.0.2, the bonding driver has logic to -suppress duplicate packets, which should largely eliminate this problem. -The following description is kept for reference. - - It is not uncommon to observe a short burst of duplicated -traffic when the bonding device is first used, or after it has been -idle for some period of time. This is most easily observed by issuing -a "ping" to some other host on the network, and noticing that the -output from ping flags duplicates (typically one per slave). - - For example, on a bond in active-backup mode with five slaves -all connected to one switch, the output may appear as follows: - -# ping -n 10.0.4.2 -PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms -64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms -64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms - - This is not due to an error in the bonding driver, rather, it -is a side effect of how many switches update their MAC forwarding -tables. Initially, the switch does not associate the MAC address in -the packet with a particular switch port, and so it may send the -traffic to all ports until its MAC forwarding table is updated. Since -the interfaces attached to the bond may occupy multiple ports on a -single switch, when the switch (temporarily) floods the traffic to all -ports, the bond device receives multiple copies of the same packet -(one per slave device). - - The duplicated packet behavior is switch dependent, some -switches exhibit this, and some do not. On switches that display this -behavior, it can be induced by clearing the MAC forwarding table (on -most Cisco switches, the privileged command "clear mac address-table -dynamic" will accomplish this). - -14. Hardware Specific Considerations -==================================== - - This section contains additional information for configuring -bonding on specific hardware platforms, or for interfacing bonding -with particular switches or other devices. - -14.1 IBM BladeCenter --------------------- - - This applies to the JS20 and similar systems. - - On the JS20 blades, the bonding driver supports only -balance-rr, active-backup, balance-tlb and balance-alb modes. This is -largely due to the network topology inside the BladeCenter, detailed -below. - -JS20 network adapter information --------------------------------- - - All JS20s come with two Broadcom Gigabit Ethernet ports -integrated on the planar (that's "motherboard" in IBM-speak). 
In the -BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to -I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. -An add-on Broadcom daughter card can be installed on a JS20 to provide -two more Gigabit Ethernet ports. These ports, eth2 and eth3, are -wired to I/O Modules 3 and 4, respectively. - - Each I/O Module may contain either a switch or a passthrough -module (which allows ports to be directly connected to an external -switch). Some bonding modes require a specific BladeCenter internal -network topology in order to function; these are detailed below. - - Additional BladeCenter-specific networking information can be -found in two IBM Redbooks (www.ibm.com/redbooks): - -"IBM eServer BladeCenter Networking Options" -"IBM eServer BladeCenter Layer 2-7 Network Switching" - -BladeCenter networking configuration ------------------------------------- - - Because a BladeCenter can be configured in a very large number -of ways, this discussion will be confined to describing basic -configurations. - - Normally, Ethernet Switch Modules (ESMs) are used in I/O -modules 1 and 2. In this configuration, the eth0 and eth1 ports of a -JS20 will be connected to different internal switches (in the -respective I/O modules). - - A passthrough module (OPM or CPM, optical or copper, -passthrough module) connects the I/O module directly to an external -switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 -interfaces of a JS20 can be redirected to the outside world and -connected to a common external switch. - - Depending upon the mix of ESMs and PMs, the network will -appear to bonding as either a single switch topology (all PMs) or as a -multiple switch topology (one or more ESMs, zero or more PMs). It is -also possible to connect ESMs together, resulting in a configuration -much like the example in "High Availability in a Multiple Switch -Topology," above. 
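The ESM/PM rule above can be sketched as a small shell function. The function name classify_topology and the way the module types are passed in are hypothetical; only the rule itself (all PMs means a single switch topology, any ESM means a multiple switch topology) comes from the text.

```shell
# Classify how a BladeCenter appears to bonding, given the module type
# ("ESM" or "PM") of each I/O module that carries a slave interface.
classify_topology() {
    for module in "$@"; do
        if [ "$module" = "ESM" ]; then
            # One or more switch modules: multiple switch topology.
            echo "multiple switch"
            return 0
        fi
    done
    # Only passthrough modules: single switch topology.
    echo "single switch"
}

classify_topology PM PM
classify_topology ESM PM
```

The first call describes a chassis with passthrough modules in I/O modules 1 and 2, the second a chassis with one Ethernet Switch Module and one passthrough module.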
- -Requirements for specific modes -------------------------------- - - The balance-rr mode requires the use of passthrough modules -for devices in the bond, all connected to an common external switch. -That switch must be configured for "etherchannel" or "trunking" on the -appropriate ports, as is usual for balance-rr. - - The balance-alb and balance-tlb modes will function with -either switch modules or passthrough modules (or a mix). The only -specific requirement for these modes is that all network interfaces -must be able to reach all destinations for traffic sent over the -bonding device (i.e., the network must converge at some point outside -the BladeCenter). - - The active-backup mode has no additional requirements. - -Link monitoring issues ----------------------- - - When an Ethernet Switch Module is in place, only the ARP -monitor will reliably detect link loss to an external switch. This is -nothing unusual, but examination of the BladeCenter cabinet would -suggest that the "external" network ports are the ethernet ports for -the system, when it fact there is a switch between these "external" -ports and the devices on the JS20 system itself. The MII monitor is -only able to detect link failures between the ESM and the JS20 system. - - When a passthrough module is in place, the MII monitor does -detect failures to the "external" port, which is then directly -connected to the JS20 system. - -Other concerns --------------- - - The Serial Over LAN (SoL) link is established over the primary -ethernet (eth0) only, therefore, any loss of link to eth0 will result -in losing your SoL connection. It will not fail over with other -network traffic, as the SoL system is beyond the control of the -bonding driver. - - It may be desirable to disable spanning tree on the switch -(either the internal Ethernet Switch Module, or an external switch) to -avoid fail-over delay issues when using bonding. - - -15. Frequently Asked Questions -============================== - -1. 
Is it SMP safe? - - Yes. The old 2.0.xx channel bonding patch was not SMP safe. -The new driver was designed to be SMP safe from the start. - -2. What type of cards will work with it? - - Any Ethernet type cards (you can even mix cards - a Intel -EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, -devices need not be of the same speed. - - Starting with version 3.2.1, bonding also supports Infiniband -slaves in active-backup mode. - -3. How many bonding devices can I have? - - There is no limit. - -4. How many slaves can a bonding device have? - - This is limited only by the number of network interfaces Linux -supports and/or the number of network cards you can place in your -system. - -5. What happens when a slave link dies? - - If link monitoring is enabled, then the failing device will be -disabled. The active-backup mode will fail over to a backup link, and -other modes will ignore the failed link. The link will continue to be -monitored, and should it recover, it will rejoin the bond (in whatever -manner is appropriate for the mode). See the sections on High -Availability and the documentation for each mode for additional -information. - - Link monitoring can be enabled via either the miimon or -arp_interval parameters (described in the module parameters section, -above). In general, miimon monitors the carrier state as sensed by -the underlying network device, and the arp monitor (arp_interval) -monitors connectivity to another host on the local network. - - If no link monitoring is configured, the bonding driver will -be unable to detect link failures, and will assume that all links are -always available. This will likely result in lost packets, and a -resulting degradation of performance. The precise performance loss -depends upon the bonding mode and network configuration. - -6. Can bonding be used for High Availability? - - Yes. See the section on High Availability for details. - -7. Which switches/systems does it work with? 
- - The full answer to this depends upon the desired mode. - - In the basic balance modes (balance-rr and balance-xor), it -works with any system that supports etherchannel (also called -trunking). Most managed switches currently available have such -support, and many unmanaged switches as well. - - The advanced balance modes (balance-tlb and balance-alb) do -not have special switch requirements, but do need device drivers that -support specific features (described in the appropriate section under -module parameters, above). - - In 802.3ad mode, it works with systems that support IEEE -802.3ad Dynamic Link Aggregation. Most managed and many unmanaged -switches currently available support 802.3ad. - - The active-backup mode should work with any Layer-II switch. - -8. Where does a bonding device get its MAC address from? - - When using slave devices that have fixed MAC addresses, or when -the fail_over_mac option is enabled, the bonding device's MAC address is -the MAC address of the active slave. - - For other configurations, if not explicitly configured (with -ifconfig or ip link), the MAC address of the bonding device is taken from -its first slave device. This MAC address is then passed to all following -slaves and remains persistent (even if the first slave is removed) until -the bonding device is brought down or reconfigured. - - If you wish to change the MAC address, you can set it with -ifconfig or ip link: - -# ifconfig bond0 hw ether 00:11:22:33:44:55 - -# ip link set bond0 address 66:77:88:99:aa:bb - - The MAC address can be also changed by bringing down/up the -device and then changing its slaves (or their order): - -# ifconfig bond0 down ; modprobe -r bonding -# ifconfig bond0 .... up -# ifenslave bond0 eth... - - This method will automatically take the address from the next -slave that is added. - - To restore your slaves' MAC addresses, you need to detach them -from the bond (`ifenslave -d bond0 eth0'). 
The bonding driver will -then restore the MAC addresses that the slaves had before they were -enslaved. - -16. Resources and Links -======================= - - The latest version of the bonding driver can be found in the latest -version of the linux kernel, found on http://kernel.org - - The latest version of this document can be found in the latest kernel -source (named Documentation/networking/bonding.txt). - - Discussions regarding the usage of the bonding driver take place on the -bonding-devel mailing list, hosted at sourceforge.net. If you have questions or -problems, post them to the list. The list address is: - -bonding-devel@lists.sourceforge.net - - The administrative interface (to subscribe or unsubscribe) can -be found at: - -https://lists.sourceforge.net/lists/listinfo/bonding-devel - - Discussions regarding the development of the bonding driver take place -on the main Linux network mailing list, hosted at vger.kernel.org. The list -address is: - -netdev@vger.kernel.org - - The administrative interface (to subscribe or unsubscribe) can -be found at: - -http://vger.kernel.org/vger-lists.html#netdev - -Donald Becker's Ethernet Drivers and diag programs may be found at : - - http://web.archive.org/web/*/http://www.scyld.com/network/ - -You will also find a lot of information regarding Ethernet, NWay, MII, -etc. at www.scyld.com. 
- --- END -- diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst index caf023cc88de..3ac21e7119a7 100644 --- a/Documentation/networking/device_drivers/intel/e100.rst +++ b/Documentation/networking/device_drivers/intel/e100.rst @@ -33,7 +33,7 @@ The following features are now available in supported kernels: - SNMP Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt +/Documentation/networking/bonding.rst Identifying Your Adapter diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst index 945018207a92..ab624f1a44a8 100644 --- a/Documentation/networking/device_drivers/intel/ixgb.rst +++ b/Documentation/networking/device_drivers/intel/ixgb.rst @@ -37,7 +37,7 @@ The following features are available in this kernel: - SNMP Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt +/Documentation/networking/bonding.rst The driver information previously displayed in the /proc filesystem is not supported in this release. Alternatively, you can use ethtool (version 1.6 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index fbf845fbaff7..22b872834ef0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -44,6 +44,7 @@ Contents: atm ax25 baycom + bonding .. only:: subproject and html diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index b103fbdd0f68..4ab6d343fd86 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -50,7 +50,7 @@ config BONDING The driver supports multiple bonding modes to allow for both high performance and high availability operation. - Refer to for more + Refer to for more information. 
To compile this driver as a module, choose M here: the module -- cgit From 92f06f4226fd9bdd6fbbd2e8b84601fc14b5855e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:25 +0200 Subject: docs: networking: convert cdc_mbim.txt to ReST - add SPDX header; - mark code blocks and literals as such; - use :field: markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/cdc_mbim.rst | 355 ++++++++++++++++++++++++++++++++++ Documentation/networking/cdc_mbim.txt | 339 -------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 356 insertions(+), 339 deletions(-) create mode 100644 Documentation/networking/cdc_mbim.rst delete mode 100644 Documentation/networking/cdc_mbim.txt diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst new file mode 100644 index 000000000000..0048409c06b4 --- /dev/null +++ b/Documentation/networking/cdc_mbim.rst @@ -0,0 +1,355 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +cdc_mbim - Driver for CDC MBIM Mobile Broadband modems +====================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. 
But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +:Type: Boolean +:Valid Range: N/Y (0-1) +:Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver. + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides a userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications include, but are not limited to: + + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session requires at least these actions by the +management application: + + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device.
+ +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example:: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The management application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentation, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up to the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +--------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices.
This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + +:: + + #include <stdio.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <linux/types.h> + #include <linux/usb/cdc-wdm.h> + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol. + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3:: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3.
+ + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +They are not intended to be presented as-is to an end user. The +assumption is that a userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header are +arbitrary, with the following exceptions: + + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink:: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service.
Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoids +unnecessary copying of unrelated IP session data to userspace. For +example:: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device.
+ +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example:: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel:: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 + 256 - 511 DSS 0 - 255 + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + + 1) USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + + - http://www.usb.org/developers/docs/devclass_docs/ + + 2) USB Implementers Forum, Inc. 
- "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + + - http://www.usb.org/developers/docs/devclass_docs/ + + 3) libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + + - http://www.freedesktop.org/wiki/Software/libmbim/ + + 4) ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + + - http://www.freedesktop.org/wiki/Software/ModemManager/ + + 5) "MBIM (Mobile Broadband Interface Model) Registry" + + - http://compliance.usb.org/mbim/ + + 6) "/sys/kernel/debug/usb/devices output format" + + - Documentation/driver-api/usb/usb.rst + + 7) "/sys/bus/usb/devices/.../descriptors" + + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt deleted file mode 100644 index 4e68f0bc5dba..000000000000 --- a/Documentation/networking/cdc_mbim.txt +++ /dev/null @@ -1,339 +0,0 @@ - cdc_mbim - Driver for CDC MBIM Mobile Broadband modems - ======================================================== - -The cdc_mbim driver supports USB devices conforming to the "Universal -Serial Bus Communications Class Subclass Specification for Mobile -Broadband Interface Model" [1], which is a further development of -"Universal Serial Bus Communications Class Subclass Specifications for -Network Control Model Devices" [2] optimized for Mobile Broadband -devices, aka "3G/LTE modems". - - -Command Line Parameters -======================= - -The cdc_mbim driver has no parameters of its own. 
But the probing -behaviour for NCM 1.0 backwards compatible MBIM functions (an -"NCM/MBIM function" as defined in section 3.2 of [1]) is affected -by a cdc_ncm driver parameter: - -prefer_mbim ------------ -Type: Boolean -Valid Range: N/Y (0-1) -Default Value: Y (MBIM is preferred) - -This parameter sets the system policy for NCM/MBIM functions. Such -functions will be handled by either the cdc_ncm driver or the cdc_mbim -driver depending on the prefer_mbim setting. Setting prefer_mbim=N -makes the cdc_mbim driver ignore these functions and lets the cdc_ncm -driver handle them instead. - -The parameter is writable, and can be changed at any time. A manual -unbind/bind is required to make the change effective for NCM/MBIM -functions bound to the "wrong" driver - - -Basic usage -=========== - -MBIM functions are inactive when unmanaged. The cdc_mbim driver only -provides a userspace interface to the MBIM control channel, and will -not participate in the management of the function. This implies that a -userspace MBIM management application always is required to enable a -MBIM function. - -Such userspace applications includes, but are not limited to: - - mbimcli (included with the libmbim [3] library), and - - ModemManager [4] - -Establishing a MBIM IP session reequires at least these actions by the -management application: - - open the control channel - - configure network connection settings - - connect to network - - configure IP interface - -Management application development ----------------------------------- -The driver <-> userspace interfaces are described below. The MBIM -control channel protocol is described in [1]. - - -MBIM control channel userspace ABI -================================== - -/dev/cdc-wdmX character device ------------------------------- -The driver creates a two-way pipe to the MBIM function control channel -using the cdc-wdm driver as a subdriver. The userspace end of the -control channel pipe is a /dev/cdc-wdmX character device. 
- -The cdc_mbim driver does not process or police messages on the control -channel. The channel is fully delegated to the userspace management -application. It is therefore up to this application to ensure that it -complies with all the control channel requirements in [1]. - -The cdc-wdmX device is created as a child of the MBIM control -interface USB device. The character device associated with a specific -MBIM function can be looked up using sysfs. For example: - - bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc - cdc-wdm0 - - bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev - 180:0 - - -USB configuration descriptors ------------------------------ -The wMaxControlMessage field of the CDC MBIM functional descriptor -limits the maximum control message size. The managament application is -responsible for negotiating a control message size complying with the -requirements in section 9.3.1 of [1], taking this descriptor field -into consideration. - -The userspace application can access the CDC MBIM functional -descriptor of a MBIM function using either of the two USB -configuration descriptor kernel interfaces described in [6] or [7]. - -See also the ioctl documentation below. - - -Fragmentation -------------- -The userspace application is responsible for all control message -fragmentation and defragmentaion, as described in section 9.5 of [1]. - - -/dev/cdc-wdmX write() ---------------------- -The MBIM control messages from the management application *must not* -exceed the negotiated control message size. - - -/dev/cdc-wdmX read() --------------------- -The management application *must* accept control messages of up the -negotiated control message size. - - -/dev/cdc-wdmX ioctl() --------------------- -IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size -This ioctl returns the wMaxControlMessage field of the CDC MBIM -functional descriptor for MBIM devices. 
This is intended as a -convenience, eliminating the need to parse the USB descriptors from -userspace. - - #include <stdio.h> - #include <fcntl.h> - #include <sys/ioctl.h> - #include <linux/types.h> - #include <linux/usb/cdc-wdm.h> - int main() - { - __u16 max; - int fd = open("/dev/cdc-wdm0", O_RDWR); - if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) - printf("wMaxControlMessage is %d\n", max); - } - - -Custom device services ----------------------- -The MBIM specification allows vendors to freely define additional -services. This is fully supported by the cdc_mbim driver. - -Support for new MBIM services, including vendor specified services, is -implemented entirely in userspace, like the rest of the MBIM control -protocol - -New services should be registered in the MBIM Registry [5]. - - - -MBIM data channel userspace ABI -=============================== - -wwanY network device --------------------- -The cdc_mbim driver represents the MBIM data channel as a single -network device of the "wwan" type. This network device is initially -mapped to MBIM IP session 0. - - -Multiplexed IP sessions (IPS) ------------------------------ -MBIM allows multiplexing up to 256 IP sessions over a single USB data -channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN -subdevices of the master wwanY device, mapping MBIM IP session Z to -VLAN ID Z for all values of Z greater than 0. - -The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure -described in section 10.5.1 of [1]. - -The userspace management application is responsible for adding new -VLAN links prior to establishing MBIM IP sessions where the SessionId -is greater than 0. These links can be added by using the normal VLAN -kernel interfaces, either ioctl or netlink. - -For example, adding a link for a MBIM IP session with SessionId 3: - - ip link add link wwan0 name wwan0.3 type vlan id 3 - -The driver will automatically map the "wwan0.3" network device to MBIM -IP session 3.
- - -Device Service Streams (DSS) ----------------------------- -MBIM also allows up to 256 non-IP data streams to be multiplexed over -the same shared USB data channel. The cdc_mbim driver models these -sessions as another set of 802.1q VLAN subdevices of the master wwanY -device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values -of A. - -The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO -structure described in section 10.5.29 of [1]. - -The DSS VLAN subdevices are used as a practical interface between the -shared MBIM data channel and a MBIM DSS aware userspace application. -It is not intended to be presented as-is to an end user. The -assumption is that a userspace application initiating a DSS session -also takes care of the necessary framing of the DSS data, presenting -the stream to the end user in an appropriate way for the stream type. - -The network device ABI requires a dummy ethernet header for every DSS -data frame being transported. The contents of this header is -arbitrary, with the following exceptions: - - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped - - RX frames will have the protocol field set to ETH_P_802_3 (but will - not be properly formatted 802.3 frames) - - RX frames will have the destination address set to the hardware - address of the master device - -The DSS supporting userspace management application is responsible for -adding the dummy ethernet header on TX and stripping it on RX. - -This is a simple example using tools commonly available, exporting -DssSessionId 5 as a pty character device pointed to by a /dev/nmea -symlink: - - ip link add link wwan0 name wwan0.dss5 type vlan id 261 - ip link set dev wwan0.dss5 up - socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea - -This is only an example, most suitable for testing out a DSS -service. 
Userspace applications supporting specific MBIM DSS services -are expected to use the tools and programming interfaces required by -that service. - -Note that adding VLAN links for DSS sessions is entirely optional. A -management application may instead choose to bind a packet socket -directly to the master network device, using the received VLAN tags to -map frames to the correct DSS session and adding 18 byte VLAN ethernet -headers with the appropriate tag on TX. In this case using a socket -filter is recommended, matching only the DSS VLAN subset. This avoid -unnecessary copying of unrelated IP session data to userspace. For -example: - - static struct sock_filter dssfilter[] = { - /* use special negative offsets to get VLAN tag */ - BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), - BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ - - /* verify DSS VLAN range */ - BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), - BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ - BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ - - /* verify ethertype */ - BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), - BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), - - BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ - BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ - }; - - - -Tagged IP session 0 VLAN ------------------------- -As described above, MBIM IP session 0 is treated as special by the -driver. It is initially mapped to untagged frames on the wwanY -network device. 
- -This mapping implies a few restrictions on multiplexed IPS and DSS -sessions, which may not always be practical: - - no IPS or DSS session can use a frame size greater than the MTU on - IP session 0 - - no IPS or DSS session can be in the up state unless the network - device representing IP session 0 also is up - -These problems can be avoided by optionally making the driver map IP -session 0 to a VLAN subdevice, similar to all other IP sessions. This -behaviour is triggered by adding a VLAN link for the magic VLAN ID -4094. The driver will then immediately start mapping MBIM IP session -0 to this VLAN, and will drop untagged frames on the master wwanY -device. - -Tip: It might be less confusing to the end user to name this VLAN -subdevice after the MBIM SessionID instead of the VLAN ID. For -example: - - ip link add link wwan0 name wwan0.0 type vlan id 4094 - - -VLAN mapping ------------- - -Summarizing the cdc_mbim driver mapping described above, we have this -relationship between VLAN tags on the wwanY network device and MBIM -sessions on the shared USB data channel: - - VLAN ID MBIM type MBIM SessionID Notes - --------------------------------------------------------- - untagged IPS 0 a) - 1 - 255 IPS 1 - 255 - 256 - 511 DSS 0 - 255 - 512 - 4093 b) - 4094 IPS 0 c) - - a) if no VLAN ID 4094 link exists, else dropped - b) unsupported VLAN range, unconditionally dropped - c) if a VLAN ID 4094 link exists, else dropped - - - - -References -========== - -[1] USB Implementers Forum, Inc. - "Universal Serial Bus - Communications Class Subclass Specification for Mobile Broadband - Interface Model", Revision 1.0 (Errata 1), May 1, 2013 - - http://www.usb.org/developers/docs/devclass_docs/ - -[2] USB Implementers Forum, Inc. 
- "Universal Serial Bus - Communications Class Subclass Specifications for Network Control - Model Devices", Revision 1.0 (Errata 1), November 24, 2010 - - http://www.usb.org/developers/docs/devclass_docs/ - -[3] libmbim - "a glib-based library for talking to WWAN modems and - devices which speak the Mobile Interface Broadband Model (MBIM) - protocol" - - http://www.freedesktop.org/wiki/Software/libmbim/ - -[4] ModemManager - "a DBus-activated daemon which controls mobile - broadband (2G/3G/4G) devices and connections" - - http://www.freedesktop.org/wiki/Software/ModemManager/ - -[5] "MBIM (Mobile Broadband Interface Model) Registry" - - http://compliance.usb.org/mbim/ - -[6] "/sys/kernel/debug/usb/devices output format" - - Documentation/driver-api/usb/usb.rst - -[7] "/sys/bus/usb/devices/.../descriptors" - - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 22b872834ef0..55802abd65a0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -45,6 +45,7 @@ Contents: ax25 baycom bonding + cdc_mbim .. only:: subproject and html -- cgit From 99b0e82dc5e36edb625f519121d4398628f05e95 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:26 +0200 Subject: docs: networking: convert cops.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/cops.rst | 80 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/cops.txt | 63 ------------------------------ Documentation/networking/index.rst | 1 + drivers/net/appletalk/Kconfig | 2 +- 4 files changed, 82 insertions(+), 64 deletions(-) create mode 100644 Documentation/networking/cops.rst delete mode 100644 Documentation/networking/cops.txt diff --git a/Documentation/networking/cops.rst b/Documentation/networking/cops.rst new file mode 100644 index 000000000000..964ba80599a9 --- /dev/null +++ b/Documentation/networking/cops.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +The COPS LocalTalk Linux driver (cops.c) +======================================== + +By Jay Schulist + +This driver has two modes and they are: Dayna mode and Tangent mode. +Each mode corresponds with the type of card. It has been found +that there are 2 main types of cards and all other cards are +the same and just have different names or only have minor differences +such as more IO ports. As this driver is tested it will +become more clear exactly what cards are supported. + +Right now these cards are known to work with the COPS driver. The +LT-200 cards work in a somewhat more limited capacity than the +DL200 cards, which work very well and are in use by many people. + +TANGENT driver mode: + - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200 + +DAYNA driver mode: + - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95, + - Farallon PhoneNET PC III, Farallon PhoneNET PC II + +Other cards possibly supported mode unknown though: + - Dayna DL2000 (Full length) + +The COPS driver defaults to using Dayna mode. To change the driver's +mode if you built a driver with dual support use board_type=1 or +board_type=2 for Dayna or Tangent with insmod. 
+ +Operation/loading of the driver +=============================== + +Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #) +If you do not specify any options the driver will try and use the IO = 0x240, +IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing. + +To load multiple COPS driver Localtalk cards you can do one of the following:: + + insmod cops io=0x240 irq=5 + insmod -o cops2 cops io=0x260 irq=3 + +Or in lilo.conf put something like this:: + + append="ether=5,0x240,lt0 ether=3,0x260,lt1" + +Then bring up the interface with ifconfig. It will look something like this:: + + lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00 + inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 + UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1 + RX packets:0 errors:0 dropped:0 overruns:0 frame:0 + TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0 + +Netatalk Configuration +====================== + +You will need to configure atalkd with something like the following to make +it work with the cops.c driver. + +* For single LTalk card use:: + + dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033" + lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple cards, Ethernet and LocalTalk:: + + eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033" + lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple LocalTalk cards, and an Ethernet card. + +* Order seems to matter here, Ethernet last:: + + lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1" + lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2" + eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk" diff --git a/Documentation/networking/cops.txt b/Documentation/networking/cops.txt deleted file mode 100644 index 3e344b448e07..000000000000 --- a/Documentation/networking/cops.txt +++ /dev/null @@ -1,63 +0,0 @@ -Text File for the COPS LocalTalk Linux driver (cops.c). 
- By Jay Schulist - -This driver has two modes and they are: Dayna mode and Tangent mode. -Each mode corresponds with the type of card. It has been found -that there are 2 main types of cards and all other cards are -the same and just have different names or only have minor differences -such as more IO ports. As this driver is tested it will -become more clear exactly what cards are supported. - -Right now these cards are known to work with the COPS driver. The -LT-200 cards work in a somewhat more limited capacity than the -DL200 cards, which work very well and are in use by many people. - -TANGENT driver mode: - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200 -DAYNA driver mode: - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95, - Farallon PhoneNET PC III, Farallon PhoneNET PC II -Other cards possibly supported mode unknown though: - Dayna DL2000 (Full length) - -The COPS driver defaults to using Dayna mode. To change the driver's -mode if you built a driver with dual support use board_type=1 or -board_type=2 for Dayna or Tangent with insmod. - -** Operation/loading of the driver. -Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #) -If you do not specify any options the driver will try and use the IO = 0x240, -IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing. - -To load multiple COPS driver Localtalk cards you can do one of the following. - -insmod cops io=0x240 irq=5 -insmod -o cops2 cops io=0x260 irq=3 - -Or in lilo.conf put something like this: - append="ether=5,0x240,lt0 ether=3,0x260,lt1" - -Then bring up the interface with ifconfig. 
It will look something like this: -lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00 - inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 - UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1 - RX packets:0 errors:0 dropped:0 overruns:0 frame:0 - TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0 - -** Netatalk Configuration -You will need to configure atalkd with something like the following to make -it work with the cops.c driver. - -* For single LTalk card use. -dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033" -lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" - -* For multiple cards, Ethernet and LocalTalk. -eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033" -lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" - -* For multiple LocalTalk cards, and an Ethernet card. -* Order seems to matter here, Ethernet last. -lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1" -lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2" -eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk" diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 55802abd65a0..7b596810d479 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -46,6 +46,7 @@ Contents: baycom bonding cdc_mbim + cops .. only:: subproject and html diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig index af509b05ac5c..d4e51c048f62 100644 --- a/drivers/net/appletalk/Kconfig +++ b/drivers/net/appletalk/Kconfig @@ -59,7 +59,7 @@ config COPS package. This driver is experimental, which means that it may not work. This driver will only work if you choose "AppleTalk DDP" networking support, above. - Please read the file . + Please read the file . 
config COPS_DAYNA bool "Dayna firmware support" -- cgit From 9a9891fbdf935c270388fca856c117ad71c02458 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:27 +0200 Subject: docs: networking: convert cxacru.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/cxacru.rst | 120 ++++++++++++++++++++++++++++++++++++ Documentation/networking/cxacru.txt | 100 ------------------------------ Documentation/networking/index.rst | 1 + 3 files changed, 121 insertions(+), 100 deletions(-) create mode 100644 Documentation/networking/cxacru.rst delete mode 100644 Documentation/networking/cxacru.txt diff --git a/Documentation/networking/cxacru.rst b/Documentation/networking/cxacru.rst new file mode 100644 index 000000000000..6088af2ffeda --- /dev/null +++ b/Documentation/networking/cxacru.rst @@ -0,0 +1,120 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +ATM cxacru device driver +======================== + +Firmware is required for this device: http://accessrunner.sourceforge.net/ + +While it is capable of managing/maintaining the ADSL connection without the +module loaded, the device will sometimes stop responding after unloading the +driver and it is necessary to unplug/remove power to the device to fix this. + +Note: support for cxacru-cf.bin has been removed. It was not loaded correctly +so it had no effect on the device configuration. Fixing it could have stopped +existing devices working when an invalid configuration is supplied. + +There is a script cxacru-cf.py to convert an existing file to the sysfs form. + +Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ +these are directories named cxacruN where N is the device number. 
A symlink +named device points to the USB interface device's directory which contains +several sysfs attribute files for retrieving device statistics: + +* adsl_controller_version + +* adsl_headend +* adsl_headend_environment + + - Information about the remote headend. + +* adsl_config + + - Configuration writing interface. + - Write parameters in hexadecimal format =, + separated by whitespace, e.g.: + + "1=0 a=5" + + - Up to 7 parameters at a time will be sent and the modem will restart + the ADSL connection when any value is set. These are logged for future + reference. + +* downstream_attenuation (dB) +* downstream_bits_per_frame +* downstream_rate (kbps) +* downstream_snr_margin (dB) + + - Downstream stats. + +* upstream_attenuation (dB) +* upstream_bits_per_frame +* upstream_rate (kbps) +* upstream_snr_margin (dB) +* transmitter_power (dBm/Hz) + + - Upstream stats. + +* downstream_crc_errors +* downstream_fec_errors +* downstream_hec_errors +* upstream_crc_errors +* upstream_fec_errors +* upstream_hec_errors + + - Error counts. + +* line_startable + + - Indicates that ADSL support on the device + is/can be enabled, see adsl_start. + +* line_status + + - "initialising" + - "down" + - "attempting to activate" + - "training" + - "channel analysis" + - "exchange" + - "waiting" + - "up" + + Changes between "down" and "attempting to activate" + if there is no signal. + +* link_status + + - "not connected" + - "connected" + - "lost" + +* mac_address + +* modulation + + - "" (when not connected) + - "ANSI T1.413" + - "ITU-T G.992.1 (G.DMT)" + - "ITU-T G.992.2 (G.LITE)" + +* startup_attempts + + - Count of total attempts to initialise ADSL. 
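The adsl_config write format described in the list above can be sketched as follows. This is a minimal illustration, not part of the driver: the helper name `format_adsl_config` is hypothetical, and the "hexadecimal index=value" reading of the format is an assumption inferred from the "1=0 a=5" example in the text.

```python
def format_adsl_config(params):
    """Render {index: value} as whitespace-separated hexadecimal pairs.

    Hypothetical helper: the index=value layout is inferred from the
    "1=0 a=5" example in the text, not taken from the driver source.
    """
    for index, value in params.items():
        if index < 0 or value < 0:
            raise ValueError("indexes and values must be non-negative")
    # Hex digits without a 0x prefix, pairs separated by a single space.
    return " ".join(f"{index:x}={value:x}" for index, value in params.items())


if __name__ == "__main__":
    print(format_adsl_config({0x1: 0x0, 0xA: 0x5}))
```

The resulting string would then be written to the adsl_config attribute under the device symlink of the relevant cxacruN directory; that step is left out here because it needs the hardware.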
+ +To enable/disable ADSL, the following can be written to the adsl_state file: + + - "start" + - "stop + - "restart" (stops, waits 1.5s, then starts) + - "poll" (used to resume status polling if it was disabled due to failure) + +Changes in adsl/line state are reported via kernel log messages:: + + [4942145.150704] ATM dev 0: ADSL state: running + [4942243.663766] ATM dev 0: ADSL line: down + [4942249.665075] ATM dev 0: ADSL line: attempting to activate + [4942253.654954] ATM dev 0: ADSL line: training + [4942255.666387] ATM dev 0: ADSL line: channel analysis + [4942259.656262] ATM dev 0: ADSL line: exchange + [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/cxacru.txt deleted file mode 100644 index 2cce04457b4d..000000000000 --- a/Documentation/networking/cxacru.txt +++ /dev/null @@ -1,100 +0,0 @@ -Firmware is required for this device: http://accessrunner.sourceforge.net/ - -While it is capable of managing/maintaining the ADSL connection without the -module loaded, the device will sometimes stop responding after unloading the -driver and it is necessary to unplug/remove power to the device to fix this. - -Note: support for cxacru-cf.bin has been removed. It was not loaded correctly -so it had no effect on the device configuration. Fixing it could have stopped -existing devices working when an invalid configuration is supplied. - -There is a script cxacru-cf.py to convert an existing file to the sysfs form. - -Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ -these are directories named cxacruN where N is the device number. A symlink -named device points to the USB interface device's directory which contains -several sysfs attribute files for retrieving device statistics: - -* adsl_controller_version - -* adsl_headend -* adsl_headend_environment - Information about the remote headend. - -* adsl_config - Configuration writing interface. 
- Write parameters in hexadecimal format =, - separated by whitespace, e.g.: - "1=0 a=5" - Up to 7 parameters at a time will be sent and the modem will restart - the ADSL connection when any value is set. These are logged for future - reference. - -* downstream_attenuation (dB) -* downstream_bits_per_frame -* downstream_rate (kbps) -* downstream_snr_margin (dB) - Downstream stats. - -* upstream_attenuation (dB) -* upstream_bits_per_frame -* upstream_rate (kbps) -* upstream_snr_margin (dB) -* transmitter_power (dBm/Hz) - Upstream stats. - -* downstream_crc_errors -* downstream_fec_errors -* downstream_hec_errors -* upstream_crc_errors -* upstream_fec_errors -* upstream_hec_errors - Error counts. - -* line_startable - Indicates that ADSL support on the device - is/can be enabled, see adsl_start. - -* line_status - "initialising" - "down" - "attempting to activate" - "training" - "channel analysis" - "exchange" - "waiting" - "up" - - Changes between "down" and "attempting to activate" - if there is no signal. - -* link_status - "not connected" - "connected" - "lost" - -* mac_address - -* modulation - "" (when not connected) - "ANSI T1.413" - "ITU-T G.992.1 (G.DMT)" - "ITU-T G.992.2 (G.LITE)" - -* startup_attempts - Count of total attempts to initialise ADSL. 
- -To enable/disable ADSL, the following can be written to the adsl_state file: - "start" - "stop - "restart" (stops, waits 1.5s, then starts) - "poll" (used to resume status polling if it was disabled due to failure) - -Changes in adsl/line state are reported via kernel log messages: - [4942145.150704] ATM dev 0: ADSL state: running - [4942243.663766] ATM dev 0: ADSL line: down - [4942249.665075] ATM dev 0: ADSL line: attempting to activate - [4942253.654954] ATM dev 0: ADSL line: training - [4942255.666387] ATM dev 0: ADSL line: channel analysis - [4942259.656262] ATM dev 0: ADSL line: exchange - [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7b596810d479..4c8e896490e0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -47,6 +47,7 @@ Contents: bonding cdc_mbim cops + cxacru .. only:: subproject and html -- cgit From 33155bac6519f545137d9c46d2e59e5f8332dd50 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:28 +0200 Subject: docs: networking: convert dccp.txt to ReST - add SPDX header; - adjust title markup; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/dccp.rst | 216 +++++++++++++++++++++++++++++++++++++ Documentation/networking/dccp.txt | 207 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 217 insertions(+), 207 deletions(-) create mode 100644 Documentation/networking/dccp.rst delete mode 100644 Documentation/networking/dccp.txt diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst new file mode 100644 index 000000000000..dde16be04456 --- /dev/null +++ b/Documentation/networking/dccp.rst @@ -0,0 +1,216 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +DCCP protocol +============= + + +.. Contents + - Introduction + - Missing features + - Socket options + - Sysctl variables + - IOCTLs + - Other tunables + - Notes + + +Introduction +============ +Datagram Congestion Control Protocol (DCCP) is an unreliable, connection +oriented protocol designed to solve issues present in UDP and TCP, particularly +for real-time and multimedia (streaming) traffic. +It divides into a base protocol (RFC 4340) and pluggable congestion control +modules called CCIDs. Like pluggable TCP congestion control, at least one CCID +needs to be enabled in order for the protocol to function properly. In the Linux +implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as +the TCP-friendly CCID3 (RFC 4342), are optional. +For a brief introduction to CCIDs and suggestions for choosing a CCID to match +given applications, see section 10 of RFC 4340. + +It has a base protocol and pluggable congestion control IDs (CCIDs). + +DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol +is at http://www.ietf.org/html.charters/dccp-charter.html + + +Missing features +================ +The Linux DCCP implementation does not currently support all the features that are +specified in RFCs 4340...42. 
+ +The known bugs are at: + + http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP + +For more up-to-date versions of the DCCP implementation, please consider using +the experimental DCCP test tree; instructions for checking this out are on: +http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree + + +Socket options +============== +DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes +a policy ID as argument and can only be set before the connection (i.e. changes +during an established connection are not supported). Currently, two policies are +defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, +and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an +u32 priority value as ancillary data to sendmsg(), where higher numbers indicate +a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to +be formatted using a cmsg(3) message header filled in as follows:: + + cmsg->cmsg_level = SOL_DCCP; + cmsg->cmsg_type = DCCP_SCM_PRIORITY; + cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ + +DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero +value is always interpreted as unbounded queue length. If different from zero, +the interpretation of this parameter depends on the current dequeuing policy +(see above): the "simple" policy will enforce a fixed queue size by returning +EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the +lowest-priority packet first. The default value for this parameter is +initialised from /proc/sys/net/dccp/default/tx_qlen. + +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). 
On active sockets this is set before connect(); specifying more +than one code has no effect (all subsequent service codes are ignored). The +case is different for passive sockets, where multiple service codes (up to 32) +can be set before calling bind(). + +DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet +size (application payload size) in bytes, see RFC 4340, section 14. + +DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs +supported by the endpoint. The option value is an array of type uint8_t whose +size is passed as option length. The minimum array size is 4 elements, the +value returned in the optlen argument always reflects the true number of +built-in CCIDs. + +DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same +time, combining the operation of the next two socket options. This option is +preferable over the latter two, since often applications will use the same +type of CCID for both directions; and mixed use of CCIDs is not currently well +understood. This socket option takes as argument at least one uint8_t value, or +an array of uint8_t values, which must match available CCIDS (see above). CCIDs +must be registered on the socket before calling connect() or listen(). + +DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets +the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. +Please note that the getsockopt argument type here is ``int``, not uint8_t. + +DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. + +DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold +timewait state when closing the connection (RFC 4340, 8.3). The usual case is +that the closing server sends a CloseReq, whereupon the client holds timewait +state. When this boolean socket option is on, the server sends a Close instead +and will enter TIMEWAIT. This option must be set after accept() returns. 
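Referring back to the DCCPQ_POLICY_PRIO ancillary data described under DCCP_SOCKOPT_QPOLICY_ID, the cmsg fields can be sketched from user space as below. This is only a sketch under stated assumptions: Python's socket module does not export the DCCP constants, so the numeric values of `SOL_DCCP` and `DCCP_SCM_PRIORITY` are assumed here and should be checked against the kernel headers; the helper names are hypothetical.

```python
import socket
import struct

# Assumed values, believed to match <linux/socket.h> and <linux/dccp.h>.
# Verify them against the headers of the running kernel before use.
SOL_DCCP = 269
DCCP_SCM_PRIORITY = 1


def priority_ancdata(priority):
    """Build the ancillary-data list for sendmsg() carrying a u32 priority."""
    payload = struct.pack("=I", priority)  # native byte order, 4 bytes
    return [(SOL_DCCP, DCCP_SCM_PRIORITY, payload)]


def priority_cmsg_len():
    """Counterpart of CMSG_LEN(sizeof(uint32_t)) in the C snippet."""
    return socket.CMSG_LEN(struct.calcsize("=I"))
```

On a connected DCCP socket that selected the priority policy, the list would be passed as the second argument of `sock.sendmsg([data], priority_ancdata(5))`; that call is not shown running because it requires kernel DCCP support.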
+ +DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the +partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums +always cover the entire packet and that only fully covered application data is +accepted by the receiver. Hence, when using this feature on the sender, it must +be enabled at the receiver, too with suitable choice of CsCov. + +DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the + range 0..15 are acceptable. The default setting is 0 (full coverage), + values between 1..15 indicate partial coverage. + +DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it + sets a threshold, where again values 0..15 are acceptable. The default + of 0 means that all packets with a partial coverage will be discarded. + Values in the range 1..15 indicate that packets with minimally such a + coverage value are also acceptable. The higher the number, the more + restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage + settings are inherited to the child socket after accept(). + +The following two options apply to CCID 3 exclusively and are getsockopt()-only. +In either case, a TFRC info struct (defined in ) is returned. + +DCCP_SOCKOPT_CCID_RX_INFO + Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_rx_info). + +DCCP_SOCKOPT_CCID_TX_INFO + Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_tx_info). + +On unidirectional connections it is useful to close the unused half-connection +via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. 
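The receiver-side rule for DCCP_SOCKOPT_RECV_CSCOV described above can be restated as a small predicate. This is a paraphrase of the text, not the kernel's code; the function name `cscov_acceptable` is hypothetical.

```python
def cscov_acceptable(packet_cscov, min_cscov=0):
    """Return True if a packet's CsCov value passes the receiver threshold."""
    for name, value in (("packet_cscov", packet_cscov), ("min_cscov", min_cscov)):
        if not 0 <= value <= 15:
            raise ValueError(f"{name} must be in the range 0..15")
    if packet_cscov == 0:
        return True  # full coverage is always accepted
    if min_cscov == 0:
        return False  # default: all partially covered packets are discarded
    # Partial coverage is accepted only at or above the configured threshold.
    return packet_cscov >= min_cscov
```

With the default threshold of 0, only fully covered packets pass; raising the threshold admits partially covered packets whose CsCov is at least that value.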
+ + +Sysctl variables +================ +Several DCCP default parameters can be managed by the following sysctls +(sysctl net.dccp.default or /proc/sys/net/dccp/default): + +request_retries + The number of active connection initiation retries (the number of + Requests minus one) before timing out. In addition, it also governs + the behaviour of the other, passive side: this variable also sets + the number of times DCCP repeats sending a Response when the initial + handshake does not progress from RESPOND to OPEN (i.e. when no Ack + is received after the initial Request). This value should be greater + than 0, suggested is less than 10. Analogue of tcp_syn_retries. + +retries1 + How often a DCCP Response is retransmitted until the listening DCCP + side considers its connecting peer dead. Analogue of tcp_retries1. + +retries2 + The number of times a general DCCP packet is retransmitted. This has + importance for retransmitted acknowledgments and feature negotiation, + data packets are never retransmitted. Analogue of tcp_retries2. + +tx_ccid = 2 + Default CCID for the sender-receiver half-connection. Depending on the + choice of CCID, the Send Ack Vector feature is enabled automatically. + +rx_ccid = 2 + Default CCID for the receiver-sender half-connection; see tx_ccid. + +seq_window = 100 + The initial sequence window (sec. 7.5.2) of the sender. This influences + the local ackno validity and the remote seqno validity windows (7.5.1). + Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. + +tx_qlen = 5 + The size of the transmit buffer in packets. A value of 0 corresponds + to an unbounded transmit buffer. + +sync_ratelimit = 125 ms + The timeout between subsequent DCCP-Sync packets sent in response to + sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit + of this parameter is milliseconds; a value of 0 disables rate-limiting. 
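As an illustration of the sysctl directory named above, the current defaults could be collected with a short script. This is a sketch only: the helper name `read_dccp_defaults` is hypothetical, and it assumes every file of interest holds a single integer, which may not hold for all entries.

```python
from pathlib import Path

DCCP_DEFAULT_DIR = Path("/proc/sys/net/dccp/default")


def read_dccp_defaults(base=DCCP_DEFAULT_DIR):
    """Return {name: int} for each readable integer file under base."""
    values = {}
    if not base.is_dir():
        return values  # directory absent, e.g. DCCP support not loaded
    for entry in sorted(base.iterdir()):
        try:
            values[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-integer entries
    return values


if __name__ == "__main__":
    for name, value in read_dccp_defaults().items():
        print(f"{name} = {value}")
```

On a system without DCCP the function simply returns an empty mapping.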
+ + +IOCTLS +====== +FIONREAD + Works as in udp(7): returns in the ``int`` argument pointer the size of + the next pending datagram in bytes, or 0 when no datagram is pending. + + +Other tunables +============== +Per-route rto_min support + CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value + of the RTO timer. This setting can be modified via the 'rto_min' option + of iproute2; for example:: + + > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 + > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 + > ip route show dev wlan0 + + CCID-3 also supports the rto_min setting: it is used to define the lower + bound for the expiry of the nofeedback timer. This can be useful on LANs + with very low RTTs (e.g., loopback, Gbit ethernet). + + +Notes +===== +DCCP does not travel through NAT successfully at present on many boxes. This is +because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT +support for DCCP has been added. diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt deleted file mode 100644 index 55c575fcaf17..000000000000 --- a/Documentation/networking/dccp.txt +++ /dev/null @@ -1,207 +0,0 @@ -DCCP protocol -============= - - -Contents -======== -- Introduction -- Missing features -- Socket options -- Sysctl variables -- IOCTLs -- Other tunables -- Notes - - -Introduction -============ -Datagram Congestion Control Protocol (DCCP) is an unreliable, connection -oriented protocol designed to solve issues present in UDP and TCP, particularly -for real-time and multimedia (streaming) traffic. -It divides into a base protocol (RFC 4340) and pluggable congestion control -modules called CCIDs. Like pluggable TCP congestion control, at least one CCID -needs to be enabled in order for the protocol to function properly. In the Linux -implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as -the TCP-friendly CCID3 (RFC 4342), are optional. 
-For a brief introduction to CCIDs and suggestions for choosing a CCID to match -given applications, see section 10 of RFC 4340. - -It has a base protocol and pluggable congestion control IDs (CCIDs). - -DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol -is at http://www.ietf.org/html.charters/dccp-charter.html - - -Missing features -================ -The Linux DCCP implementation does not currently support all the features that are -specified in RFCs 4340...42. - -The known bugs are at: - http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP - -For more up-to-date versions of the DCCP implementation, please consider using -the experimental DCCP test tree; instructions for checking this out are on: -http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree - - -Socket options -============== -DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes -a policy ID as argument and can only be set before the connection (i.e. changes -during an established connection are not supported). Currently, two policies are -defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, -and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an -u32 priority value as ancillary data to sendmsg(), where higher numbers indicate -a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to -be formatted using a cmsg(3) message header filled in as follows: - cmsg->cmsg_level = SOL_DCCP; - cmsg->cmsg_type = DCCP_SCM_PRIORITY; - cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ - -DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero -value is always interpreted as unbounded queue length. 
If different from zero, -the interpretation of this parameter depends on the current dequeuing policy -(see above): the "simple" policy will enforce a fixed queue size by returning -EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the -lowest-priority packet first. The default value for this parameter is -initialised from /proc/sys/net/dccp/default/tx_qlen. - -DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of -service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, -the socket will fall back to 0 (which means that no meaningful service code -is present). On active sockets this is set before connect(); specifying more -than one code has no effect (all subsequent service codes are ignored). The -case is different for passive sockets, where multiple service codes (up to 32) -can be set before calling bind(). - -DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet -size (application payload size) in bytes, see RFC 4340, section 14. - -DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs -supported by the endpoint. The option value is an array of type uint8_t whose -size is passed as option length. The minimum array size is 4 elements, the -value returned in the optlen argument always reflects the true number of -built-in CCIDs. - -DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same -time, combining the operation of the next two socket options. This option is -preferable over the latter two, since often applications will use the same -type of CCID for both directions; and mixed use of CCIDs is not currently well -understood. This socket option takes as argument at least one uint8_t value, or -an array of uint8_t values, which must match available CCIDS (see above). CCIDs -must be registered on the socket before calling connect() or listen(). - -DCCP_SOCKOPT_TX_CCID is read/write. 
It returns the current CCID (if set) or sets -the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. -Please note that the getsockopt argument type here is `int', not uint8_t. - -DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. - -DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold -timewait state when closing the connection (RFC 4340, 8.3). The usual case is -that the closing server sends a CloseReq, whereupon the client holds timewait -state. When this boolean socket option is on, the server sends a Close instead -and will enter TIMEWAIT. This option must be set after accept() returns. - -DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the -partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums -always cover the entire packet and that only fully covered application data is -accepted by the receiver. Hence, when using this feature on the sender, it must -be enabled at the receiver, too with suitable choice of CsCov. - -DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the - range 0..15 are acceptable. The default setting is 0 (full coverage), - values between 1..15 indicate partial coverage. -DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it - sets a threshold, where again values 0..15 are acceptable. The default - of 0 means that all packets with a partial coverage will be discarded. - Values in the range 1..15 indicate that packets with minimally such a - coverage value are also acceptable. The higher the number, the more - restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage - settings are inherited to the child socket after accept(). - -The following two options apply to CCID 3 exclusively and are getsockopt()-only. -In either case, a TFRC info struct (defined in ) is returned. 
-DCCP_SOCKOPT_CCID_RX_INFO - Returns a `struct tfrc_rx_info' in optval; the buffer for optval and - optlen must be set to at least sizeof(struct tfrc_rx_info). -DCCP_SOCKOPT_CCID_TX_INFO - Returns a `struct tfrc_tx_info' in optval; the buffer for optval and - optlen must be set to at least sizeof(struct tfrc_tx_info). - -On unidirectional connections it is useful to close the unused half-connection -via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. - - -Sysctl variables -================ -Several DCCP default parameters can be managed by the following sysctls -(sysctl net.dccp.default or /proc/sys/net/dccp/default): - -request_retries - The number of active connection initiation retries (the number of - Requests minus one) before timing out. In addition, it also governs - the behaviour of the other, passive side: this variable also sets - the number of times DCCP repeats sending a Response when the initial - handshake does not progress from RESPOND to OPEN (i.e. when no Ack - is received after the initial Request). This value should be greater - than 0, suggested is less than 10. Analogue of tcp_syn_retries. - -retries1 - How often a DCCP Response is retransmitted until the listening DCCP - side considers its connecting peer dead. Analogue of tcp_retries1. - -retries2 - The number of times a general DCCP packet is retransmitted. This has - importance for retransmitted acknowledgments and feature negotiation, - data packets are never retransmitted. Analogue of tcp_retries2. - -tx_ccid = 2 - Default CCID for the sender-receiver half-connection. Depending on the - choice of CCID, the Send Ack Vector feature is enabled automatically. - -rx_ccid = 2 - Default CCID for the receiver-sender half-connection; see tx_ccid. - -seq_window = 100 - The initial sequence window (sec. 7.5.2) of the sender. This influences - the local ackno validity and the remote seqno validity windows (7.5.1). 
- Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. - -tx_qlen = 5 - The size of the transmit buffer in packets. A value of 0 corresponds - to an unbounded transmit buffer. - -sync_ratelimit = 125 ms - The timeout between subsequent DCCP-Sync packets sent in response to - sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit - of this parameter is milliseconds; a value of 0 disables rate-limiting. - - -IOCTLS -====== -FIONREAD - Works as in udp(7): returns in the `int' argument pointer the size of - the next pending datagram in bytes, or 0 when no datagram is pending. - - -Other tunables -============== -Per-route rto_min support - CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value - of the RTO timer. This setting can be modified via the 'rto_min' option - of iproute2; for example: - > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 - > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 - > ip route show dev wlan0 - CCID-3 also supports the rto_min setting: it is used to define the lower - bound for the expiry of the nofeedback timer. This can be useful on LANs - with very low RTTs (e.g., loopback, Gbit ethernet). - - -Notes -===== -DCCP does not travel through NAT successfully at present on many boxes. This is -because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT -support for DCCP has been added. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4c8e896490e0..3894043332de 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -48,6 +48,7 @@ Contents: cdc_mbim cops cxacru + dccp .. 
only:: subproject and html -- cgit From 8447bb44ef7c452cbc94c04fc38d4946a3ef9165 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:29 +0200 Subject: docs: networking: convert dctcp.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/dctcp.rst | 52 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/dctcp.txt | 44 -------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 53 insertions(+), 44 deletions(-) create mode 100644 Documentation/networking/dctcp.rst delete mode 100644 Documentation/networking/dctcp.txt diff --git a/Documentation/networking/dctcp.rst b/Documentation/networking/dctcp.rst new file mode 100644 index 000000000000..4cc8bb2dad50 --- /dev/null +++ b/Documentation/networking/dctcp.rst @@ -0,0 +1,52 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +DCTCP (DataCenter TCP) +====================== + +DCTCP is an enhancement to the TCP congestion control algorithm for data +center networks and leverages Explicit Congestion Notification (ECN) in +the data center network to provide multi-bit feedback to the end hosts. + +To enable it on end hosts:: + + sysctl -w net.ipv4.tcp_congestion_control=dctcp + sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) + +All switches in the data center network running DCTCP must support ECN +marking and be configured for marking when reaching defined switch buffer +thresholds. The default ECN marking threshold heuristic for DCTCP on +switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps, +but might need further careful tweaking.
+ +For more details, see below documents: + +Paper: + +The algorithm is further described in detail in the following two +SIGCOMM/SIGMETRICS papers: + + i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, + Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan: + + "Data Center TCP (DCTCP)", Data Center Networks session" + + Proc. ACM SIGCOMM, New Delhi, 2010. + + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf + http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192 + +ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar: + + "Analysis of DCTCP: Stability, Convergence, and Fairness" + Proc. ACM SIGMETRICS, San Jose, 2011. + + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf + +IETF informational draft: + + http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 + +DCTCP site: + + http://simula.stanford.edu/~alizade/Site/DCTCP.html diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt deleted file mode 100644 index 13a857753208..000000000000 --- a/Documentation/networking/dctcp.txt +++ /dev/null @@ -1,44 +0,0 @@ -DCTCP (DataCenter TCP) ----------------------- - -DCTCP is an enhancement to the TCP congestion control algorithm for data -center networks and leverages Explicit Congestion Notification (ECN) in -the data center network to provide multi-bit feedback to the end hosts. - -To enable it on end hosts: - - sysctl -w net.ipv4.tcp_congestion_control=dctcp - sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) - -All switches in the data center network running DCTCP must support ECN -marking and be configured for marking when reaching defined switch buffer -thresholds. The default ECN marking threshold heuristic for DCTCP on -switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps, -but might need further careful tweaking. 
- -For more details, see below documents: - -Paper: - -The algorithm is further described in detail in the following two -SIGCOMM/SIGMETRICS papers: - - i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, - Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan: - "Data Center TCP (DCTCP)", Data Center Networks session - Proc. ACM SIGCOMM, New Delhi, 2010. - http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf - http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192 - -ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar: - "Analysis of DCTCP: Stability, Convergence, and Fairness" - Proc. ACM SIGMETRICS, San Jose, 2011. - http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf - -IETF informational draft: - - http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 - -DCTCP site: - - http://simula.stanford.edu/~alizade/Site/DCTCP.html diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3894043332de..9e83d3bda4e0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -49,6 +49,7 @@ Contents: cops cxacru dccp + dctcp .. only:: subproject and html -- cgit From 9a69fb9c21c4bf4107becb877729544759bdd059 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:30 +0200 Subject: docs: networking: convert decnet.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark lists as such; - mark code blocks and literals as such; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S.
Miller --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/networking/decnet.rst | 243 ++++++++++++++++++++++++ Documentation/networking/decnet.txt | 230 ---------------------- Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- net/decnet/Kconfig | 4 +- 6 files changed, 248 insertions(+), 234 deletions(-) create mode 100644 Documentation/networking/decnet.rst delete mode 100644 Documentation/networking/decnet.txt diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f2a93c8679e8..b23ab11587a6 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -831,7 +831,7 @@ decnet.addr= [HW,NET] Format: [,] - See also Documentation/networking/decnet.txt. + See also Documentation/networking/decnet.rst. default_hugepagesz= [same as hugepagesz=] The size of the default diff --git a/Documentation/networking/decnet.rst b/Documentation/networking/decnet.rst new file mode 100644 index 000000000000..b8bc11ff8370 --- /dev/null +++ b/Documentation/networking/decnet.rst @@ -0,0 +1,243 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Linux DECnet Networking Layer Information +========================================= + +1. Other documentation.... +========================== + + - Project Home Pages + - http://www.chygwyn.com/ - Kernel info + - http://linux-decnet.sourceforge.net/ - Userland tools + - http://www.sourceforge.net/projects/linux-decnet/ - Status page + +2. Configuring the kernel +========================= + +Be sure to turn on the following options: + + - CONFIG_DECNET (obviously) + - CONFIG_PROC_FS (to see what's going on) + - CONFIG_SYSCTL (for easy configuration) + +if you want to try out router support (not properly debugged yet) +you'll need the following options as well... 
+ + - CONFIG_DECNET_ROUTER (to be able to add/delete routes) + - CONFIG_NETFILTER (will be required for the DECnet routing daemon) + +Don't turn on SIOCGIFCONF support for DECnet unless you are really sure +that you need it, in general you won't and it can cause ifconfig to +malfunction. + +Run time configuration has changed slightly from the 2.4 system. If you +want to configure an endnode, then the simplified procedure is as follows: + + - Set the MAC address on your ethernet card before starting _any_ other + network protocols. + +As soon as your network card is brought into the UP state, DECnet should +start working. If you need something more complicated or are unsure how +to set the MAC address, see the next section. Also all configurations which +worked with 2.4 will work under 2.5 with no change. + +3. Command line options +======================= + +You can set a DECnet address on the kernel command line for compatibility +with the 2.4 configuration procedure, but in general it's not needed any more. +If you do st a DECnet address on the command line, it has only one purpose +which is that its added to the addresses on the loopback device. + +With 2.4 kernels, DECnet would only recognise addresses as local if they +were added to the loopback device. In 2.5, any local interface address +can be used to loop back to the local machine. Of course this does not +prevent you adding further addresses to the loopback device if you +want to. + +N.B. Since the address list of an interface determines the addresses for +which "hello" messages are sent, if you don't set an address on the loopback +interface then you won't see any entries in /proc/net/neigh for the local +host until such time as you start a connection. This doesn't affect the +operation of the local communications in any other way though. 
+ +The kernel command line takes options looking like the following:: + + decnet.addr=1,2 + +the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels +and early 2.3.xx kernels, you must use a comma when specifying the +DECnet address like this. For more recent 2.3.xx kernels, you may +use almost any character except space, although a `.` would be the most +obvious choice :-) + +There used to be a third number specifying the node type. This option +has gone away in favour of a per interface node type. This is now set +using /proc/sys/net/decnet/conf//forwarding. This file can be +set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. + +There are also equivalent options for modules. The node address can +also be set through the /proc/sys/net/decnet/ files, as can other system +parameters. + +Currently the only supported devices are ethernet and ip_gre. The +ethernet address of your ethernet card has to be set according to the DECnet +address of the node in order for it to be autoconfigured (and then appear in +/proc/net/decnet_dev). There is a utility available at the above +FTP sites called dn2ethaddr which can compute the correct ethernet +address to use. The address can be set by ifconfig either before or +at the time the device is brought up. If you are using RedHat you can +add the line:: + + MACADDR=AA:00:04:00:03:04 + +or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or +wherever your network card's configuration lives. Setting the MAC address +of your ethernet card to an address starting with "hi-ord" will cause a +DECnet address which matches to be added to the interface (which you can +verify with iproute2). + +The default device for routing can be set through the /proc filesystem +by setting /proc/sys/net/decnet/default_device to the +device you want DECnet to route packets out of when no specific route +is available. 
Usually this will be eth0, for example:: + + echo -n "eth0" >/proc/sys/net/decnet/default_device + +If you don't set the default device, then it will default to the first +ethernet card which has been autoconfigured as described above. You can +confirm that by looking in the default_device file of course. + +There is a list of what the other files under /proc/sys/net/decnet/ do +on the kernel patch web site (shown above). + +4. Run time kernel configuration +================================ + + +This is either done through the sysctl/proc interface (see the kernel web +pages for details on what the various options do) or through the iproute2 +package in the same way as IPv4/6 configuration is performed. + +Documentation for iproute2 is included with the package, although there is +as yet no specific section on DECnet, most of the features apply to both +IP and DECnet, albeit with DECnet addresses instead of IP addresses and +a reduced functionality. + +If you want to configure a DECnet router you'll need the iproute2 package +since its the _only_ way to add and delete routes currently. Eventually +there will be a routing daemon to send and receive routing messages for +each interface and update the kernel routing tables accordingly. The +routing daemon will use netfilter to listen to routing packets, and +rtnetlink to update the kernels routing tables. + +The DECnet raw socket layer has been removed since it was there purely +for use by the routing daemon which will now use netfilter (a much cleaner +and more generic solution) instead. + +5. How can I tell if its working? +================================= + +Here is a quick guide of what to look for in order to know if your DECnet +kernel subsystem is working. + + - Is the node address set (see /proc/sys/net/decnet/node_address) + - Is the node of the correct type + (see /proc/sys/net/decnet/conf//forwarding) + - Is the Ethernet MAC address of each Ethernet card set to match + the DECnet address. 
If in doubt use the dn2ethaddr utility available + at the ftp archive. + - If the previous two steps are satisfied, and the Ethernet card is up, + you should find that it is listed in /proc/net/decnet_dev and also + that it appears as a directory in /proc/sys/net/decnet/conf/. The + loopback device (lo) should also appear and is required to communicate + within a node. + - If you have any DECnet routers on your network, they should appear + in /proc/net/decnet_neigh, otherwise this file will only contain the + entry for the node itself (if it doesn't check to see if lo is up). + - If you want to send to any node which is not listed in the + /proc/net/decnet_neigh file, you'll need to set the default device + to point to an Ethernet card with connection to a router. This is + again done with the /proc/sys/net/decnet/default_device file. + - Try starting a simple server and client, like the dnping/dnmirror + over the loopback interface. With luck they should communicate. + For this step and those after, you'll need the DECnet library + which can be obtained from the above ftp sites as well as the + actual utilities themselves. + - If this seems to work, then try talking to a node on your local + network, and see if you can obtain the same results. + - At this point you are on your own... :-) + +6. How to send a bug report +=========================== + +If you've found a bug and want to report it, then there are several things +you can do to help me work out exactly what it is that is wrong. Useful +information (_most_ of which _is_ _essential_) includes: + + - What kernel version are you running ? + - What version of the patch are you running ? + - How far though the above set of tests can you get ? + - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? + - Which services are you running ? + - Which client caused the problem ? + - How much data was being transferred ? + - Was the network congested ? + - How can the problem be reproduced ? 
+ - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of + tcpdump don't understand how to dump DECnet properly, so including + the hex listing of the packet contents is _essential_, usually the -x flag. + You may also need to increase the length grabbed with the -s flag. The + -e flag also provides very useful information (ethernet MAC addresses)) + +7. MAC FAQ +========== + +A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet +interact and how to get the best performance from your hardware. + +Ethernet cards are designed to normally only pass received network frames +to a host computer when they are addressed to it, or to the broadcast address. + +Linux has an interface which allows the setting of extra addresses for +an ethernet card to listen to. If the ethernet card supports it, the +filtering operation will be done in hardware, if not the extra unwanted packets +received will be discarded by the host computer. In the latter case, +significant processor time and bus bandwidth can be used up on a busy +network (see the NAPI documentation for a longer explanation of these +effects). + +DECnet makes use of this interface to allow running DECnet on an ethernet +card which has already been configured using TCP/IP (presumably using the +built in MAC address of the card, as usual) and/or to allow multiple DECnet +addresses on each physical interface. If you do this, be aware that if your +ethernet card doesn't support perfect hashing in its MAC address filter +then your computer will be doing more work than required. Some cards +will simply set themselves into promiscuous mode in order to receive +packets from the DECnet specified addresses. So if you have one of these +cards its better to set the MAC address of the card as described above +to gain the best efficiency. Better still is to use a card which supports +NAPI as well. + + +8. 
Mailing list +=============== + +If you are keen to get involved in development, or want to ask questions +about configuration, or even just report bugs, then there is a mailing +list that you can join, details are at: + +http://sourceforge.net/mail/?group_id=4993 + +9. Legal Info +============= + +The Linux DECnet project team have placed their code under the GPL. The +software is provided "as is" and without warranty express or implied. +DECnet is a trademark of Compaq. This software is not a product of +Compaq. We acknowledge the help of people at Compaq in providing extra +documentation above and beyond what was previously publicly available. + +Steve Whitehouse + diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt deleted file mode 100644 index d192f8b9948b..000000000000 --- a/Documentation/networking/decnet.txt +++ /dev/null @@ -1,230 +0,0 @@ - Linux DECnet Networking Layer Information - =========================================== - -1) Other documentation.... - - o Project Home Pages - http://www.chygwyn.com/ - Kernel info - http://linux-decnet.sourceforge.net/ - Userland tools - http://www.sourceforge.net/projects/linux-decnet/ - Status page - -2) Configuring the kernel - -Be sure to turn on the following options: - - CONFIG_DECNET (obviously) - CONFIG_PROC_FS (to see what's going on) - CONFIG_SYSCTL (for easy configuration) - -if you want to try out router support (not properly debugged yet) -you'll need the following options as well... - - CONFIG_DECNET_ROUTER (to be able to add/delete routes) - CONFIG_NETFILTER (will be required for the DECnet routing daemon) - -Don't turn on SIOCGIFCONF support for DECnet unless you are really sure -that you need it, in general you won't and it can cause ifconfig to -malfunction. - -Run time configuration has changed slightly from the 2.4 system. 
If you -want to configure an endnode, then the simplified procedure is as follows: - - o Set the MAC address on your ethernet card before starting _any_ other - network protocols. - -As soon as your network card is brought into the UP state, DECnet should -start working. If you need something more complicated or are unsure how -to set the MAC address, see the next section. Also all configurations which -worked with 2.4 will work under 2.5 with no change. - -3) Command line options - -You can set a DECnet address on the kernel command line for compatibility -with the 2.4 configuration procedure, but in general it's not needed any more. -If you do st a DECnet address on the command line, it has only one purpose -which is that its added to the addresses on the loopback device. - -With 2.4 kernels, DECnet would only recognise addresses as local if they -were added to the loopback device. In 2.5, any local interface address -can be used to loop back to the local machine. Of course this does not -prevent you adding further addresses to the loopback device if you -want to. - -N.B. Since the address list of an interface determines the addresses for -which "hello" messages are sent, if you don't set an address on the loopback -interface then you won't see any entries in /proc/net/neigh for the local -host until such time as you start a connection. This doesn't affect the -operation of the local communications in any other way though. - -The kernel command line takes options looking like the following: - - decnet.addr=1,2 - -the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels -and early 2.3.xx kernels, you must use a comma when specifying the -DECnet address like this. For more recent 2.3.xx kernels, you may -use almost any character except space, although a `.` would be the most -obvious choice :-) - -There used to be a third number specifying the node type. This option -has gone away in favour of a per interface node type. 
This is now set -using /proc/sys/net/decnet/conf//forwarding. This file can be -set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. - -There are also equivalent options for modules. The node address can -also be set through the /proc/sys/net/decnet/ files, as can other system -parameters. - -Currently the only supported devices are ethernet and ip_gre. The -ethernet address of your ethernet card has to be set according to the DECnet -address of the node in order for it to be autoconfigured (and then appear in -/proc/net/decnet_dev). There is a utility available at the above -FTP sites called dn2ethaddr which can compute the correct ethernet -address to use. The address can be set by ifconfig either before or -at the time the device is brought up. If you are using RedHat you can -add the line: - - MACADDR=AA:00:04:00:03:04 - -or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or -wherever your network card's configuration lives. Setting the MAC address -of your ethernet card to an address starting with "hi-ord" will cause a -DECnet address which matches to be added to the interface (which you can -verify with iproute2). - -The default device for routing can be set through the /proc filesystem -by setting /proc/sys/net/decnet/default_device to the -device you want DECnet to route packets out of when no specific route -is available. Usually this will be eth0, for example: - - echo -n "eth0" >/proc/sys/net/decnet/default_device - -If you don't set the default device, then it will default to the first -ethernet card which has been autoconfigured as described above. You can -confirm that by looking in the default_device file of course. - -There is a list of what the other files under /proc/sys/net/decnet/ do -on the kernel patch web site (shown above). 
- -4) Run time kernel configuration - -This is either done through the sysctl/proc interface (see the kernel web -pages for details on what the various options do) or through the iproute2 -package in the same way as IPv4/6 configuration is performed. - -Documentation for iproute2 is included with the package, although there is -as yet no specific section on DECnet, most of the features apply to both -IP and DECnet, albeit with DECnet addresses instead of IP addresses and -a reduced functionality. - -If you want to configure a DECnet router you'll need the iproute2 package -since its the _only_ way to add and delete routes currently. Eventually -there will be a routing daemon to send and receive routing messages for -each interface and update the kernel routing tables accordingly. The -routing daemon will use netfilter to listen to routing packets, and -rtnetlink to update the kernels routing tables. - -The DECnet raw socket layer has been removed since it was there purely -for use by the routing daemon which will now use netfilter (a much cleaner -and more generic solution) instead. - -5) How can I tell if its working ? - -Here is a quick guide of what to look for in order to know if your DECnet -kernel subsystem is working. - - - Is the node address set (see /proc/sys/net/decnet/node_address) - - Is the node of the correct type - (see /proc/sys/net/decnet/conf//forwarding) - - Is the Ethernet MAC address of each Ethernet card set to match - the DECnet address. If in doubt use the dn2ethaddr utility available - at the ftp archive. - - If the previous two steps are satisfied, and the Ethernet card is up, - you should find that it is listed in /proc/net/decnet_dev and also - that it appears as a directory in /proc/sys/net/decnet/conf/. The - loopback device (lo) should also appear and is required to communicate - within a node. 
- - If you have any DECnet routers on your network, they should appear - in /proc/net/decnet_neigh, otherwise this file will only contain the - entry for the node itself (if it doesn't check to see if lo is up). - - If you want to send to any node which is not listed in the - /proc/net/decnet_neigh file, you'll need to set the default device - to point to an Ethernet card with connection to a router. This is - again done with the /proc/sys/net/decnet/default_device file. - - Try starting a simple server and client, like the dnping/dnmirror - over the loopback interface. With luck they should communicate. - For this step and those after, you'll need the DECnet library - which can be obtained from the above ftp sites as well as the - actual utilities themselves. - - If this seems to work, then try talking to a node on your local - network, and see if you can obtain the same results. - - At this point you are on your own... :-) - -6) How to send a bug report - -If you've found a bug and want to report it, then there are several things -you can do to help me work out exactly what it is that is wrong. Useful -information (_most_ of which _is_ _essential_) includes: - - - What kernel version are you running ? - - What version of the patch are you running ? - - How far though the above set of tests can you get ? - - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? - - Which services are you running ? - - Which client caused the problem ? - - How much data was being transferred ? - - Was the network congested ? - - How can the problem be reproduced ? - - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of - tcpdump don't understand how to dump DECnet properly, so including - the hex listing of the packet contents is _essential_, usually the -x flag. - You may also need to increase the length grabbed with the -s flag. 
The - -e flag also provides very useful information (ethernet MAC addresses)) - -7) MAC FAQ - -A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet -interact and how to get the best performance from your hardware. - -Ethernet cards are designed to normally only pass received network frames -to a host computer when they are addressed to it, or to the broadcast address. - -Linux has an interface which allows the setting of extra addresses for -an ethernet card to listen to. If the ethernet card supports it, the -filtering operation will be done in hardware, if not the extra unwanted packets -received will be discarded by the host computer. In the latter case, -significant processor time and bus bandwidth can be used up on a busy -network (see the NAPI documentation for a longer explanation of these -effects). - -DECnet makes use of this interface to allow running DECnet on an ethernet -card which has already been configured using TCP/IP (presumably using the -built in MAC address of the card, as usual) and/or to allow multiple DECnet -addresses on each physical interface. If you do this, be aware that if your -ethernet card doesn't support perfect hashing in its MAC address filter -then your computer will be doing more work than required. Some cards -will simply set themselves into promiscuous mode in order to receive -packets from the DECnet specified addresses. So if you have one of these -cards its better to set the MAC address of the card as described above -to gain the best efficiency. Better still is to use a card which supports -NAPI as well. - - -8) Mailing list - -If you are keen to get involved in development, or want to ask questions -about configuration, or even just report bugs, then there is a mailing -list that you can join, details are at: - -http://sourceforge.net/mail/?group_id=4993 - -9) Legal Info - -The Linux DECnet project team have placed their code under the GPL. 
The -software is provided "as is" and without warranty express or implied. -DECnet is a trademark of Compaq. This software is not a product of -Compaq. We acknowledge the help of people at Compaq in providing extra -documentation above and beyond what was previously publicly available. - -Steve Whitehouse - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 9e83d3bda4e0..e17432492745 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -50,6 +50,7 @@ Contents: cxacru dccp dctcp + decnet .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 453fe0713e68..7323bfc1720f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4728,7 +4728,7 @@ DECnet NETWORK LAYER L: linux-decnet-user@lists.sourceforge.net S: Orphan W: http://linux-decnet.sourceforge.net -F: Documentation/networking/decnet.txt +F: Documentation/networking/decnet.rst F: net/decnet/ DECSTATION PLATFORM SUPPORT diff --git a/net/decnet/Kconfig b/net/decnet/Kconfig index 0935453ccfd5..8f98fb2f2ec9 100644 --- a/net/decnet/Kconfig +++ b/net/decnet/Kconfig @@ -15,7 +15,7 @@ config DECNET . More detailed documentation is available in - . + . Be sure to say Y to "/proc file system support" and "Sysctl support" below when using DECnet, since you will need sysctl support to aid @@ -40,4 +40,4 @@ config DECNET_ROUTER filtering" option will be required for the forthcoming routing daemon to work. - See for more information. + See for more information. -- cgit From 5f32c920c23b75654a839aa87c344b2bcaf350e2 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:31 +0200 Subject: docs: networking: convert defza.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - use :field: markup for the version number; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/defza.rst | 63 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/defza.txt | 57 ---------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 64 insertions(+), 57 deletions(-) create mode 100644 Documentation/networking/defza.rst delete mode 100644 Documentation/networking/defza.txt diff --git a/Documentation/networking/defza.rst b/Documentation/networking/defza.rst new file mode 100644 index 000000000000..73c2f793ea26 --- /dev/null +++ b/Documentation/networking/defza.rst @@ -0,0 +1,63 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver +===================================================== + +:Version: v.1.1.4 + + +DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI +network card, designed in 1990 specifically for the DECstation 5000 +model 200 workstation. The board is a single attachment station and +it was manufactured in two variations, both of which are supported. + +First is the SAS MMF DEFZA-AA option, the original design implementing +the standard MMF-PMD, however with a pair of ST connectors rather than +the usual MIC connector. The other one is the SAS ThinWire/STP DEFZA-CA +option, denoted 700-C, with the network medium selectable by a switch +between the DEC proprietary ThinWire-PMD using a BNC connector and the +standard STP-PMD using a DE-9F connector. This option can interface to +a DECconcentrator 500 device and, in the case of the STP-PMD, also other +FDDI equipment and was designed to make it easier to transition from +existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks +by providing means to reuse existing cabling. + +This driver handles any number of cards installed in a single system. +They get fddi0, fddi1, etc. interface names assigned in the order of +increasing TURBOchannel slot numbers. 
+ +The board only supports DMA on the receive side. Transmission involves +the use of PIO. As a result under a heavy transmission load there will +be a significant impact on system performance. + +The board supports a 64-entry CAM for matching destination addresses. +Two entries are preoccupied by the Directed Beacon and Ring Purger +multicast addresses and the rest is used as a multicast filter. An +all-multi mode is also supported for LLC frames and it is used if +requested explicitly or if the CAM overflows. The promiscuous mode +supports separate enables for LLC and SMT frames, but this driver +doesn't support changing them individually. + + +Known problems: + +None. + + +To do: + +5. MAC address change. The card does not support changing the Media + Access Controller's address registers but a similar effect can be + achieved by adding an alias to the CAM. There is no way to disable + matching against the original address though. + +7. Queueing incoming/outgoing SMT frames in the driver if the SMT + receive/RMC transmit ring is full. (?) + +8. Retrieving/reporting FDDI/SNMP stats. + + +Both success and failure reports are welcome. + +Maciej W. Rozycki diff --git a/Documentation/networking/defza.txt b/Documentation/networking/defza.txt deleted file mode 100644 index 663e4a906751..000000000000 --- a/Documentation/networking/defza.txt +++ /dev/null @@ -1,57 +0,0 @@ -Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver v.1.1.4. - - -DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI -network card, designed in 1990 specifically for the DECstation 5000 -model 200 workstation. The board is a single attachment station and -it was manufactured in two variations, both of which are supported. - -First is the SAS MMF DEFZA-AA option, the original design implementing -the standard MMF-PMD, however with a pair of ST connectors rather than -the usual MIC connector. 
The other one is the SAS ThinWire/STP DEFZA-CA -option, denoted 700-C, with the network medium selectable by a switch -between the DEC proprietary ThinWire-PMD using a BNC connector and the -standard STP-PMD using a DE-9F connector. This option can interface to -a DECconcentrator 500 device and, in the case of the STP-PMD, also other -FDDI equipment and was designed to make it easier to transition from -existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks -by providing means to reuse existing cabling. - -This driver handles any number of cards installed in a single system. -They get fddi0, fddi1, etc. interface names assigned in the order of -increasing TURBOchannel slot numbers. - -The board only supports DMA on the receive side. Transmission involves -the use of PIO. As a result under a heavy transmission load there will -be a significant impact on system performance. - -The board supports a 64-entry CAM for matching destination addresses. -Two entries are preoccupied by the Directed Beacon and Ring Purger -multicast addresses and the rest is used as a multicast filter. An -all-multi mode is also supported for LLC frames and it is used if -requested explicitly or if the CAM overflows. The promiscuous mode -supports separate enables for LLC and SMT frames, but this driver -doesn't support changing them individually. - - -Known problems: - -None. - - -To do: - -5. MAC address change. The card does not support changing the Media - Access Controller's address registers but a similar effect can be - achieved by adding an alias to the CAM. There is no way to disable - matching against the original address though. - -7. Queueing incoming/outgoing SMT frames in the driver if the SMT - receive/RMC transmit ring is full. (?) - -8. Retrieving/reporting FDDI/SNMP stats. - - -Both success and failure reports are welcome. - -Maciej W. 
Rozycki diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e17432492745..c893127004b9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -51,6 +51,7 @@ Contents: dccp dctcp decnet + defza .. only:: subproject and html -- cgit From 9dfe1361261be48c92fd7cb26909cbcd5d496220 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:32 +0200 Subject: docs: networking: convert dns_resolver.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/dns_resolver.rst | 155 +++++++++++++++++++++++++++++ Documentation/networking/dns_resolver.txt | 157 ------------------------------ Documentation/networking/index.rst | 1 + net/ceph/Kconfig | 2 +- net/dns_resolver/Kconfig | 2 +- net/dns_resolver/dns_key.c | 2 +- net/dns_resolver/dns_query.c | 2 +- 7 files changed, 160 insertions(+), 161 deletions(-) create mode 100644 Documentation/networking/dns_resolver.rst delete mode 100644 Documentation/networking/dns_resolver.txt diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst new file mode 100644 index 000000000000..add4d59a99a5 --- /dev/null +++ b/Documentation/networking/dns_resolver.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +DNS Resolver Module +=================== + +.. Contents: + + - Overview. + - Compilation. + - Setting up. + - Usage. + - Mechanism. + - Debugging. + + +Overview +======== + +The DNS resolver module provides a way for kernel services to make DNS queries +by way of requesting a key of key type dns_resolver. These queries are +upcalled to userspace through /sbin/request-key. 
+ +These routines must be supported by userspace tools dns.upcall, cifs.upcall and +request-key. It is under development and does not yet provide the full feature +set. The features it does support include: + + (*) Implements the dns_resolver key_type to contact userspace. + +It does not yet support the following AFS features: + + (*) Dns query support for AFSDB resource record. + +This code is extracted from the CIFS filesystem. + + +Compilation +=========== + +The module should be enabled by turning on the kernel configuration options:: + + CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" + + +Setting up +========== + +To set up this facility, the /etc/request-key.conf file must be altered so that +/sbin/request-key can appropriately direct the upcalls. For example, to handle +basic dname to IPv4/IPv6 address resolution, the following line should be +added:: + + + #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... + #====== ============ ======= ======= ========================== + create dns_resolver * * /usr/sbin/cifs.upcall %k + +To direct a query for query type 'foo', a line of the following should be added +before the more general line given above as the first match is the one taken:: + + create dns_resolver foo:* * /usr/sbin/dns.foo %k + + +Usage +===== + +To make use of this facility, one of the following functions that are +implemented in the module can be called after doing:: + + #include + + :: + + int dns_query(const char *type, const char *name, size_t namelen, + const char *options, char **_result, time_t *_expiry); + + This is the basic access function. It looks for a cached DNS query and if + it doesn't find it, it upcalls to userspace to make a new DNS query, which + may then be cached. The key description is constructed as a string of the + form:: + + [<type>:]<name> + + where <type> optionally specifies the particular upcall program to invoke, + and thus the type of query to do, and <name> specifies the string to be + looked up.
The default query type is a straight hostname to IP address + set lookup. + + The name parameter is not required to be a NUL-terminated string, and its + length should be given by the namelen argument. + + The options parameter may be NULL or it may be a set of options + appropriate to the query type. + + The return value is a string appropriate to the query type. For instance, + for the default query type it is just a list of comma-separated IPv4 and + IPv6 addresses. The caller must free the result. + + The length of the result string is returned on success, and a negative + error code is returned otherwise. -EKEYREJECTED will be returned if the + DNS lookup failed. + + If _expiry is non-NULL, the expiry time (TTL) of the result will be + returned also. + +The kernel maintains an internal keyring in which it caches looked up keys. +This can be cleared by any process that has the CAP_SYS_ADMIN capability by +the use of KEYCTL_KEYRING_CLEAR on the keyring ID. + + +Reading DNS Keys from Userspace +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + + +Mechanism +========= + +The dnsresolver module registers a key type called "dns_resolver". Keys of +this type are used to transport and cache DNS lookup results from userspace. + +When dns_query() is invoked, it calls request_key() to search the local +keyrings for a cached DNS result. If that fails to find one, it upcalls to +userspace to get a new result. + +Upcalls to userspace are made through the request_key() upcall vector, and are +directed by means of configuration lines in /etc/request-key.conf that tell +/sbin/request-key what program to run to instantiate the key. + +The upcall handler program is responsible for querying the DNS, processing the +result into a form suitable for passing to the keyctl_instantiate_key() +routine. 
This then passes the data to dns_resolver_instantiate() which strips +off and processes any options included in the data, and then attaches the +remainder of the string to the key as its payload. + +The upcall handler program should set the expiry time on the key to that of the +lowest TTL of all the records it has extracted a result from. This means that +the key will be discarded and recreated when the data it holds has expired. + +dns_query() returns a copy of the value attached to the key, or an error if +that is indicated instead. + +See for further +information about request-key function. + + +Debugging +========= + +Debugging messages can be turned on dynamically by writing a 1 into the +following file:: + + /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt deleted file mode 100644 index eaa8f9a6fd5d..000000000000 --- a/Documentation/networking/dns_resolver.txt +++ /dev/null @@ -1,157 +0,0 @@ - =================== - DNS Resolver Module - =================== - -Contents: - - - Overview. - - Compilation. - - Setting up. - - Usage. - - Mechanism. - - Debugging. - - -======== -OVERVIEW -======== - -The DNS resolver module provides a way for kernel services to make DNS queries -by way of requesting a key of key type dns_resolver. These queries are -upcalled to userspace through /sbin/request-key. - -These routines must be supported by userspace tools dns.upcall, cifs.upcall and -request-key. It is under development and does not yet provide the full feature -set. The features it does support include: - - (*) Implements the dns_resolver key_type to contact userspace. - -It does not yet support the following AFS features: - - (*) Dns query support for AFSDB resource record. - -This code is extracted from the CIFS filesystem. 
- - -=========== -COMPILATION -=========== - -The module should be enabled by turning on the kernel configuration options: - - CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" - - -========== -SETTING UP -========== - -To set up this facility, the /etc/request-key.conf file must be altered so that -/sbin/request-key can appropriately direct the upcalls. For example, to handle -basic dname to IPv4/IPv6 address resolution, the following line should be -added: - - #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... - #====== ============ ======= ======= ========================== - create dns_resolver * * /usr/sbin/cifs.upcall %k - -To direct a query for query type 'foo', a line of the following should be added -before the more general line given above as the first match is the one taken. - - create dns_resolver foo:* * /usr/sbin/dns.foo %k - - -===== -USAGE -===== - -To make use of this facility, one of the following functions that are -implemented in the module can be called after doing: - - #include - - (1) int dns_query(const char *type, const char *name, size_t namelen, - const char *options, char **_result, time_t *_expiry); - - This is the basic access function. It looks for a cached DNS query and if - it doesn't find it, it upcalls to userspace to make a new DNS query, which - may then be cached. The key description is constructed as a string of the - form: - - [<type>:]<name> - - where <type> optionally specifies the particular upcall program to invoke, - and thus the type of query to do, and <name> specifies the string to be - looked up. The default query type is a straight hostname to IP address - set lookup. - - The name parameter is not required to be a NUL-terminated string, and its - length should be given by the namelen argument. - - The options parameter may be NULL or it may be a set of options - appropriate to the query type. - - The return value is a string appropriate to the query type.
For instance, - for the default query type it is just a list of comma-separated IPv4 and - IPv6 addresses. The caller must free the result. - - The length of the result string is returned on success, and a negative - error code is returned otherwise. -EKEYREJECTED will be returned if the - DNS lookup failed. - - If _expiry is non-NULL, the expiry time (TTL) of the result will be - returned also. - -The kernel maintains an internal keyring in which it caches looked up keys. -This can be cleared by any process that has the CAP_SYS_ADMIN capability by -the use of KEYCTL_KEYRING_CLEAR on the keyring ID. - - -=============================== -READING DNS KEYS FROM USERSPACE -=============================== - -Keys of dns_resolver type can be read from userspace using keyctl_read() or -"keyctl read/print/pipe". - - -========= -MECHANISM -========= - -The dnsresolver module registers a key type called "dns_resolver". Keys of -this type are used to transport and cache DNS lookup results from userspace. - -When dns_query() is invoked, it calls request_key() to search the local -keyrings for a cached DNS result. If that fails to find one, it upcalls to -userspace to get a new result. - -Upcalls to userspace are made through the request_key() upcall vector, and are -directed by means of configuration lines in /etc/request-key.conf that tell -/sbin/request-key what program to run to instantiate the key. - -The upcall handler program is responsible for querying the DNS, processing the -result into a form suitable for passing to the keyctl_instantiate_key() -routine. This then passes the data to dns_resolver_instantiate() which strips -off and processes any options included in the data, and then attaches the -remainder of the string to the key as its payload. - -The upcall handler program should set the expiry time on the key to that of the -lowest TTL of all the records it has extracted a result from. 
This means that -the key will be discarded and recreated when the data it holds has expired. - -dns_query() returns a copy of the value attached to the key, or an error if -that is indicated instead. - -See for further -information about request-key function. - - -========= -DEBUGGING -========= - -Debugging messages can be turned on dynamically by writing a 1 into the -following file: - - /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index c893127004b9..55746038a7e9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -52,6 +52,7 @@ Contents: dctcp decnet defza + dns_resolver .. only:: subproject and html diff --git a/net/ceph/Kconfig b/net/ceph/Kconfig index 2e8e6f904920..d7bec7adc267 100644 --- a/net/ceph/Kconfig +++ b/net/ceph/Kconfig @@ -39,6 +39,6 @@ config CEPH_LIB_USE_DNS_RESOLVER be resolved using the CONFIG_DNS_RESOLVER facility. For information on how to use CONFIG_DNS_RESOLVER consult - Documentation/networking/dns_resolver.txt + Documentation/networking/dns_resolver.rst If unsure, say N. diff --git a/net/dns_resolver/Kconfig b/net/dns_resolver/Kconfig index 0a1c2238b4bd..255df9b6e9e8 100644 --- a/net/dns_resolver/Kconfig +++ b/net/dns_resolver/Kconfig @@ -19,7 +19,7 @@ config DNS_RESOLVER SMB2 later. DNS Resolver is supported by the userspace upcall helper "/sbin/dns.resolver" via /etc/request-key.conf. - See for further + See for further information. 
To compile this as a module, choose M here: the module will be called diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c index ad53eb31d40f..3aced951d5ab 100644 --- a/net/dns_resolver/dns_key.c +++ b/net/dns_resolver/dns_key.c @@ -1,6 +1,6 @@ /* Key type used to cache DNS lookups made by the kernel * - * See Documentation/networking/dns_resolver.txt + * See Documentation/networking/dns_resolver.rst * * Copyright (c) 2007 Igor Mammedov * Author(s): Igor Mammedov (niallain@gmail.com) diff --git a/net/dns_resolver/dns_query.c b/net/dns_resolver/dns_query.c index cab4e0df924f..82b084cc1cc6 100644 --- a/net/dns_resolver/dns_query.c +++ b/net/dns_resolver/dns_query.c @@ -1,7 +1,7 @@ /* Upcall routine, designed to work as a key type and working through * /sbin/request-key to contact userspace when handling DNS queries. * - * See Documentation/networking/dns_resolver.txt + * See Documentation/networking/dns_resolver.rst * * Copyright (c) 2007 Igor Mammedov * Author(s): Igor Mammedov (niallain@gmail.com) -- cgit From 28d23311ff35ac97ff20608f47c84c95d6389c33 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:33 +0200 Subject: docs: networking: convert driver.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/driver.rst | 97 +++++++++++++++++++++++++++++++++++++ Documentation/networking/driver.txt | 93 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 98 insertions(+), 93 deletions(-) create mode 100644 Documentation/networking/driver.rst delete mode 100644 Documentation/networking/driver.txt diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst new file mode 100644 index 000000000000..c8f59dbda46f --- /dev/null +++ b/Documentation/networking/driver.rst @@ -0,0 +1,97 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Softnet Driver Issues +===================== + +Transmit path guidelines: + +1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under + any normal circumstances. It is considered a hard error unless + there is no way your device can tell ahead of time when it's + transmit function will become busy. + + Instead it must maintain the queue properly. For example, + for a driver implementing scatter-gather this means:: + + static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, + struct net_device *dev) + { + struct drv *dp = netdev_priv(dev); + + lock_tx(dp); + ... + /* This is a hard error log it. */ + if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { + netif_stop_queue(dev); + unlock_tx(dp); + printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", + dev->name); + return NETDEV_TX_BUSY; + } + + ... queue packet to card ... + ... update tx consumer index ... + + if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) + netif_stop_queue(dev); + + ... + unlock_tx(dp); + ... + return NETDEV_TX_OK; + } + + And then at the end of your TX reclamation event handling:: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) + netif_wake_queue(dp->dev); + + For a non-scatter-gather supporting card, the three tests simply become:: + + /* This is a hard error log it. 
*/ + if (TX_BUFFS_AVAIL(dp) <= 0) + + and:: + + if (TX_BUFFS_AVAIL(dp) == 0) + + and:: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > 0) + netif_wake_queue(dp->dev); + +2) An ndo_start_xmit method must not modify the shared parts of a + cloned SKB. + +3) Do not forget that once you return NETDEV_TX_OK from your + ndo_start_xmit method, it is your driver's responsibility to free + up the SKB and in some finite amount of time. + + For example, this means that it is not allowed for your TX + mitigation scheme to let TX packets "hang out" in the TX + ring unreclaimed forever if no new TX packets are sent. + This error can deadlock sockets waiting for send buffer room + to be freed up. + + If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you + must not keep any reference to that SKB and you must not attempt + to free it up. + +Probing guidelines: + +1) Any hardware layer address you obtain for your device should + be verified. For example, for ethernet check it with + linux/etherdevice.h:is_valid_ether_addr() + +Close/stop guidelines: + +1) After the ndo_stop routine has been called, the hardware must + not receive or transmit any data. All in flight packets must + be aborted. If necessary, poll or wait for completion of + any reset commands. + +2) The ndo_stop routine will be called by unregister_netdevice + if device is still UP. diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt deleted file mode 100644 index da59e2884130..000000000000 --- a/Documentation/networking/driver.txt +++ /dev/null @@ -1,93 +0,0 @@ -Document about softnet driver issues - -Transmit path guidelines: - -1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under - any normal circumstances. It is considered a hard error unless - there is no way your device can tell ahead of time when it's - transmit function will become busy. - - Instead it must maintain the queue properly. 
For example, - for a driver implementing scatter-gather this means: - - static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, - struct net_device *dev) - { - struct drv *dp = netdev_priv(dev); - - lock_tx(dp); - ... - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { - netif_stop_queue(dev); - unlock_tx(dp); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); - return NETDEV_TX_BUSY; - } - - ... queue packet to card ... - ... update tx consumer index ... - - if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) - netif_stop_queue(dev); - - ... - unlock_tx(dp); - ... - return NETDEV_TX_OK; - } - - And then at the end of your TX reclamation event handling: - - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) - netif_wake_queue(dp->dev); - - For a non-scatter-gather supporting card, the three tests simply become: - - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= 0) - - and: - - if (TX_BUFFS_AVAIL(dp) == 0) - - and: - - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > 0) - netif_wake_queue(dp->dev); - -2) An ndo_start_xmit method must not modify the shared parts of a - cloned SKB. - -3) Do not forget that once you return NETDEV_TX_OK from your - ndo_start_xmit method, it is your driver's responsibility to free - up the SKB and in some finite amount of time. - - For example, this means that it is not allowed for your TX - mitigation scheme to let TX packets "hang out" in the TX - ring unreclaimed forever if no new TX packets are sent. - This error can deadlock sockets waiting for send buffer room - to be freed up. - - If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you - must not keep any reference to that SKB and you must not attempt - to free it up. - -Probing guidelines: - -1) Any hardware layer address you obtain for your device should - be verified. 
For example, for ethernet check it with - linux/etherdevice.h:is_valid_ether_addr() - -Close/stop guidelines: - -1) After the ndo_stop routine has been called, the hardware must - not receive or transmit any data. All in flight packets must - be aborted. If necessary, poll or wait for completion of - any reset commands. - -2) The ndo_stop routine will be called by unregister_netdevice - if device is still UP. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 55746038a7e9..313f66900bce 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -53,6 +53,7 @@ Contents: decnet defza dns_resolver + driver .. only:: subproject and html -- cgit From 06df65723b69a3d4691737503654400c35f9ca5a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:34 +0200 Subject: docs: networking: convert eql.txt to ReST - add SPDX header; - add a document title; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/eql.rst | 373 ++++++++++++++++++++++++++ Documentation/networking/eql.txt | 528 ------------------------------------- Documentation/networking/index.rst | 1 + drivers/net/Kconfig | 2 +- 4 files changed, 375 insertions(+), 529 deletions(-) create mode 100644 Documentation/networking/eql.rst delete mode 100644 Documentation/networking/eql.txt diff --git a/Documentation/networking/eql.rst b/Documentation/networking/eql.rst new file mode 100644 index 000000000000..a628c4c81166 --- /dev/null +++ b/Documentation/networking/eql.rst @@ -0,0 +1,373 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +========================================== +EQL Driver: Serial IP Load Balancing HOWTO +========================================== + + Simon "Guru Aleph-Null" Janes, simon@ncm.com + + v1.1, February 27, 1995 + + This is the manual for the EQL device driver. EQL is a software device + that lets you load-balance IP serial links (SLIP or uncompressed PPP) + to increase your bandwidth. It will not reduce your latency (i.e. ping + times) except in the case where you already have lots of traffic on + your link, in which it will help them out. This driver has been tested + with the 1.1.75 kernel, and is known to have patched cleanly with + 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch + which was only created to patch cleanly in the very latest kernel + source trees. (Yes, it worked fine.) + +1. Introduction +=============== + + Which is worse? A huge fee for a 56K leased line or two phone lines? + It's probably the former. If you find yourself craving more bandwidth, + and have a ISP that is flexible, it is now possible to bind modems + together to work as one point-to-point link to increase your + bandwidth. All without having to have a special black box on either + side. + + + The eql driver has only been tested with the Livingston PortMaster-2e + terminal server. I do not know if other terminal servers support load- + balancing, but I do know that the PortMaster does it, and does it + almost as well as the eql driver seems to do it (-- Unfortunately, in + my testing so far, the Livingston PortMaster 2e's load-balancing is a + good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps + and 14.4 Kbps connection. However, I am not sure that it really is + the PortMaster, or if it's Linux's TCP drivers. 
I'm told that Linux's + TCP implementation is pretty fast though.--) + + + I suggest to ISPs out there that it would probably be fair to charge + a load-balancing client 75% of the cost of the second line and 50% of + the cost of the third line etc... + + + Hey, we can all dream you know... + + +2. Kernel Configuration +======================= + + Here I describe the general steps of getting a kernel up and working + with the eql driver. From patching, building, to installing. + + +2.1. Patching The Kernel +------------------------ + + If you do not have or cannot get a copy of the kernel with the eql + driver folded into it, get your copy of the driver from + ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz. + Unpack this archive someplace obvious like /usr/local/src/. It will + create the following files:: + + -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY + -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch + -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave + -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c + + Unpack a recent kernel (something after 1.1.92) someplace convenient + like say /usr/src/linux-1.1.92.eql. Use symbolic links to point + /usr/src/linux to this development directory. + + + Apply the patch by running the commands:: + + cd /usr/src + patch + ". Here are some example enslavings:: + + eql_enslave eql sl0 28800 + eql_enslave eql ppp0 14400 + eql_enslave eql sl1 57600 + + When you want to free a device from its life of slavery, you can + either down the device with ifconfig (eql will automatically bury the + dead slave and remove it from its queue) or use eql_emancipate to free + it. (-- Or just ifconfig it down, and the eql driver will take it out + for you.--):: + + eql_emancipate eql sl0 + eql_emancipate eql ppp0 + eql_emancipate eql sl1 + + +3.3. 
DSLIP Configuration for the eql Device +------------------------------------------- + + The general idea is to bring up and keep up as many SLIP connections + as you need, automatically. + + +3.3.1. /etc/slip/runslip.conf +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + Here is an example runslip.conf:: + + name sl-line-1 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua2-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua2 + + name sl-line-2 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua3-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua3 + + +3.4. Using PPP and the eql Device +--------------------------------- + + I have not yet done any load-balancing testing for PPP devices, mainly + because I don't have a PPP-connection manager like SLIP has with + DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance: + make sure you have asyncmap set to something so that control + characters are not escaped. + + + I tried to fix up a PPP script/system for redialing lost PPP + connections for use with the eql driver the weekend of Feb 25-26 '95 + (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this + year. + + +4. About the Slave Scheduler Algorithm +====================================== + + The slave scheduler probably could be replaced with a dozen other + things and push traffic much faster. The formula in the current set + up of the driver was tuned to handle slaves with wildly different + bits-per-second "priorities". + + + All testing I have done was with two 28.8 V.FC modems, one connecting + at 28800 bps or slower, and the other connecting at 14400 bps all the + time. + + + One version of the scheduler was able to push 5.3 K/s through the + 28800 and 14400 connections, but when the priorities on the links were + very wide apart (57600 vs. 14400) the "faster" modem received all + traffic and the "slower" modem starved. + + +5. 
Testers' Reports +=================== + + Some people have experimented with the eql device with newer + kernels (than 1.1.75). I have since updated the driver to patch + cleanly in newer kernels because of the removal of the old "slave- + balancing" driver config option. + + + - icee from LinuxNET patched 1.1.86 without any rejects and was able + to boot the kernel and enslave a couple of ISDN PPP links. + +5.1. Randolph Bentson's Test Report +----------------------------------- + + :: + + From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995 + Date: Tue, 7 Feb 95 22:57 PST + From: Randolph Bentson + To: guru@ncm.com + Subject: EQL driver tests + + + I have been checking out your eql driver. (Nice work, that!) + Although you may already done this performance testing, here + are some data I've discovered. + + Randolph Bentson + bentson@grieg.seaslug.org + +------------------------------------------------------------------ + + + A pseudo-device driver, EQL, written by Simon Janes, can be used + to bundle multiple SLIP connections into what appears to be a + single connection. This allows one to improve dial-up network + connectivity gradually, without having to buy expensive DSU/CSU + hardware and services. + + I have done some testing of this software, with two goals in + mind: first, to ensure it actually works as described and + second, as a method of exercising my device driver. + + The following performance measurements were derived from a set + of SLIP connections run between two Linux systems (1.1.84) using + a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y. + (Ports 0,1,2,3 were used. A later configuration will distribute + port selection across the different Cirrus chips on the boards.) + Once a link was established, I timed a binary ftp transfer of + 289284 bytes of data. If there were no overhead (packet headers, + inter-character and inter-packet delays, etc.) 
the transfers + would take the following times:: + + bits/sec seconds + 345600 8.3 + 234600 12.3 + 172800 16.7 + 153600 18.8 + 76800 37.6 + 57600 50.2 + 38400 75.3 + 28800 100.4 + 19200 150.6 + 9600 301.3 + + A single line running at the lower speeds and with large packets + comes to within 2% of this. Performance is limited for the higher + speeds (as predicted by the Cirrus databook) to an aggregate of + about 160 kbits/sec. The next round of testing will distribute + the load across two or more Cirrus chips. + + The good news is that one gets nearly the full advantage of the + second, third, and fourth line's bandwidth. (The bad news is + that the connection establishment seemed fragile for the higher + speeds. Once established, the connection seemed robust enough.) + + ====== ======== === ======== ======= ======= === + #lines speed mtu seconds theory actual %of + kbit/sec duration speed speed max + ====== ======== === ======== ======= ======= === + 3 115200 900 _ 345600 + 3 115200 400 18.1 345600 159825 46 + 2 115200 900 _ 230400 + 2 115200 600 18.1 230400 159825 69 + 2 115200 400 19.3 230400 149888 65 + 4 57600 900 _ 234600 + 4 57600 600 _ 234600 + 4 57600 400 _ 234600 + 3 57600 600 20.9 172800 138413 80 + 3 57600 900 21.2 172800 136455 78 + 3 115200 600 21.7 345600 133311 38 + 3 57600 400 22.5 172800 128571 74 + 4 38400 900 25.2 153600 114795 74 + 4 38400 600 26.4 153600 109577 71 + 4 38400 400 27.3 153600 105965 68 + 2 57600 900 29.1 115200 99410.3 86 + 1 115200 900 30.7 115200 94229.3 81 + 2 57600 600 30.2 115200 95789.4 83 + 3 38400 900 30.3 115200 95473.3 82 + 3 38400 600 31.2 115200 92719.2 80 + 1 115200 600 31.3 115200 92423 80 + 2 57600 400 32.3 115200 89561.6 77 + 1 115200 400 32.8 115200 88196.3 76 + 3 38400 400 33.5 115200 86353.4 74 + 2 38400 900 43.7 76800 66197.7 86 + 2 38400 600 44 76800 65746.4 85 + 2 38400 400 47.2 76800 61289 79 + 4 19200 900 50.8 76800 56945.7 74 + 4 19200 400 53.2 76800 54376.7 70 + 4 19200 600 53.7 76800 53870.4 70 + 1 
57600 900 54.6 57600 52982.4 91 + 1 57600 600 56.2 57600 51474 89 + 3 19200 900 60.5 57600 47815.5 83 + 1 57600 400 60.2 57600 48053.8 83 + 3 19200 600 62 57600 46658.7 81 + 3 19200 400 64.7 57600 44711.6 77 + 1 38400 900 79.4 38400 36433.8 94 + 1 38400 600 82.4 38400 35107.3 91 + 2 19200 900 84.4 38400 34275.4 89 + 1 38400 400 86.8 38400 33327.6 86 + 2 19200 600 87.6 38400 33023.3 85 + 2 19200 400 91.2 38400 31719.7 82 + 4 9600 900 94.7 38400 30547.4 79 + 4 9600 400 106 38400 27290.9 71 + 4 9600 600 110 38400 26298.5 68 + 3 9600 900 118 28800 24515.6 85 + 3 9600 600 120 28800 24107 83 + 3 9600 400 131 28800 22082.7 76 + 1 19200 900 155 19200 18663.5 97 + 1 19200 600 161 19200 17968 93 + 1 19200 400 170 19200 17016.7 88 + 2 9600 600 176 19200 16436.6 85 + 2 9600 900 180 19200 16071.3 83 + 2 9600 400 181 19200 15982.5 83 + 1 9600 900 305 9600 9484.72 98 + 1 9600 600 314 9600 9212.87 95 + 1 9600 400 332 9600 8713.37 90 + ====== ======== === ======== ======= ======= === + +5.2. Anthony Healy's Report +--------------------------- + + :: + + Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST) + From: Antony Healey + To: Simon Janes + Subject: Re: Load Balancing + + Hi Simon, + I've installed your patch and it works great. I have trialed + it over twin SL/IP lines, just over null modems, but I was + able to data at over 48Kb/s [ISDN link -Simon]. I managed a + transfer of up to 7.5 Kbyte/s on one go, but averaged around + 6.4 Kbyte/s, which I think is pretty cool. :) diff --git a/Documentation/networking/eql.txt b/Documentation/networking/eql.txt deleted file mode 100644 index 0f1550150f05..000000000000 --- a/Documentation/networking/eql.txt +++ /dev/null @@ -1,528 +0,0 @@ - EQL Driver: Serial IP Load Balancing HOWTO - Simon "Guru Aleph-Null" Janes, simon@ncm.com - v1.1, February 27, 1995 - - This is the manual for the EQL device driver. EQL is a software device - that lets you load-balance IP serial links (SLIP or uncompressed PPP) - to increase your bandwidth. 
It will not reduce your latency (i.e. ping - times) except in the case where you already have lots of traffic on - your link, in which it will help them out. This driver has been tested - with the 1.1.75 kernel, and is known to have patched cleanly with - 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch - which was only created to patch cleanly in the very latest kernel - source trees. (Yes, it worked fine.) - - 1. Introduction - - Which is worse? A huge fee for a 56K leased line or two phone lines? - It's probably the former. If you find yourself craving more bandwidth, - and have a ISP that is flexible, it is now possible to bind modems - together to work as one point-to-point link to increase your - bandwidth. All without having to have a special black box on either - side. - - - The eql driver has only been tested with the Livingston PortMaster-2e - terminal server. I do not know if other terminal servers support load- - balancing, but I do know that the PortMaster does it, and does it - almost as well as the eql driver seems to do it (-- Unfortunately, in - my testing so far, the Livingston PortMaster 2e's load-balancing is a - good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps - and 14.4 Kbps connection. However, I am not sure that it really is - the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's - TCP implementation is pretty fast though.--) - - - I suggest to ISPs out there that it would probably be fair to charge - a load-balancing client 75% of the cost of the second line and 50% of - the cost of the third line etc... - - - Hey, we can all dream you know... - - - 2. Kernel Configuration - - Here I describe the general steps of getting a kernel up and working - with the eql driver. From patching, building, to installing. - - - 2.1. 
Patching The Kernel - - If you do not have or cannot get a copy of the kernel with the eql - driver folded into it, get your copy of the driver from - ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz. - Unpack this archive someplace obvious like /usr/local/src/. It will - create the following files: - - - - ______________________________________________________________________ - -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY - -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch - -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave - -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c - ______________________________________________________________________ - - Unpack a recent kernel (something after 1.1.92) someplace convenient - like say /usr/src/linux-1.1.92.eql. Use symbolic links to point - /usr/src/linux to this development directory. - - - Apply the patch by running the commands: - - - ______________________________________________________________________ - cd /usr/src - patch - ". Here are some example enslavings: - - - - ______________________________________________________________________ - eql_enslave eql sl0 28800 - eql_enslave eql ppp0 14400 - eql_enslave eql sl1 57600 - ______________________________________________________________________ - - - - - - When you want to free a device from its life of slavery, you can - either down the device with ifconfig (eql will automatically bury the - dead slave and remove it from its queue) or use eql_emancipate to free - it. (-- Or just ifconfig it down, and the eql driver will take it out - for you.--) - - - - ______________________________________________________________________ - eql_emancipate eql sl0 - eql_emancipate eql ppp0 - eql_emancipate eql sl1 - ______________________________________________________________________ - - - - - - 3.3. 
DSLIP Configuration for the eql Device - - The general idea is to bring up and keep up as many SLIP connections - as you need, automatically. - - - 3.3.1. /etc/slip/runslip.conf - - Here is an example runslip.conf: - - - - - - - - - - - - - - - - ______________________________________________________________________ - name sl-line-1 - enabled - baud 38400 - mtu 576 - ducmd -e /etc/slip/dialout/cua2-288.xp -t 9 - command eql_enslave eql $interface 28800 - address 198.67.33.239 - line /dev/cua2 - - name sl-line-2 - enabled - baud 38400 - mtu 576 - ducmd -e /etc/slip/dialout/cua3-288.xp -t 9 - command eql_enslave eql $interface 28800 - address 198.67.33.239 - line /dev/cua3 - ______________________________________________________________________ - - - - - - 3.4. Using PPP and the eql Device - - I have not yet done any load-balancing testing for PPP devices, mainly - because I don't have a PPP-connection manager like SLIP has with - DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance: - make sure you have asyncmap set to something so that control - characters are not escaped. - - - I tried to fix up a PPP script/system for redialing lost PPP - connections for use with the eql driver the weekend of Feb 25-26 '95 - (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this - year. - - - 4. About the Slave Scheduler Algorithm - - The slave scheduler probably could be replaced with a dozen other - things and push traffic much faster. The formula in the current set - up of the driver was tuned to handle slaves with wildly different - bits-per-second "priorities". - - - All testing I have done was with two 28.8 V.FC modems, one connecting - at 28800 bps or slower, and the other connecting at 14400 bps all the - time. - - - One version of the scheduler was able to push 5.3 K/s through the - 28800 and 14400 connections, but when the priorities on the links were - very wide apart (57600 vs. 
14400) the "faster" modem received all - traffic and the "slower" modem starved. - - - 5. Testers' Reports - - Some people have experimented with the eql device with newer - kernels (than 1.1.75). I have since updated the driver to patch - cleanly in newer kernels because of the removal of the old "slave- - balancing" driver config option. - - - o icee from LinuxNET patched 1.1.86 without any rejects and was able - to boot the kernel and enslave a couple of ISDN PPP links. - - 5.1. Randolph Bentson's Test Report - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995 - Date: Tue, 7 Feb 95 22:57 PST - From: Randolph Bentson - To: guru@ncm.com - Subject: EQL driver tests - - - I have been checking out your eql driver. (Nice work, that!) - Although you may already done this performance testing, here - are some data I've discovered. - - Randolph Bentson - bentson@grieg.seaslug.org - - --------------------------------------------------------- - - - A pseudo-device driver, EQL, written by Simon Janes, can be used - to bundle multiple SLIP connections into what appears to be a - single connection. This allows one to improve dial-up network - connectivity gradually, without having to buy expensive DSU/CSU - hardware and services. - - I have done some testing of this software, with two goals in - mind: first, to ensure it actually works as described and - second, as a method of exercising my device driver. - - The following performance measurements were derived from a set - of SLIP connections run between two Linux systems (1.1.84) using - a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y. - (Ports 0,1,2,3 were used. A later configuration will distribute - port selection across the different Cirrus chips on the boards.) - Once a link was established, I timed a binary ftp transfer of - 289284 bytes of data. 
If there were no overhead (packet headers, - inter-character and inter-packet delays, etc.) the transfers - would take the following times: - - bits/sec seconds - 345600 8.3 - 234600 12.3 - 172800 16.7 - 153600 18.8 - 76800 37.6 - 57600 50.2 - 38400 75.3 - 28800 100.4 - 19200 150.6 - 9600 301.3 - - A single line running at the lower speeds and with large packets - comes to within 2% of this. Performance is limited for the higher - speeds (as predicted by the Cirrus databook) to an aggregate of - about 160 kbits/sec. The next round of testing will distribute - the load across two or more Cirrus chips. - - The good news is that one gets nearly the full advantage of the - second, third, and fourth line's bandwidth. (The bad news is - that the connection establishment seemed fragile for the higher - speeds. Once established, the connection seemed robust enough.) - - #lines speed mtu seconds theory actual %of - kbit/sec duration speed speed max - 3 115200 900 _ 345600 - 3 115200 400 18.1 345600 159825 46 - 2 115200 900 _ 230400 - 2 115200 600 18.1 230400 159825 69 - 2 115200 400 19.3 230400 149888 65 - 4 57600 900 _ 234600 - 4 57600 600 _ 234600 - 4 57600 400 _ 234600 - 3 57600 600 20.9 172800 138413 80 - 3 57600 900 21.2 172800 136455 78 - 3 115200 600 21.7 345600 133311 38 - 3 57600 400 22.5 172800 128571 74 - 4 38400 900 25.2 153600 114795 74 - 4 38400 600 26.4 153600 109577 71 - 4 38400 400 27.3 153600 105965 68 - 2 57600 900 29.1 115200 99410.3 86 - 1 115200 900 30.7 115200 94229.3 81 - 2 57600 600 30.2 115200 95789.4 83 - 3 38400 900 30.3 115200 95473.3 82 - 3 38400 600 31.2 115200 92719.2 80 - 1 115200 600 31.3 115200 92423 80 - 2 57600 400 32.3 115200 89561.6 77 - 1 115200 400 32.8 115200 88196.3 76 - 3 38400 400 33.5 115200 86353.4 74 - 2 38400 900 43.7 76800 66197.7 86 - 2 38400 600 44 76800 65746.4 85 - 2 38400 400 47.2 76800 61289 79 - 4 19200 900 50.8 76800 56945.7 74 - 4 19200 400 53.2 76800 54376.7 70 - 4 19200 600 53.7 76800 53870.4 70 - 1 57600 900 54.6 
57600 52982.4 91 - 1 57600 600 56.2 57600 51474 89 - 3 19200 900 60.5 57600 47815.5 83 - 1 57600 400 60.2 57600 48053.8 83 - 3 19200 600 62 57600 46658.7 81 - 3 19200 400 64.7 57600 44711.6 77 - 1 38400 900 79.4 38400 36433.8 94 - 1 38400 600 82.4 38400 35107.3 91 - 2 19200 900 84.4 38400 34275.4 89 - 1 38400 400 86.8 38400 33327.6 86 - 2 19200 600 87.6 38400 33023.3 85 - 2 19200 400 91.2 38400 31719.7 82 - 4 9600 900 94.7 38400 30547.4 79 - 4 9600 400 106 38400 27290.9 71 - 4 9600 600 110 38400 26298.5 68 - 3 9600 900 118 28800 24515.6 85 - 3 9600 600 120 28800 24107 83 - 3 9600 400 131 28800 22082.7 76 - 1 19200 900 155 19200 18663.5 97 - 1 19200 600 161 19200 17968 93 - 1 19200 400 170 19200 17016.7 88 - 2 9600 600 176 19200 16436.6 85 - 2 9600 900 180 19200 16071.3 83 - 2 9600 400 181 19200 15982.5 83 - 1 9600 900 305 9600 9484.72 98 - 1 9600 600 314 9600 9212.87 95 - 1 9600 400 332 9600 8713.37 90 - - - - - - 5.2. Anthony Healy's Report - - - - - - - - Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST) - From: Antony Healey - To: Simon Janes - Subject: Re: Load Balancing - - Hi Simon, - I've installed your patch and it works great. I have trialed - it over twin SL/IP lines, just over null modems, but I was - able to data at over 48Kb/s [ISDN link -Simon]. I managed a - transfer of up to 7.5 Kbyte/s on one go, but averaged around - 6.4 Kbyte/s, which I think is pretty cool. :) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 313f66900bce..9ef6ef42bdc5 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -54,6 +54,7 @@ Contents: defza dns_resolver driver + eql .. 
only:: subproject and html diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 4ab6d343fd86..c822f4a6d166 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -126,7 +126,7 @@ config EQUALIZER Linux driver or with a Livingston Portmaster 2e. Say Y if you want this and read - . You may also want to read + . You may also want to read section 6.2 of the NET-3-HOWTO, available from . -- cgit From aee113427c5d205730b2c1a023661799f41aca23 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:35 +0200 Subject: docs: networking: convert fib_trie.txt to ReST - add SPDX header; - adjust title markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/fib_trie.rst | 149 ++++++++++++++++++++++++++++++++++ Documentation/networking/fib_trie.txt | 145 --------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 150 insertions(+), 145 deletions(-) create mode 100644 Documentation/networking/fib_trie.rst delete mode 100644 Documentation/networking/fib_trie.txt diff --git a/Documentation/networking/fib_trie.rst b/Documentation/networking/fib_trie.rst new file mode 100644 index 000000000000..f1435b7fcdb7 --- /dev/null +++ b/Documentation/networking/fib_trie.rst @@ -0,0 +1,149 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +LC-trie implementation notes +============================ + +Node types +---------- +leaf + An end node with data. This has a copy of the relevant key, along + with 'hlist' with routing table entries sorted by prefix length. + See struct leaf and struct leaf_info. + +trie node or tnode + An internal node, holding an array of child (leaf or tnode) pointers, + indexed through a subset of the key. See Level Compression. 
+ +A few concepts explained +------------------------ +Bits (tnode) + The number of bits in the key segment used for indexing into the + child array - the "child index". See Level Compression. + +Pos (tnode) + The position (in the key) of the key segment used for indexing into + the child array. See Path Compression. + +Path Compression / skipped bits + Any given tnode is linked to from the child array of its parent, using + a segment of the key specified by the parent's "pos" and "bits" + In certain cases, this tnode's own "pos" will not be immediately + adjacent to the parent (pos+bits), but there will be some bits + in the key skipped over because they represent a single path with no + deviations. These "skipped bits" constitute Path Compression. + Note that the search algorithm will simply skip over these bits when + searching, making it necessary to save the keys in the leaves to + verify that they actually do match the key we are searching for. + +Level Compression / child arrays + the trie is kept level balanced moving, under certain conditions, the + children of a full child (see "full_children") up one level, so that + instead of a pure binary tree, each internal node ("tnode") may + contain an arbitrarily large array of links to several children. + Conversely, a tnode with a mostly empty child array (see empty_children) + may be "halved", having some of its children moved downwards one level, + in order to avoid ever-increasing child arrays. + +empty_children + the number of positions in the child array of a given tnode that are + NULL. + +full_children + the number of children of a given tnode that aren't path compressed. + (in other words, they aren't NULL or leaves and their "pos" is equal + to this tnode's "pos"+"bits"). + + (The word "full" here is used more in the sense of "complete" than + as the opposite of "empty", which might be a tad confusing.) 
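Editor's illustration (not part of the patch): the "pos" and "bits" concepts above can be sketched in C as follows. This is a minimal sketch under stated assumptions: keys are 32 bits wide, "pos" is counted from the most significant end of the key, and the names KEYLENGTH, child_index() and skipped_bits() are hypothetical helpers chosen for this example rather than the kernel's own functions.

```c
#include <stdint.h>

#define KEYLENGTH 32	/* assumption: IPv4-sized keys, 32 bits wide */

/*
 * Hypothetical helper: return the child index of a tnode, i.e. the
 * 'bits' wide key segment that starts 'pos' bits from the most
 * significant end of the key. Returns 0 for an empty or out-of-range
 * segment so that no shift count ever reaches the key width.
 */
static inline uint32_t child_index(uint32_t key, unsigned int pos,
				   unsigned int bits)
{
	if (bits == 0 || pos + bits > KEYLENGTH)
		return 0;
	/* Drop the bits above the segment, then right-align the segment. */
	return (uint32_t)(key << pos) >> (KEYLENGTH - bits);
}

/*
 * Hypothetical helper: number of key bits skipped by path compression
 * between a parent tnode and one of its child tnodes. Assumes
 * child_pos >= parent_pos + parent_bits.
 */
static inline unsigned int skipped_bits(unsigned int parent_pos,
					unsigned int parent_bits,
					unsigned int child_pos)
{
	return child_pos - (parent_pos + parent_bits);
}
```

With the key 0xC0A80100 (192.168.1.0), a root tnode with pos 0 and bits 8 would index its child array with 0xC0. A child tnode whose pos is 12 would then have skipped 4 bits, which is why the leaf has to keep a full copy of the key for the final comparison.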
+ +Comments +--------- + +We have tried to keep the structure of the code as close to fib_hash as +possible to allow verification and help up reviewing. + +fib_find_node() + A good start for understanding this code. This function implements a + straightforward trie lookup. + +fib_insert_node() + Inserts a new leaf node in the trie. This is bit more complicated than + fib_find_node(). Inserting a new node means we might have to run the + level compression algorithm on part of the trie. + +trie_leaf_remove() + Looks up a key, deletes it and runs the level compression algorithm. + +trie_rebalance() + The key function for the dynamic trie after any change in the trie + it is run to optimize and reorganize. It will walk the trie upwards + towards the root from a given tnode, doing a resize() at each step + to implement level compression. + +resize() + Analyzes a tnode and optimizes the child array size by either inflating + or shrinking it repeatedly until it fulfills the criteria for optimal + level compression. This part follows the original paper pretty closely + and there may be some room for experimentation here. + +inflate() + Doubles the size of the child array within a tnode. Used by resize(). + +halve() + Halves the size of the child array within a tnode - the inverse of + inflate(). Used by resize(); + +fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() + The route manipulation functions. Should conform pretty closely to the + corresponding functions in fib_hash. + +fn_trie_flush() + This walks the full trie (using nextleaf()) and searches for empty + leaves which have to be removed. + +fn_trie_dump() + Dumps the routing table ordered by prefix length. This is somewhat + slower than the corresponding fib_hash function, as we have to walk the + entire trie for each prefix length. In comparison, fib_hash is organized + as one "zone"/hash per prefix length. 
+ +Locking +------- + +fib_lock is used for an RW-lock in the same way that this is done in fib_hash. +However, the functions are somewhat separated for other possible locking +scenarios. It might conceivably be possible to run trie_rebalance via RCU +to avoid read_lock in the fn_trie_lookup() function. + +Main lookup mechanism +--------------------- +fn_trie_lookup() is the main lookup function. + +The lookup is in its simplest form just like fib_find_node(). We descend the +trie, key segment by key segment, until we find a leaf. check_leaf() does +the fib_semantic_match in the leaf's sorted prefix hlist. + +If we find a match, we are done. + +If we don't find a match, we enter prefix matching mode. The prefix length, +starting out at the same as the key length, is reduced one step at a time, +and we backtrack upwards through the trie trying to find a longest matching +prefix. The goal is always to reach a leaf and get a positive result from the +fib_semantic_match mechanism. + +Inside each tnode, the search for longest matching prefix consists of searching +through the child array, chopping off (zeroing) the least significant "1" of +the child index until we find a match or the child index consists of nothing but +zeros. + +At this point we backtrack (t->stats.backtrack++) up the trie, continuing to +chop off part of the key in order to find the longest matching prefix. + +At this point we will repeatedly descend subtries to look for a match, and there +are some optimizations available that can provide us with "shortcuts" to avoid +descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. + +To alleviate any doubts about the correctness of the route selection process, +a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which +gives userland access to fib_lookup(). 
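Editor's illustration (not part of the patch): the step "chopping off (zeroing) the least significant 1 of the child index" described in the lookup text above can be sketched in C. This is a minimal sketch, not the kernel's code; chop_lowest_one() and chop_sequence() are hypothetical names, and the sketch only lists the candidate indices without doing any trie access.

```c
#include <stdint.h>

/* Hypothetical helper: zero the least significant set bit of an index. */
static inline uint32_t chop_lowest_one(uint32_t cindex)
{
	return cindex & (cindex - 1);
}

/*
 * Hypothetical helper: record the child indices that would be tried in
 * order, starting from 'cindex' and ending with index 0. At most 'max'
 * entries are written to 'out'; the number written is returned.
 */
static unsigned int chop_sequence(uint32_t cindex, uint32_t *out,
				  unsigned int max)
{
	unsigned int n = 0;

	while (n < max) {
		out[n++] = cindex;
		if (cindex == 0)
			break;
		cindex = chop_lowest_one(cindex);
	}
	return n;
}
```

Starting from the binary index 1011, the candidates are 1011, 1010, 1000 and finally 0000. Each step shortens the prefix being matched, which corresponds to the description of the search ending when a match is found or the index is all zeros.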
diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt deleted file mode 100644 index fe719388518b..000000000000 --- a/Documentation/networking/fib_trie.txt +++ /dev/null @@ -1,145 +0,0 @@ - LC-trie implementation notes. - -Node types ----------- -leaf - An end node with data. This has a copy of the relevant key, along - with 'hlist' with routing table entries sorted by prefix length. - See struct leaf and struct leaf_info. - -trie node or tnode - An internal node, holding an array of child (leaf or tnode) pointers, - indexed through a subset of the key. See Level Compression. - -A few concepts explained ------------------------- -Bits (tnode) - The number of bits in the key segment used for indexing into the - child array - the "child index". See Level Compression. - -Pos (tnode) - The position (in the key) of the key segment used for indexing into - the child array. See Path Compression. - -Path Compression / skipped bits - Any given tnode is linked to from the child array of its parent, using - a segment of the key specified by the parent's "pos" and "bits" - In certain cases, this tnode's own "pos" will not be immediately - adjacent to the parent (pos+bits), but there will be some bits - in the key skipped over because they represent a single path with no - deviations. These "skipped bits" constitute Path Compression. - Note that the search algorithm will simply skip over these bits when - searching, making it necessary to save the keys in the leaves to - verify that they actually do match the key we are searching for. - -Level Compression / child arrays - the trie is kept level balanced moving, under certain conditions, the - children of a full child (see "full_children") up one level, so that - instead of a pure binary tree, each internal node ("tnode") may - contain an arbitrarily large array of links to several children. 
- Conversely, a tnode with a mostly empty child array (see empty_children) - may be "halved", having some of its children moved downwards one level, - in order to avoid ever-increasing child arrays. - -empty_children - the number of positions in the child array of a given tnode that are - NULL. - -full_children - the number of children of a given tnode that aren't path compressed. - (in other words, they aren't NULL or leaves and their "pos" is equal - to this tnode's "pos"+"bits"). - - (The word "full" here is used more in the sense of "complete" than - as the opposite of "empty", which might be a tad confusing.) - -Comments ---------- - -We have tried to keep the structure of the code as close to fib_hash as -possible to allow verification and help up reviewing. - -fib_find_node() - A good start for understanding this code. This function implements a - straightforward trie lookup. - -fib_insert_node() - Inserts a new leaf node in the trie. This is bit more complicated than - fib_find_node(). Inserting a new node means we might have to run the - level compression algorithm on part of the trie. - -trie_leaf_remove() - Looks up a key, deletes it and runs the level compression algorithm. - -trie_rebalance() - The key function for the dynamic trie after any change in the trie - it is run to optimize and reorganize. It will walk the trie upwards - towards the root from a given tnode, doing a resize() at each step - to implement level compression. - -resize() - Analyzes a tnode and optimizes the child array size by either inflating - or shrinking it repeatedly until it fulfills the criteria for optimal - level compression. This part follows the original paper pretty closely - and there may be some room for experimentation here. - -inflate() - Doubles the size of the child array within a tnode. Used by resize(). - -halve() - Halves the size of the child array within a tnode - the inverse of - inflate(). 
Used by resize(); - -fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() - The route manipulation functions. Should conform pretty closely to the - corresponding functions in fib_hash. - -fn_trie_flush() - This walks the full trie (using nextleaf()) and searches for empty - leaves which have to be removed. - -fn_trie_dump() - Dumps the routing table ordered by prefix length. This is somewhat - slower than the corresponding fib_hash function, as we have to walk the - entire trie for each prefix length. In comparison, fib_hash is organized - as one "zone"/hash per prefix length. - -Locking -------- - -fib_lock is used for an RW-lock in the same way that this is done in fib_hash. -However, the functions are somewhat separated for other possible locking -scenarios. It might conceivably be possible to run trie_rebalance via RCU -to avoid read_lock in the fn_trie_lookup() function. - -Main lookup mechanism ---------------------- -fn_trie_lookup() is the main lookup function. - -The lookup is in its simplest form just like fib_find_node(). We descend the -trie, key segment by key segment, until we find a leaf. check_leaf() does -the fib_semantic_match in the leaf's sorted prefix hlist. - -If we find a match, we are done. - -If we don't find a match, we enter prefix matching mode. The prefix length, -starting out at the same as the key length, is reduced one step at a time, -and we backtrack upwards through the trie trying to find a longest matching -prefix. The goal is always to reach a leaf and get a positive result from the -fib_semantic_match mechanism. - -Inside each tnode, the search for longest matching prefix consists of searching -through the child array, chopping off (zeroing) the least significant "1" of -the child index until we find a match or the child index consists of nothing but -zeros. - -At this point we backtrack (t->stats.backtrack++) up the trie, continuing to -chop off part of the key in order to find the longest matching prefix. 
- -At this point we will repeatedly descend subtries to look for a match, and there -are some optimizations available that can provide us with "shortcuts" to avoid -descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. - -To alleviate any doubts about the correctness of the route selection process, -a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which -gives userland access to fib_lookup(). diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 9ef6ef42bdc5..807abe25ae4b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -55,6 +55,7 @@ Contents: dns_resolver driver eql + fib_trie .. only:: subproject and html -- cgit From cb3f0d56e153398a035eb22769d2cb2837f29747 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:36 +0200 Subject: docs: networking: convert filter.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - use footnote markup; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/bpf/index.rst | 4 +- Documentation/networking/filter.rst | 1651 ++++++++++++++++++++++++++++++ Documentation/networking/filter.txt | 1545 ---------------------------- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.txt | 2 +- MAINTAINERS | 2 +- tools/bpf/bpf_asm.c | 2 +- tools/bpf/bpf_dbg.c | 2 +- 8 files changed, 1658 insertions(+), 1551 deletions(-) create mode 100644 Documentation/networking/filter.rst delete mode 100644 Documentation/networking/filter.txt diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index f99677f3572f..38b4db8be7a2 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -7,7 +7,7 @@ Filter) facility, with a focus on the extended BPF version (eBPF). 
This kernel side documentation is still work in progress. The main textual documentation is (for historical reasons) described in -`Documentation/networking/filter.txt`_, which describe both classical +`Documentation/networking/filter.rst`_, which describe both classical and extended BPF instruction-set. The Cilium project also maintains a `BPF and XDP Reference Guide`_ that goes into great technical depth about the BPF Architecture. @@ -59,7 +59,7 @@ Testing and debugging BPF .. Links: -.. _Documentation/networking/filter.txt: ../networking/filter.txt +.. _Documentation/networking/filter.rst: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ .. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html .. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/ diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst new file mode 100644 index 000000000000..a1d3e192b9fa --- /dev/null +++ b/Documentation/networking/filter.rst @@ -0,0 +1,1651 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================================= +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= + +Introduction +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. 
You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + +.. [1] Documentation/userspace-api/seccomp_filter.rst + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). 
USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures:: + + struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ + }; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code:: + + struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; + }; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +:: + + #include <sys/socket.h> + #include <sys/types.h> + #include <arpa/inet.h> + #include <linux/if_ether.h> + /* ... */ + + /* From the example above: tcpdump -i em1 port 22 -dd */ + struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, + }; + + struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, + }; + + sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (sock < 0) + /* ... bail out ... 
*/ + + ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); + if (ret < 0) + /* ... bail out ... */ + + /* ... */ + close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. 
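When a filter is written by hand, the BPF_STMT() and BPF_JUMP() initializer macros from <linux/filter.h> are usually easier to read than raw hex tuples. The following is a minimal sketch, assuming a filter that accepts only ARP frames (EtherType 0x0806 at byte offset 12 of an Ethernet frame). The names arp_filter and arp_prog() are made up for this illustration and are not part of any kernel or libpcap API:

```c
#include <linux/filter.h>

/* Hypothetical hand-written filter: pass ARP frames, drop the rest. */
static struct sock_filter arp_filter[] = {
	/* A = half-word at byte offset 12 (EtherType) */
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
	/* if (A == 0x0806) fall through, else skip one instruction */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),
	/* accept: return the maximum snapshot length */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	/* drop: return 0 bytes */
	BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Wrap the array in the structure expected by SO_ATTACH_FILTER. */
static struct sock_fprog arp_prog(void)
{
	struct sock_fprog prog = {
		.len = sizeof(arp_filter) / sizeof(arp_filter[0]),
		.filter = arp_filter,
	};
	return prog;
}
```

The returned structure can be handed to setsockopt(2) with SO_ATTACH_FILTER in the same way as the libpcap-generated array in the example above.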
+ +BPF engine and instruction set +------------------------------ + +Under tools/bpf/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + ======= ==================================================== + Element Description + ======= ==================================================== + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + ======= ==================================================== + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned):: + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. 
what their underlying +opcodes as defined in linux/filter.h stand for: + + =========== =================== ===================== + Instruction Addressing mode Description + =========== =================== ===================== + ld 1, 2, 3, 4, 12 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 12 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8, 9, 10 Jump on A == <x> + jneq 9, 10 Jump on A != <x> + jne 9, 10 Jump on A != <x> + jlt 9, 10 Jump on A < <x> + jle 9, 10 Jump on A <= <x> + jgt 7, 8, 9, 10 Jump on A > <x> + jge 7, 8, 9, 10 Jump on A >= <x> + jset 7, 8, 9, 10 Jump on A & <x> + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 11 Return + =========== =================== ===================== + +The next table shows addressing formats from the 2nd column: + + =============== =================== =============================================== + Addressing mode Syntax Description + =============== =================== =============================================== + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 9 #k,Lt Jump to Lt if predicate is true + 10 x/%x,Lt Jump to Lt if predicate is true + 11 a/%a Accumulator A + 12 extension BPF extension + =============== =================== =============================================== + +The Linux kernel also has a couple of BPF extensions that are used along +with the
class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + =================================== ================================================= + Extension Description + =================================== ================================================= + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci skb_vlan_tag_get(skb) + vlan_avail skb_vlan_tag_present(skb) + vlan_tpid skb->vlan_proto + rand prandom_u32() + =================================== ================================================= + +These extensions can also be prefixed with '#'. 
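The "overloading" of k can be made concrete with a little arithmetic. The sketch below assumes the SKF_AD_OFF base (-0x1000) and the per-extension offsets from the uapi header <linux/filter.h>. They are redefined locally under an EX_ prefix so the snippet stands alone, and the helper name ext_k() is made up for illustration:

```c
#include <stdint.h>

/* Local copies of values from <linux/filter.h> (assumed, see text above). */
#define EX_SKF_AD_OFF       (-0x1000) /* base of the extension range */
#define EX_SKF_AD_PROTOCOL  0         /* "proto":    skb->protocol */
#define EX_SKF_AD_VLAN_TAG  44        /* "vlan_tci": skb_vlan_tag_get(skb) */
#define EX_SKF_AD_RANDOM    56        /* "rand":     prandom_u32() */

/* k as stored in the 32-bit field of a BPF_LD | BPF_W | BPF_ABS insn. */
static uint32_t ext_k(int ext_offset)
{
	return (uint32_t)(EX_SKF_AD_OFF + ext_offset);
}
```

Under these assumptions, ``ld vlan_tci`` should assemble to { 0x20, 0, 0, 0xfffff02c }: an ordinary absolute word load whose k lies in the negative extension range rather than inside the packet.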
+Examples for low-level BPF: + +**ARP packets**:: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +**IPv4 TCP packets**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +**(Accelerated) VLAN w/ id 10**:: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +**icmp random packet sampling, 1 in 4**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +**SECCOMP filter example**:: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understand and that can be loaded directly. Example with above +ARP code:: + + $ ./bpf_asm foo + 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output:: + + $ ./bpf_asm -c foo + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 1, 0x00000806 }, + { 0x06, 0, 0, 0xffffffff }, + { 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/bpf/ in the kernel source directory.
This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing:: + + # ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +* load pcap foo.pcap + + Loads standard tcpdump pcap file. + +* run [] + +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +* disassemble:: + + l0: ldh [12] + l1: jeq #0x800, l2, l5 + l2: ldb [23] + l3: jeq #0x1, l4, l5 + l4: ret #0xffff + l5: ret #0 + + Prints out BPF code disassembly. + +* dump:: + + /* { op, jt, jf, k }, */ + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 3, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 0, 1, 0x00000001 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0000000000 }, + + Prints out C-style BPF code dump. + +* breakpoint 0:: + + breakpoint at: l0: ldh [12] + +* breakpoint 1:: + + breakpoint at: l1: jeq #0x800, l2, l5 + + ... 
+ + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + * run:: + + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + + * breakpoint:: + + breakpoints: 0 1 + + Prints currently set breakpoints. + +* step [-, +] + + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +* select + + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +* quit + + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, +PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through +CONFIG_BPF_JIT. 
The JIT compiler is transparently invoked for each +attached filter from user space or for internal kernel users if it has +been previously enabled by root:: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via:: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg:: + + [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f + [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 + [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 + [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 + [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 + [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. 
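A program that wants to know which of these modes is active can read the same sysctl file. This is only a sketch: the helper names parse_bpf_jit_enable() and read_bpf_jit_enable() are made up, and the code assumes the file holds a single decimal value of 0 (JIT off), 1 (JIT on) or 2 (JIT on with debug output):

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the textual sysctl value; returns 0, 1 or 2, or -1 if malformed. */
static int parse_bpf_jit_enable(const char *text)
{
	char *end;
	long v = strtol(text, &end, 10);

	if (end == text || v < 0 || v > 2)
		return -1;
	return (int)v;
}

/* Read the current setting; returns -1 if the file cannot be read. */
static int read_bpf_jit_enable(void)
{
	char buf[16];
	FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_bpf_jit_enable(buf);
	fclose(f);
	return ret;
}
```

Writing the value still requires root, as with the echo commands above.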
+ +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump:: + + # ./bpf_jit_disasm + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + : + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + + Issuing option `-o` will "annotate" opcodes to resulting assembler + instructions, which can be very useful for JIT developers: + + # ./bpf_jit_disasm -o + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + : + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. 
However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using bpf_prog_create() for setting up the filter, resp. +bpf_prog_destroy() for destroying it. The macro +BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct bpf_prog that we +got from bpf_prog_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from bpf_check_classic() apply +before a conversion to the new layout is being done behind the scenes! 
+ +Currently, the classic BPF format is being used for JITing on most +32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, +sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF +instruction set. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. 
+ +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as ``if (cond) jump_true; + else jump_false;``, they are being replaced into alternative constructs like + ``if (cond) jump_true; /* else fall-through */``. + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. 
Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions:: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64:: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like:: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + :: + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. 
+ + Then the following internal BPF pseudo-program:: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like:: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x6,%esi + mov $0x7,%edx + mov $0x8,%ecx + mov $0x9,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to:: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into ``%rax`` which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid:: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + An in-kernel eBPF verifier is used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. 
+ +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements:: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using ``exclusive add`` instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can be used as a generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. 
It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts:: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + =================== =============== + Classic BPF classes eBPF classes + =================== =============== + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 BPF_JMP32 0x06 + BPF_MISC 0x07 BPF_ALU64 0x07 + =================== =============== + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + :: + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means:: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means:: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. 
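The three-way split of the 'code' field for arithmetic and jump instructions can be sketched with plain bit masks. The EX_ macros below mirror what BPF_CLASS(), BPF_SRC() and BPF_OP() from <linux/bpf_common.h> compute; the EX_ names, the struct and ex_decode() are hypothetical and exist only for this illustration:

```c
#include <stdint.h>

/* Illustrative masks for the 8-bit code field of ALU and jump insns. */
#define EX_CLASS(code) ((code) & 0x07) /* 3 LSBs: instruction class */
#define EX_SRC(code)   ((code) & 0x08) /* 1 bit:  BPF_K or BPF_X */
#define EX_OP(code)    ((code) & 0xf0) /* 4 MSBs: operation code */

struct ex_decoded {
	uint8_t cls;
	uint8_t src;
	uint8_t op;
};

/* Split one opcode byte into its class, source and operation parts. */
static struct ex_decoded ex_decode(uint8_t code)
{
	struct ex_decoded d = {
		.cls = EX_CLASS(code),
		.src = EX_SRC(code),
		.op  = EX_OP(code),
	};
	return d;
}
```

Decoding 0x15, the second instruction of the tcpdump-generated filter shown earlier, gives class 0x05 (BPF_JMP), source 0x00 (BPF_K) and operation code 0x10.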
+ +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_JA 0x00 /* BPF_JMP only */ + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ + BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ + BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ + BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ + BPF_JSLT 0xc0 /* eBPF only: signed '<' */ + BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit.
eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as +BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide +operands for the comparisons instead. + +For load and store instructions the 8-bit 'code' field is divided as:: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + +:: + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation:: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of:: + + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | | BPF_LD) and +(BPF_IND | | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to ``struct sk_buff`` and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. 
When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example:: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations:: + + BPF_MEM | | BPF_STX: *(size *) (dst_reg + off) = src_reg + BPF_MEM | | BPF_ST: *(size *) (dst_reg + off) = imm32 + BPF_MEM | | BPF_LDX: dst_reg = *(size *) (src_reg + off) + BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg + BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + +eBPF verifier +------------- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression. 
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, +since addition of two valid pointers makes invalid pointer. +(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable:: + + bpf_mov R0 = R2 + bpf_exit + +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + +:: + + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit + +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. +For example:: + + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit + +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn:: + + bpf_ld R0 = *(u32 *)(R6 + 8) + +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=PTR_TO_STACK, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. 
+ +Classic BPF verifier does a similar check with M[0-15] memory slots. +For example:: + + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit + +is an invalid program. +Though R10 is a correct read-only register and has type PTR_TO_STACK +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto(). +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to the return type of the function. + +Function calls are the main mechanism to extend the functionality of eBPF programs. +Socket filters may let programs call one set of functions, whereas tracing +filters may allow a completely different set. + +If a function is made accessible to an eBPF program, it needs to be thought through +from a safety point of view. The verifier will guarantee that the function is +called with valid arguments. + +Seccomp and socket filters have different security restrictions for classic BPF. +Seccomp solves this with a two-stage verifier: the classic BPF verifier is followed +by the seccomp verifier. In the case of eBPF, one configurable verifier is shared for +all use cases. + +See details of eBPF verifier in kernel/bpf/verifier.c + +Register value tracking +----------------------- +In order to determine the safety of an eBPF program, the verifier must track +the range of possible values in each register and also in each stack slot. +This is done with ``struct bpf_reg_state``, defined in include/linux/ +bpf_verifier.h, which unifies tracking of scalar and pointer values. Each +register state has a type, which is either NOT_INIT (the register has not been +written to), SCALAR_VALUE (some value which is not usable as a pointer), or a +pointer type. The types of pointers describe their base, as follows: + + + PTR_TO_CTX + Pointer to bpf_context.
+ CONST_PTR_TO_MAP + Pointer to struct bpf_map. "Const" because arithmetic + on these pointers is forbidden. + PTR_TO_MAP_VALUE + Pointer to the value stored in a map element. + PTR_TO_MAP_VALUE_OR_NULL + Either a pointer to a map value, or NULL; map accesses + (see section 'eBPF maps', below) return this type, + which becomes a PTR_TO_MAP_VALUE when checked != NULL. + Arithmetic on these pointers is forbidden. + PTR_TO_STACK + Frame pointer. + PTR_TO_PACKET + skb->data. + PTR_TO_PACKET_END + skb->data + headlen; arithmetic forbidden. + PTR_TO_SOCKET + Pointer to struct bpf_sock_ops, implicitly refcounted. + PTR_TO_SOCKET_OR_NULL + Either a pointer to a socket, or NULL; socket lookup + returns this type, which becomes a PTR_TO_SOCKET when + checked != NULL. PTR_TO_SOCKET is reference-counted, + so programs must release the reference through the + socket release function before the end of the program. + Arithmetic on these pointers is forbidden. + +However, a pointer may be offset from this base (as a result of pointer +arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable +offset'. The former is used when an exactly-known value (e.g. an immediate +operand) is added to a pointer, while the latter is used for values which are +not exactly known. The variable offset is also used in SCALAR_VALUEs, to track +the range of possible values in the register. + +The verifier's knowledge about the variable offset consists of: + +* minimum and maximum values as unsigned +* minimum and maximum values as signed + +* knowledge of the values of individual bits, in the form of a 'tnum': a u64 + 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; + 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both + mask and value; no bit should ever be 1 in both. 
For example, if a byte is read + into a register from memory, the register's top 56 bits are known zero, while + the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we + then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; + 0x1ff), because of potential carries. + +Besides arithmetic, the register state can also be updated by conditional +branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch +it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' +branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or +BPF_JSGE) would instead update the signed minimum/maximum values. Information +from the signed and unsigned bounds can be combined; for instance if a value is +first tested < 8 and then tested s> 4, the verifier will conclude that the value +is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + +PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all +pointers sharing that same variable offset. This is important for packet range +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + +The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of +the pointer returned from a map lookup. This means that when one copy is +checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. +As well as range-checking, the tracked information is also used for enforcing +alignment of pointer accesses. For instance, on most systems the packet pointer +is 2 bytes after a 4-byte alignment. 
If a program adds 14 bytes to that to jump +over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting +pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 +bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through +that pointer are safe. +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common +to all copies of the pointer returned from a socket lookup. This has similar +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly +represents a reference to the corresponding ``struct sock``. To ensure that the +reference is not leaked, it is imperative to NULL-check the reference and, in +the non-NULL case, pass the valid reference to the socket release function. + +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex:: + + 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ + 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 3: r5 = r3 + 4: r5 += 14 + 5: if r5 > r4 goto pc+16 + R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ + +This 2-byte load from the packet is safe to do, since the program author +did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14).
It also points +to the packet data, but constant 14 was added to the register, so +it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like:: + + + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 10: r3 += r4 + 11: r2 = r1 + 12: r2 <<= 48 + 13: r2 >>= 48 + 14: r3 += r2 + 15: r2 = r3 + 16: r2 += 8 + 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ + 18: if r2 > r1 goto pc+2 + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp + 19: r1 = *(u8 *)(r3 +4) + +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some +offset within a packet and since the program author did +``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add'/'sub' operations on packet registers. Any other +operation will set the register state to 'SCALAR_VALUE' and it won't be +available for direct packet access. + +Operation ``r3 += rX`` may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So when it sees ``r3 += rX`` +instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 +against skb->data_end will not give us 'range' information, so attempts to read +through the pointer will give "invalid access to packet" error. + +Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is +R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits +of the register are guaranteed to be zero, and nothing is known about the lower +8 bits. 
After insn ``r4 *= 14`` the state becomes +R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit +value by constant 14 will keep upper 52 bits as zero, also the least significant +bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make +R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign +extending. This logic is implemented in adjust_reg_min_max_vals() function, +which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice +versa) and adjust_scalar_min_max_vals() for operations on two scalars. + +The end result is that a BPF program author can access the packet directly +using normal C code as:: + + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; + +which makes such programs easier to write compared to the LD_ABS insn +and significantly faster. + +eBPF maps +--------- +'maps' is a generic storage of different types for sharing data between kernel +and userspace.
+ +The maps are accessed from user space via BPF syscall, which has commands: + +- create a map with given type and attributes + ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` + using attr->map_type, attr->key_size, attr->value_size, attr->max_entries + returns process-local file descriptor or negative error + +- lookup key in a given map + ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero and stores found elem into value or negative error + +- create or update key/value pair in a given map + ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero or negative error + +- find and delete element by key in a given map + ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key + +- to delete map: close(fd) + Exiting process will delete maps automatically + +userspace programs use this syscall to create/access maps that eBPF programs +are concurrently updating. + +maps can have different types: hash, array, bloom filter, radix-tree, etc. + +The map is defined by: + + - type + - max number of elements + - key size in bytes + - value size in bytes + +Pruning +------- +The verifier does not actually walk all possible paths through the program. For +each new branch to analyse, the verifier looks at all the states it's previously +been in when at this instruction. If any of them contain the current state as a +subset, the branch is 'pruned' - that is, the fact that the previous state was +accepted implies the current state would be as well. For instance, if in the +previous state, r1 held a packet-pointer, and in the current state, r1 holds a +packet-pointer with a range as long or longer and at least as strict an +alignment, then r1 is safe. 
Similarly, if r2 was NOT_INIT before, then it can't +have been used by any path from that point, so any value in r2 (including +another NOT_INIT) is safe. The implementation is in the function regsafe(). +Pruning considers not only the registers but also the stack (and any spilled +registers it may hold). They must all be safe for the branch to be pruned. +This is implemented in states_equal(). + +Understanding eBPF verifier messages +------------------------------------ + +The following are a few examples of invalid eBPF programs and verifier error +messages as seen in the log: + +Program with unreachable instructions:: + + static struct bpf_insn prog[] = { + BPF_EXIT_INSN(), + BPF_EXIT_INSN(), + }; + +Error:: + + unreachable insn 1 + +Program that reads uninitialized register:: + + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r0 = r2 + R2 !read_ok + +Program that doesn't initialize R0 before exiting:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r1 + 1: (95) exit + R0 !read_ok + +Program that accesses stack out of bounds:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 +8) = 0 + invalid stack off=8 size=8 + +Program that doesn't initialize stack before passing its address into function:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r10 + 1: (07) r2 += -8 + 2: (b7) r1 = 0x0 + 3: (85) call 1 + invalid indirect read from stack off -8+0 size 8 + +Program that uses invalid map_fd=0 while calling the map_lookup_elem() function:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), +
BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + fd 0 is not pointing to valid bpf_map + +Program that doesn't check return value of map_lookup_elem() before accessing +map element:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + 5: (7a) *(u64 *)(r0 +0) = 0 + R0 invalid mem access 'map_value_or_null' + +Program that correctly checks map_lookup_elem() returned value for NULL, but +accesses the memory with incorrect alignment:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+1 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +4) = 0 + misaligned access off 4 size 8 + +Program that correctly checks map_lookup_elem() returned value for NULL and +accesses memory with correct alignment in one side of 'if' branch, but fails +to do so in the other side of 'if' branch:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), + 
BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+2 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +0) = 0 + 7: (95) exit + + from 5 to 8: R0=imm0 R10=fp + 8: (7a) *(u64 *)(r0 +0) = 1 + R0 invalid mem access 'imm' + +Program that performs a socket lookup then sets the pointer to NULL without +checking it:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (b7) r0 = 0 + 9: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Program that performs a socket lookup but does not NULL-check the returned +value:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. 
It can be found in lib/test_bpf.c and +enabled via Kconfig:: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +- Jay Schulist +- Daniel Borkmann +- Alexei Starovoitov diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt deleted file mode 100644 index 2f0f8b17dade..000000000000 --- a/Documentation/networking/filter.txt +++ /dev/null @@ -1,1545 +0,0 @@ -Linux Socket Filtering aka Berkeley Packet Filter (BPF) -======================================================= - -Introduction ------------- - -Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. -Though there are some distinct differences between the BSD and Linux -Kernel filtering, but when we speak of BPF or LSF in Linux context, we -mean the very same mechanism of filtering in the Linux kernel. - -BPF allows a user-space program to attach a filter onto any socket and -allow or disallow certain types of data to come through the socket. LSF -follows exactly the same filter code structure as BSD's BPF, so referring -to the BSD bpf.4 manpage is very helpful in creating filters. - -On Linux, BPF is much simpler than on BSD. One does not have to worry -about devices or anything like that. You simply create your filter code, -send it to the kernel via the SO_ATTACH_FILTER option and if your filter -code passes the kernel check on it, you then immediately begin filtering -data on that socket. 
- -You can also detach filters from your socket via the SO_DETACH_FILTER -option. This will probably not be used much since when you close a socket -that has a filter on it the filter is automagically removed. The other -less common case may be adding a different filter on the same socket where -you had another filter that is still running: the kernel takes care of -removing the old one and placing your new one in its place, assuming your -filter has passed the checks, otherwise if it fails the old filter will -remain on that socket. - -SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once -set, a filter cannot be removed or changed. This allows one process to -setup a socket, attach a filter, lock it then drop privileges and be -assured that the filter will be kept until the socket is closed. - -The biggest user of this construct might be libpcap. Issuing a high-level -filter command like `tcpdump -i em1 port 22` passes through the libpcap -internal compiler that generates a structure that can eventually be loaded -via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` -displays what is being placed into this structure. - -Although we were only speaking about sockets here, BPF in Linux is used -in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel -qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places -such as team driver, PTP code, etc where BPF is being used. - - [1] Documentation/userspace-api/seccomp_filter.rst - -Original BPF paper: - -Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new -architecture for user-level packet capture. In Proceedings of the -USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 -Conference Proceedings (USENIX'93). USENIX Association, Berkeley, -CA, USA, 2-2. 
[http://www.tcpdump.org/papers/bpf-usenix93.pdf] - -Structure ---------- - -User space applications include which contains the -following relevant structures: - -struct sock_filter { /* Filter block */ - __u16 code; /* Actual filter code */ - __u8 jt; /* Jump true */ - __u8 jf; /* Jump false */ - __u32 k; /* Generic multiuse field */ -}; - -Such a structure is assembled as an array of 4-tuples, that contains -a code, jt, jf and k value. jt and jf are jump offsets and k a generic -value to be used for a provided code. - -struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ - unsigned short len; /* Number of filter blocks */ - struct sock_filter __user *filter; -}; - -For socket filtering, a pointer to this structure (as shown in -follow-up example) is being passed to the kernel through setsockopt(2). - -Example -------- - -#include -#include -#include -#include -/* ... */ - -/* From the example above: tcpdump -i em1 port 22 -dd */ -struct sock_filter code[] = { - { 0x28, 0, 0, 0x0000000c }, - { 0x15, 0, 8, 0x000086dd }, - { 0x30, 0, 0, 0x00000014 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 17, 0x00000011 }, - { 0x28, 0, 0, 0x00000036 }, - { 0x15, 14, 0, 0x00000016 }, - { 0x28, 0, 0, 0x00000038 }, - { 0x15, 12, 13, 0x00000016 }, - { 0x15, 0, 12, 0x00000800 }, - { 0x30, 0, 0, 0x00000017 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 8, 0x00000011 }, - { 0x28, 0, 0, 0x00000014 }, - { 0x45, 6, 0, 0x00001fff }, - { 0xb1, 0, 0, 0x0000000e }, - { 0x48, 0, 0, 0x0000000e }, - { 0x15, 2, 0, 0x00000016 }, - { 0x48, 0, 0, 0x00000010 }, - { 0x15, 0, 1, 0x00000016 }, - { 0x06, 0, 0, 0x0000ffff }, - { 0x06, 0, 0, 0x00000000 }, -}; - -struct sock_fprog bpf = { - .len = ARRAY_SIZE(code), - .filter = code, -}; - -sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); -if (sock < 0) - /* ... bail out ... */ - -ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); -if (ret < 0) - /* ... bail out ... 
*/ - -/* ... */ -close(sock); - -The above example code attaches a socket filter for a PF_PACKET socket -in order to let all IPv4/IPv6 packets with port 22 pass. The rest will -be dropped for this socket. - -The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments -and SO_LOCK_FILTER for preventing the filter to be detached, takes an -integer value with 0 or 1. - -Note that socket filters are not restricted to PF_PACKET sockets only, -but can also be used on other socket families. - -Summary of system calls: - - * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); - -Normally, most use cases for socket filtering on packet sockets will be -covered by libpcap in high-level syntax, so as an application developer -you should stick to that. libpcap wraps its own layer around all that. - -Unless i) using/linking to libpcap is not an option, ii) the required BPF -filters use Linux extensions that are not supported by libpcap's compiler, -iii) a filter might be more complex and not cleanly implementable with -libpcap's compiler, or iv) particular filter codes should be optimized -differently than libpcap's internal compiler does; then in such cases -writing such a filter "by hand" can be of an alternative. For example, -xt_bpf and cls_bpf users might have requirements that could result in -more complex filter code, or one that cannot be expressed with libpcap -(e.g. different return codes for various code paths). Moreover, BPF JIT -implementors may wish to manually write test cases and thus need low-level -access to BPF code as well. - -BPF engine and instruction set ------------------------------- - -Under tools/bpf/ there's a small helper tool called bpf_asm which can -be used to write low-level filters for example scenarios mentioned in the -previous section. 
Asm-like syntax mentioned here has been implemented in -bpf_asm and will be used for further explanations (instead of dealing with -less readable opcodes directly, principles are the same). The syntax is -closely modelled after Steven McCanne's and Van Jacobson's BPF paper. - -The BPF architecture consists of the following basic elements: - - Element Description - - A 32 bit wide accumulator - X 32 bit wide X register - M[] 16 x 32 bit wide misc registers aka "scratch memory - store", addressable from 0 to 15 - -A program, that is translated by bpf_asm into "opcodes" is an array that -consists of the following elements (as already mentioned): - - op:16, jt:8, jf:8, k:32 - -The element op is a 16 bit wide opcode that has a particular instruction -encoded. jt and jf are two 8 bit wide jump targets, one for condition -"jump if true", the other one "jump if false". Eventually, element k -contains a miscellaneous argument that can be interpreted in different -ways depending on the given instruction in op. - -The instruction set consists of load, store, branch, alu, miscellaneous -and return instructions that are also represented in bpf_asm syntax. This -table lists all bpf_asm instructions available resp. 
what their underlying
-opcodes as defined in linux/filter.h stand for:
-
-  Instruction      Addressing mode      Description
-
-  ld               1, 2, 3, 4, 12       Load word into A
-  ldi              4                    Load word into A
-  ldh              1, 2                 Load half-word into A
-  ldb              1, 2                 Load byte into A
-  ldx              3, 4, 5, 12          Load word into X
-  ldxi             4                    Load word into X
-  ldxb             5                    Load byte into X
-
-  st               3                    Store A into M[]
-  stx              3                    Store X into M[]
-
-  jmp              6                    Jump to label
-  ja               6                    Jump to label
-  jeq              7, 8, 9, 10          Jump on A == <x>
-  jneq             9, 10                Jump on A != <x>
-  jne              9, 10                Jump on A != <x>
-  jlt              9, 10                Jump on A <  <x>
-  jle              9, 10                Jump on A <= <x>
-  jgt              7, 8, 9, 10          Jump on A >  <x>
-  jge              7, 8, 9, 10          Jump on A >= <x>
-  jset             7, 8, 9, 10          Jump on A &  <x>
-
-  add              0, 4                 A + <x>
-  sub              0, 4                 A - <x>
-  mul              0, 4                 A * <x>
-  div              0, 4                 A / <x>
-  mod              0, 4                 A % <x>
-  neg                                   !A
-  and              0, 4                 A & <x>
-  or               0, 4                 A | <x>
-  xor              0, 4                 A ^ <x>
-  lsh              0, 4                 A << <x>
-  rsh              0, 4                 A >> <x>
-
-  tax                                   Copy A into X
-  txa                                   Copy X into A
-
-  ret              4, 11                Return
-
-The next table shows addressing formats from the 2nd column:
-
-  Addressing mode  Syntax               Description
-
-   0               x/%x                 Register X
-   1               [k]                  BHW at byte offset k in the packet
-   2               [x + k]              BHW at the offset X + k in the packet
-   3               M[k]                 Word at offset k in M[]
-   4               #k                   Literal value stored in k
-   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
-   6               L                    Jump label L
-   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
-   8               x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
-   9               #k,Lt                Jump to Lt if predicate is true
-  10               x/%x,Lt              Jump to Lt if predicate is true
-  11               a/%a                 Accumulator A
-  12               extension            BPF extension
-
-The Linux kernel also has a couple of BPF extensions that are used along
-with the class of load instructions by "overloading" the k argument with
-a negative offset + a particular extension offset. The result of such BPF
-extensions are loaded into A.
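As a rough illustration of the "overloading" of k described above, the
following C sketch builds the instruction that a load of the `proto`
extension could stand for. The `ex_`/`EX_` names are hypothetical and local
to this sketch; the constant values are assumed to match SKF_AD_OFF and
SKF_AD_PROTOCOL in linux/filter.h, and the struct only mirrors the
op:16, jt:8, jf:8, k:32 layout so that the sketch compiles without kernel
headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the classic BPF instruction layout (op:16, jt:8, jf:8, k:32). */
struct ex_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* Assumed to match SKF_AD_OFF and SKF_AD_PROTOCOL in linux/filter.h. */
#define EX_SKF_AD_OFF      (-0x1000)
#define EX_SKF_AD_PROTOCOL 0

/* BPF_LD | BPF_W | BPF_ABS: load a word at absolute offset k into A. */
#define EX_LD_W_ABS 0x20

/*
 * Build a load of an extension: k carries the negative base offset plus
 * the offset of the particular extension, stored as an unsigned 32-bit value.
 */
static struct ex_insn ex_ld_extension(int ad_offset)
{
	struct ex_insn insn = {
		.code = EX_LD_W_ABS,
		.jt = 0,
		.jf = 0,
		.k = (uint32_t)(EX_SKF_AD_OFF + ad_offset),
	};

	assert(ad_offset >= 0);
	return insn;
}
```

With the real header, the same tuple would presumably be written as
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_PROTOCOL); the sketch
only spells out how the negative offset ends up in the k field.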
-
-Possible BPF extensions are shown in the following table:
-
-  Extension                             Description
-
-  len                                   skb->len
-  proto                                 skb->protocol
-  type                                  skb->pkt_type
-  poff                                  Payload start offset
-  ifidx                                 skb->dev->ifindex
-  nla                                   Netlink attribute of type X with offset A
-  nlan                                  Nested Netlink attribute of type X with offset A
-  mark                                  skb->mark
-  queue                                 skb->queue_mapping
-  hatype                                skb->dev->type
-  rxhash                                skb->hash
-  cpu                                   raw_smp_processor_id()
-  vlan_tci                              skb_vlan_tag_get(skb)
-  vlan_avail                            skb_vlan_tag_present(skb)
-  vlan_tpid                             skb->vlan_proto
-  rand                                  prandom_u32()
-
-These extensions can also be prefixed with '#'.
-Examples for low-level BPF:
-
-** ARP packets:
-
-  ldh [12]
-  jne #0x806, drop
-  ret #-1
-  drop: ret #0
-
-** IPv4 TCP packets:
-
-  ldh [12]
-  jne #0x800, drop
-  ldb [23]
-  jneq #6, drop
-  ret #-1
-  drop: ret #0
-
-** (Accelerated) VLAN w/ id 10:
-
-  ld vlan_tci
-  jneq #10, drop
-  ret #-1
-  drop: ret #0
-
-** icmp random packet sampling, 1 in 4
-  ldh [12]
-  jne #0x800, drop
-  ldb [23]
-  jneq #1, drop
-  # get a random uint32 number
-  ld rand
-  mod #4
-  jneq #1, drop
-  ret #-1
-  drop: ret #0
-
-** SECCOMP filter example:
-
-  ld [4]                  /* offsetof(struct seccomp_data, arch) */
-  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
-  ld [0]                  /* offsetof(struct seccomp_data, nr) */
-  jeq #15, good           /* __NR_rt_sigreturn */
-  jeq #231, good          /* __NR_exit_group */
-  jeq #60, good           /* __NR_exit */
-  jeq #0, good            /* __NR_read */
-  jeq #1, good            /* __NR_write */
-  jeq #5, good            /* __NR_fstat */
-  jeq #9, good            /* __NR_mmap */
-  jeq #14, good           /* __NR_rt_sigprocmask */
-  jeq #13, good           /* __NR_rt_sigaction */
-  jeq #35, good           /* __NR_nanosleep */
-  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
-  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */
-
-The above example code can be placed into a file (here called "foo"), and
-then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
-and cls_bpf understands and can directly be loaded with.
Example with above
-ARP code:
-
-$ ./bpf_asm foo
-4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
-
-In copy and paste C-like output:
-
-$ ./bpf_asm -c foo
-{ 0x28,  0,  0, 0x0000000c },
-{ 0x15,  0,  1, 0x00000806 },
-{ 0x06,  0,  0, 0xffffffff },
-{ 0x06,  0,  0, 0000000000 },
-
-In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
-filters that might not be obvious at first, it's good to test filters before
-attaching to a live system. For that purpose, there's a small tool called
-bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
-for testing BPF filters against given pcap files, single stepping through the
-BPF code on the pcap's packets and to do BPF machine register dumps.
-
-Starting bpf_dbg is trivial and just requires issuing:
-
-# ./bpf_dbg
-
-In case input and output do not equal stdin/stdout, bpf_dbg takes an
-alternative stdin source as a first argument, and an alternative stdout
-sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
-
-Other than that, a particular libreadline configuration can be set via
-file "~/.bpf_dbg_init" and the command history is stored in the file
-"~/.bpf_dbg_history".
-
-Interaction in bpf_dbg happens through a shell that also has auto-completion
-support (follow-up example commands starting with '>' denote bpf_dbg shell).
-The usual workflow would be to ...
-
-> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
-  Loads a BPF filter from standard output of bpf_asm, or transformed via
-  e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
-  debugging (next section), this command creates a temporary socket and
-  loads the BPF code into the kernel. Thus, this will also be useful for
-  JIT developers.
-
-> load pcap foo.pcap
-  Loads standard tcpdump pcap file.
-
-> run [<n>]
-bpf passes:1 fails:9
-  Runs through all packets from a pcap to account how many passes and fails
-  the filter will generate.
A limit of packets to traverse can be given.
-
-> disassemble
-l0:     ldh [12]
-l1:     jeq #0x800, l2, l5
-l2:     ldb [23]
-l3:     jeq #0x1, l4, l5
-l4:     ret #0xffff
-l5:     ret #0
-  Prints out BPF code disassembly.
-
-> dump
-/* { op, jt, jf, k }, */
-{ 0x28,  0,  0, 0x0000000c },
-{ 0x15,  0,  3, 0x00000800 },
-{ 0x30,  0,  0, 0x00000017 },
-{ 0x15,  0,  1, 0x00000001 },
-{ 0x06,  0,  0, 0x0000ffff },
-{ 0x06,  0,  0, 0000000000 },
-  Prints out C-style BPF code dump.
-
-> breakpoint 0
-breakpoint at: l0:      ldh [12]
-> breakpoint 1
-breakpoint at: l1:      jeq #0x800, l2, l5
-  ...
-  Sets breakpoints at particular BPF instructions. Issuing a `run` command
-  will walk through the pcap file continuing from the current packet and
-  break when a breakpoint is being hit (another `run` will continue from
-  the currently active breakpoint executing next instructions):
-
-  > run
-  -- register dump --
-  pc:       [0]                       <-- program counter
-  code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
-  curr:     l0: ldh [12]              <-- disassembly of current instruction
-  A:        [00000000][0]             <-- content of A (hex, decimal)
-  X:        [00000000][0]             <-- content of X (hex, decimal)
-  M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
-  -- packet dump --                   <-- Current packet from pcap (hex)
-  len: 42
-    0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
-   16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
-   32: 00 00 00 00 00 00 0a 3b 01 01
-  (breakpoint)
-  >
-
-> breakpoint
-breakpoints: 0 1
-  Prints currently set breakpoints.
-
-> step [-<n>, +<n>]
-  Performs single stepping through the BPF program from the current pc
-  offset. Thus, on each step invocation, above register dump is issued.
-  This can go forwards and backwards in time, a plain `step` will break
-  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
-
-> select <n>
-  Selects a given packet from the pcap file to continue from. Thus, on
-  the next `run` or `step`, the BPF program is being evaluated against
-  the user pre-selected packet.
Numbering starts just as in Wireshark - with index 1. - -> quit -# - Exits bpf_dbg. - -JIT compiler ------------- - -The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, -PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through -CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each -attached filter from user space or for internal kernel users if it has -been previously enabled by root: - - echo 1 > /proc/sys/net/core/bpf_jit_enable - -For JIT developers, doing audits etc, each compile run can output the generated -opcode image into the kernel log via: - - echo 2 > /proc/sys/net/core/bpf_jit_enable - -Example output from dmesg: - -[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f -[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 -[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 -[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 -[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 -[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 - -When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and -setting any other value than that will return in failure. This is even the case for -setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log -is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the -generally recommended approach instead. 
-
-In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
-generating disassembly out of the kernel log's hexdump:
-
-# ./bpf_jit_disasm
-70 bytes emitted from JIT compiler (pass:3, flen:6)
-ffffffffa0069c8f + <x>:
-   0:   push   %rbp
-   1:   mov    %rsp,%rbp
-   4:   sub    $0x60,%rsp
-   8:   mov    %rbx,-0x8(%rbp)
-   c:   mov    0x68(%rdi),%r9d
-  10:   sub    0x6c(%rdi),%r9d
-  14:   mov    0xd8(%rdi),%r8
-  1b:   mov    $0xc,%esi
-  20:   callq  0xffffffffe0ff9442
-  25:   cmp    $0x800,%eax
-  2a:   jne    0x0000000000000042
-  2c:   mov    $0x17,%esi
-  31:   callq  0xffffffffe0ff945e
-  36:   cmp    $0x1,%eax
-  39:   jne    0x0000000000000042
-  3b:   mov    $0xffff,%eax
-  40:   jmp    0x0000000000000044
-  42:   xor    %eax,%eax
-  44:   leaveq
-  45:   retq
-
-Issuing option `-o` will "annotate" opcodes to resulting assembler
-instructions, which can be very useful for JIT developers:
-
-# ./bpf_jit_disasm -o
-70 bytes emitted from JIT compiler (pass:3, flen:6)
-ffffffffa0069c8f + <x>:
-   0:   push   %rbp
-        55
-   1:   mov    %rsp,%rbp
-        48 89 e5
-   4:   sub    $0x60,%rsp
-        48 83 ec 60
-   8:   mov    %rbx,-0x8(%rbp)
-        48 89 5d f8
-   c:   mov    0x68(%rdi),%r9d
-        44 8b 4f 68
-  10:   sub    0x6c(%rdi),%r9d
-        44 2b 4f 6c
-  14:   mov    0xd8(%rdi),%r8
-        4c 8b 87 d8 00 00 00
-  1b:   mov    $0xc,%esi
-        be 0c 00 00 00
-  20:   callq  0xffffffffe0ff9442
-        e8 1d 94 ff e0
-  25:   cmp    $0x800,%eax
-        3d 00 08 00 00
-  2a:   jne    0x0000000000000042
-        75 16
-  2c:   mov    $0x17,%esi
-        be 17 00 00 00
-  31:   callq  0xffffffffe0ff945e
-        e8 28 94 ff e0
-  36:   cmp    $0x1,%eax
-        83 f8 01
-  39:   jne    0x0000000000000042
-        75 07
-  3b:   mov    $0xffff,%eax
-        b8 ff ff 00 00
-  40:   jmp    0x0000000000000044
-        eb 02
-  42:   xor    %eax,%eax
-        31 c0
-  44:   leaveq
-        c9
-  45:   retq
-        c3
-
-For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
-toolchain for developing and testing the kernel's JIT compiler.
-
-BPF kernel internals
----------------------
-Internally, for the kernel interpreter, a different instruction set
-format with similar underlying principles from BPF described in previous
-paragraphs is being used.
However, the instruction set format is modelled -closer to the underlying architecture to mimic native instruction sets, so -that a better performance can be achieved (more details later). This new -ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which -originates from [e]xtended BPF is not the same as BPF extensions! While -eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' -of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) - -It is designed to be JITed with one to one mapping, which can also open up -the possibility for GCC/LLVM compilers to generate optimized eBPF code through -an eBPF backend that performs almost as fast as natively compiled code. - -The new instruction set was originally designed with the possible goal in -mind to write programs in "restricted C" and compile into eBPF with a optional -GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with -minimal performance overhead over two steps, that is, C -> eBPF -> native code. - -Currently, the new format is being used for running user BPF programs, which -includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, -team driver's classifier for its load-balancing mode, netfilter's xt_bpf -extension, PTP dissector/classifier, and much more. They are all internally -converted by the kernel into the new instruction set representation and run -in the eBPF interpreter. For in-kernel handlers, this all works transparently -by using bpf_prog_create() for setting up the filter, resp. -bpf_prog_destroy() for destroying it. The macro -BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed -code to run the filter. 'filter' is a pointer to struct bpf_prog that we -got from bpf_prog_create(), and 'ctx' the given context (e.g. -skb pointer). All constraints and restrictions from bpf_check_classic() apply -before a conversion to the new layout is being done behind the scenes! 
- -Currently, the classic BPF format is being used for JITing on most -32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, -sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF -instruction set. - -Some core changes of the new internal format: - -- Number of registers increase from 2 to 10: - - The old format had two registers A and X, and a hidden frame pointer. The - new layout extends this to be 10 internal registers and a read-only frame - pointer. Since 64-bit CPUs are passing arguments to functions via registers - the number of args from eBPF program to in-kernel function is restricted - to 5 and one register is used to accept return value from an in-kernel - function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ - sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved - registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. - - Therefore, eBPF calling convention is defined as: - - * R0 - return value from in-kernel function, and exit value for eBPF program - * R1 - R5 - arguments from eBPF program to in-kernel function - * R6 - R9 - callee saved registers that in-kernel function will preserve - * R10 - read-only frame pointer to access stack - - Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, - etc, and eBPF calling convention maps directly to ABIs used by the kernel on - 64-bit architectures. - - On 32-bit architectures JIT may map programs that use only 32-bit arithmetic - and may let more complex programs to be interpreted. - - R0 - R5 are scratch registers and eBPF program needs spill/fill them if - necessary across calls. Note that there is only one eBPF program (== one - eBPF main routine) and it cannot call other eBPF functions, it can only - call predefined in-kernel functions, though. 
- -- Register width increases from 32-bit to 64-bit: - - Still, the semantics of the original 32-bit ALU operations are preserved - via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower - subregisters that zero-extend into 64-bit if they are being written to. - That behavior maps directly to x86_64 and arm64 subregister definition, but - makes other JITs more difficult. - - 32-bit architectures run 64-bit internal BPF programs via interpreter. - Their JITs may convert BPF programs that only use 32-bit subregisters into - native instruction set and let the rest being interpreted. - - Operation is 64-bit, because on 64-bit architectures, pointers are also - 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, - so 32-bit eBPF registers would otherwise require to define register-pair - ABI, thus, there won't be able to use a direct eBPF register to HW register - mapping and JIT would need to do combine/split/move operations for every - register in and out of the function, which is complex, bug prone and slow. - Another reason is the use of atomic 64-bit counters. - -- Conditional jt/jf targets replaced with jt/fall-through: - - While the original design has constructs such as "if (cond) jump_true; - else jump_false;", they are being replaced into alternative constructs like - "if (cond) jump_true; /* else fall-through */". - -- Introduces bpf_call insn and register passing convention for zero overhead - calls from/to other kernel functions: - - Before an in-kernel function call, the internal BPF program needs to - place function arguments into R1 to R5 registers to satisfy calling - convention, then the interpreter will take them from registers and pass - to in-kernel function. If R1 - R5 registers are mapped to CPU registers - that are used for argument passing on given architecture, the JIT compiler - doesn't need to emit extra moves. 
Function arguments will be in the correct - registers and BPF_CALL instruction will be JITed as single 'call' HW - instruction. This calling convention was picked to cover common call - situations without performance penalty. - - After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has - a return value of the function. Since R6 - R9 are callee saved, their state - is preserved across the call. - - For example, consider three C functions: - - u64 f1() { return (*_f2)(1); } - u64 f2(u64 a) { return f3(a + 1, a); } - u64 f3(u64 a, u64 b) { return a - b; } - - GCC can compile f1, f3 into x86_64: - - f1: - movl $1, %edi - movq _f2(%rip), %rax - jmp *%rax - f3: - movq %rdi, %rax - subq %rsi, %rax - ret - - Function f2 in eBPF may look like: - - f2: - bpf_mov R2, R1 - bpf_add R1, 1 - bpf_call f3 - bpf_exit - - If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and - returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to - be used to call into f2. - - For practical reasons all eBPF programs have only one argument 'ctx' which is - already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs - can call kernel functions with up to 5 arguments. Calls with 6 or more arguments - are currently not supported, but these restrictions can be lifted if necessary - in the future. - - On 64-bit architectures all register map to HW registers one to one. For - example, x86_64 JIT compiler can map them as ... - - R0 - rax - R1 - rdi - R2 - rsi - R3 - rdx - R4 - rcx - R5 - r8 - R6 - rbx - R7 - r13 - R8 - r14 - R9 - r15 - R10 - rbp - - ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing - and rbx, r12 - r15 are callee saved. 
- - Then the following internal BPF pseudo-program: - - bpf_mov R6, R1 /* save ctx */ - bpf_mov R2, 2 - bpf_mov R3, 3 - bpf_mov R4, 4 - bpf_mov R5, 5 - bpf_call foo - bpf_mov R7, R0 /* save foo() return value */ - bpf_mov R1, R6 /* restore ctx for next call */ - bpf_mov R2, 6 - bpf_mov R3, 7 - bpf_mov R4, 8 - bpf_mov R5, 9 - bpf_call bar - bpf_add R0, R7 - bpf_exit - - After JIT to x86_64 may look like: - - push %rbp - mov %rsp,%rbp - sub $0x228,%rsp - mov %rbx,-0x228(%rbp) - mov %r13,-0x220(%rbp) - mov %rdi,%rbx - mov $0x2,%esi - mov $0x3,%edx - mov $0x4,%ecx - mov $0x5,%r8d - callq foo - mov %rax,%r13 - mov %rbx,%rdi - mov $0x6,%esi - mov $0x7,%edx - mov $0x8,%ecx - mov $0x9,%r8d - callq bar - add %r13,%rax - mov -0x228(%rbp),%rbx - mov -0x220(%rbp),%r13 - leaveq - retq - - Which is in this example equivalent in C to: - - u64 bpf_filter(u64 ctx) - { - return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); - } - - In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 - arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper - registers and place their return value into '%rax' which is R0 in eBPF. - Prologue and epilogue are emitted by JIT and are implicit in the - interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve - them across the calls as defined by calling convention. - - For example the following program is invalid: - - bpf_mov R1, 1 - bpf_call foo - bpf_mov R0, R1 - bpf_exit - - After the call the registers R1-R5 contain junk values and cannot be read. - An in-kernel eBPF verifier is used to validate internal BPF programs. - -Also in the new design, eBPF is limited to 4096 insns, which means that any -program will terminate quickly and will only call a fixed number of kernel -functions. Original BPF and the new format are two operand instructions, -which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. 
- -The input context pointer for invoking the interpreter function is generic, -its content is defined by a specific use case. For seccomp register R1 points -to seccomp_data, for converted BPF filters R1 points to a skb. - -A program, that is translated internally consists of the following elements: - - op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 - -So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field -has room for new instructions. Some of them may use 16/24/32 byte encoding. New -instructions must be multiple of 8 bytes to preserve backward compatibility. - -Internal BPF is a general purpose RISC instruction set. Not every register and -every instruction are used during translation from original BPF to new format. -For example, socket filters are not using 'exclusive add' instruction, but -tracing filters may do to maintain counters of events, for example. Register R9 -is not used by socket filters either, but more complex filters may be running -out of registers and would have to resort to spill/fill to stack. - -Internal BPF can be used as a generic assembler for last step performance -optimizations, socket filters and seccomp are using it as assembler. Tracing -filters may use it as assembler to generate code from kernel. In kernel usage -may not be bounded by security considerations, since generated internal BPF code -may be optimizing internal code path and not being exposed to the user space. -Safety of internal BPF can come from a verifier (TBD). In such use cases as -described, it may be used as safe instruction set. - -Just like the original BPF, the new format runs within a controlled environment, -is deterministic and the kernel can easily prove that. The safety of the program -can be determined in two steps: first step does depth-first-search to disallow -loops and other CFG validation; second step starts from the first insn and -descends all possible paths. 
It simulates execution of every insn and observes
-the state change of registers and stack.
-
-eBPF opcode encoding
----------------------
-
-eBPF is reusing most of the opcode encoding from classic to simplify conversion
-of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
-field is divided into three parts:
-
-  +----------------+--------+--------------------+
-  |   4 bits       |  1 bit |   3 bits           |
-  | operation code | source | instruction class  |
-  +----------------+--------+--------------------+
-  (MSB)                                      (LSB)
-
-Three LSB bits store instruction class which is one of:
-
-  Classic BPF classes:    eBPF classes:
-
-  BPF_LD    0x00          BPF_LD    0x00
-  BPF_LDX   0x01          BPF_LDX   0x01
-  BPF_ST    0x02          BPF_ST    0x02
-  BPF_STX   0x03          BPF_STX   0x03
-  BPF_ALU   0x04          BPF_ALU   0x04
-  BPF_JMP   0x05          BPF_JMP   0x05
-  BPF_RET   0x06          BPF_JMP32 0x06
-  BPF_MISC  0x07          BPF_ALU64 0x07
-
-When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
-
-  BPF_K     0x00
-  BPF_X     0x08
-
- * in classic BPF, this means:
-
-  BPF_SRC(code) == BPF_X - use register X as source operand
-  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
- * in eBPF, this means:
-
-  BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
-  BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
-... and four MSB bits store operation code.
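The three-part split of the 8-bit 'code' field described above can be
sketched in C as follows. The `EX_` macros and `ex_` names are hypothetical
and local to this sketch; the masks are assumed to correspond to the
BPF_CLASS(), BPF_SRC() and BPF_OP() helpers from the kernel's uapi headers.

```c
#include <assert.h>
#include <stdint.h>

/* Masks for the three parts of an arithmetic/jump 'code' byte. */
#define EX_CLASS(code) ((code) & 0x07)	/* 3 LSB bits: instruction class */
#define EX_SRC(code)   ((code) & 0x08)	/* 4th bit: source operand */
#define EX_OP(code)    ((code) & 0xf0)	/* 4 MSB bits: operation code */

struct ex_fields {
	uint8_t op;
	uint8_t src;
	uint8_t cls;
};

/* Split a code byte into operation code, source bit and class. */
static struct ex_fields ex_decode(uint8_t code)
{
	struct ex_fields f = {
		.op = EX_OP(code),
		.src = EX_SRC(code),
		.cls = EX_CLASS(code),
	};

	/* The three parts cover the whole byte without overlapping. */
	assert((f.op | f.src | f.cls) == code);
	return f;
}
```

For instance, the 0x15 opcode seen in the earlier bpf_asm output would
decode to class 0x05 (BPF_JMP), source 0x00 (BPF_K) and operation code 0x10.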
- -If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: - - BPF_ADD 0x00 - BPF_SUB 0x10 - BPF_MUL 0x20 - BPF_DIV 0x30 - BPF_OR 0x40 - BPF_AND 0x50 - BPF_LSH 0x60 - BPF_RSH 0x70 - BPF_NEG 0x80 - BPF_MOD 0x90 - BPF_XOR 0xa0 - BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ - BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ - BPF_END 0xd0 /* eBPF only: endianness conversion */ - -If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of: - - BPF_JA 0x00 /* BPF_JMP only */ - BPF_JEQ 0x10 - BPF_JGT 0x20 - BPF_JGE 0x30 - BPF_JSET 0x40 - BPF_JNE 0x50 /* eBPF only: jump != */ - BPF_JSGT 0x60 /* eBPF only: signed '>' */ - BPF_JSGE 0x70 /* eBPF only: signed '>=' */ - BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ - BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ - BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ - BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ - BPF_JSLT 0xc0 /* eBPF only: signed '<' */ - BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ - -So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF -and eBPF. There are only two registers in classic BPF, so it means A += X. -In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, -BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous -src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. - -Classic BPF is using BPF_MISC class to represent A = X and X = A moves. -eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no -BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean -exactly the same operations as BPF_ALU, but with 64-bit wide operands -instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: -dst_reg = dst_reg + src_reg - -Classic BPF wastes the whole BPF_RET class to represent a single 'ret' -operation. Classic BPF_RET | BPF_K means copy imm32 into return register -and perform function exit. 
eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
-in eBPF means function exit only. The eBPF program needs to store return
-value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
-BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
-operands for the comparisons instead.
-
-For load and store instructions the 8-bit 'code' field is divided as:
-
-  +--------+--------+-------------------+
-  | 3 bits | 2 bits |   3 bits          |
-  |  mode  |  size  | instruction class |
-  +--------+--------+-------------------+
-  (MSB)                             (LSB)
-
-Size modifier is one of ...
-
-  BPF_W   0x00    /* word */
-  BPF_H   0x08    /* half word */
-  BPF_B   0x10    /* byte */
-  BPF_DW  0x18    /* eBPF only, double word */
-
-... which encodes size of load/store operation:
-
- B  - 1 byte
- H  - 2 byte
- W  - 4 byte
- DW - 8 byte (eBPF only)
-
-Mode modifier is one of:
-
-  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
-  BPF_ABS  0x20
-  BPF_IND  0x40
-  BPF_MEM  0x60
-  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
-  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
-  BPF_XADD 0xc0  /* eBPF only, exclusive add */
-
-eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
-(BPF_IND | <size> | BPF_LD) which are used to access packet data.
-
-They had to be carried over from classic to have strong performance of
-socket filters running in eBPF interpreter. These instructions can only
-be used when interpreter context is a pointer to 'struct sk_buff' and
-have seven implicit operands. Register R6 is an implicit input that must
-contain pointer to sk_buff. Register R0 is an implicit output which contains
-the data fetched from the packet. Registers R1-R5 are scratch registers
-and must not be used to store the data across BPF_ABS | BPF_LD or
-BPF_IND | BPF_LD instructions.
-
-These instructions have implicit program exit condition as well.
When
-eBPF program is trying to access the data beyond the packet boundary,
-the interpreter will abort the execution of the program. JIT compilers
-therefore must preserve this property. src_reg and imm32 fields are
-explicit inputs to these instructions.
-
-For example:
-
-  BPF_IND | BPF_W | BPF_LD means:
-
-    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
-    and R1 - R5 were scratched.
-
-Unlike classic BPF instruction set, eBPF has generic load/store operations:
-
-BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
-BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
-BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
-BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
-BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
-
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
-2 byte atomic increments are not supported.
-
-eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
-of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single
-instruction that loads 64-bit immediate value into a dst_reg.
-Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
-32-bit immediate value into a register.
-
-eBPF verifier
--------------
-The safety of the eBPF program is determined in two steps.
-
-First step does DAG check to disallow loops and other CFG validation.
-In particular it will detect programs that have unreachable instructions.
-(though classic BPF checker allows them)
-
-Second step starts from the first insn and descends all possible paths.
-It simulates execution of every insn and observes the state change of
-registers and stack.
-
-At the start of the program the register R1 contains a pointer to context
-and has type PTR_TO_CTX.
-If verifier sees an insn that does R2=R1, then R2 has now type
-PTR_TO_CTX as well and can be used on the right hand side of expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, -since addition of two valid pointers makes invalid pointer. -(In 'secure' mode verifier will reject any type of pointer arithmetic to make -sure that kernel addresses don't leak to unprivileged users) - -If register was never written to, it's not readable: - bpf_mov R0 = R2 - bpf_exit -will be rejected, since R2 is unreadable at the start of the program. - -After kernel function call, R1-R5 are reset to unreadable and -R0 has a return type of the function. - -Since R6-R9 are callee saved, their state is preserved across the call. - bpf_mov R6 = 1 - bpf_call foo - bpf_mov R0 = R6 - bpf_exit -is a correct program. If there was R1 instead of R6, it would have -been rejected. - -load/store instructions are allowed only with registers of valid types, which -are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. -For example: - bpf_mov R1 = 1 - bpf_mov R2 = 2 - bpf_xadd *(u32 *)(R1 + 3) += R2 - bpf_exit -will be rejected, since R1 doesn't have a valid pointer type at the time of -execution of instruction bpf_xadd. - -At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context') -A callback is used to customize verifier to restrict eBPF program access to only -certain fields within ctx structure with specified size and alignment. - -For example, the following insn: - bpf_ld R0 = *(u32 *)(R6 + 8) -intends to load a word from address R6 + 8 and store it into R0 -If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know -that offset 8 of size 4 bytes can be accessed for reading, otherwise -the verifier will reject the program. -If R6=PTR_TO_STACK, then access should be aligned and be within -stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, -so it will fail verification, since it's out of bounds. - -The verifier will allow eBPF program to read data from stack only after -it wrote into it. 
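The stack rules described here (bounds of [-MAX_BPF_STACK, 0), aligned access, and no reads before a write) can be illustrated with a toy model. This is a sketch under stated assumptions, not the verifier's implementation: the StackModel class and its method names are invented for illustration, and MAX_BPF_STACK = 512 is assumed since the value is not given in this text:

```python
MAX_BPF_STACK = 512  # assumption: value not stated in the text above


class StackModel:
    """Toy model of the stack checks; offsets are relative to R10."""

    def __init__(self):
        self.written = set()  # byte offsets that have been stored to

    def _check(self, off, size):
        # The whole access must lie inside [-MAX_BPF_STACK, 0).
        if off < -MAX_BPF_STACK or off + size > 0:
            return "out of bounds"
        # The access must be naturally aligned for its size.
        if off % size != 0:
            return "misaligned"
        return None

    def store(self, off, size):
        err = self._check(off, size)
        if err is None:
            self.written.update(range(off, off + size))
            return "ok"
        return err

    def load(self, off, size):
        err = self._check(off, size)
        if err is not None:
            return err
        # Every byte read must have been written earlier.
        if not all(b in self.written for b in range(off, off + size)):
            return "read of uninitialized stack"
        return "ok"


if __name__ == "__main__":
    stack = StackModel()
    print(stack.load(8, 4))    # offset 8 is above the frame pointer
    print(stack.load(-4, 4))   # in bounds, but nothing was stored yet
    print(stack.store(-8, 8))
    print(stack.load(-4, 4))   # now covered by the 8-byte store
```

The model tracks initialization per byte for simplicity; the real verifier keeps richer per-slot state, including spilled pointer registers.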
-Classic BPF verifier does similar check with M[0-15] memory slots. -For example: - bpf_ld R0 = *(u32 *)(R10 - 4) - bpf_exit -is invalid program. -Though R10 is correct read-only register and has type PTR_TO_STACK -and R10 - 4 is within stack bounds, there were no stores into that location. - -Pointer register spill/fill is tracked as well, since four (R6-R9) -callee saved registers may not be enough for some programs. - -Allowed function calls are customized with bpf_verifier_ops->get_func_proto() -The eBPF verifier will check that registers match argument constraints. -After the call register R0 will be set to return type of the function. - -Function calls is a main mechanism to extend functionality of eBPF programs. -Socket filters may let programs to call one set of functions, whereas tracing -filters may allow completely different set. - -If a function made accessible to eBPF program, it needs to be thought through -from safety point of view. The verifier will guarantee that the function is -called with valid arguments. - -seccomp vs socket filters have different security restrictions for classic BPF. -Seccomp solves this by two stage verifier: classic BPF verifier is followed -by seccomp verifier. In case of eBPF one configurable verifier is shared for -all use cases. - -See details of eBPF verifier in kernel/bpf/verifier.c - -Register value tracking ------------------------ -In order to determine the safety of an eBPF program, the verifier must track -the range of possible values in each register and also in each stack slot. -This is done with 'struct bpf_reg_state', defined in include/linux/ -bpf_verifier.h, which unifies tracking of scalar and pointer values. Each -register state has a type, which is either NOT_INIT (the register has not been -written to), SCALAR_VALUE (some value which is not usable as a pointer), or a -pointer type. The types of pointers describe their base, as follows: - PTR_TO_CTX Pointer to bpf_context. 
- CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic - on these pointers is forbidden. - PTR_TO_MAP_VALUE Pointer to the value stored in a map element. - PTR_TO_MAP_VALUE_OR_NULL - Either a pointer to a map value, or NULL; map accesses - (see section 'eBPF maps', below) return this type, - which becomes a PTR_TO_MAP_VALUE when checked != NULL. - Arithmetic on these pointers is forbidden. - PTR_TO_STACK Frame pointer. - PTR_TO_PACKET skb->data. - PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. - PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted. - PTR_TO_SOCKET_OR_NULL - Either a pointer to a socket, or NULL; socket lookup - returns this type, which becomes a PTR_TO_SOCKET when - checked != NULL. PTR_TO_SOCKET is reference-counted, - so programs must release the reference through the - socket release function before the end of the program. - Arithmetic on these pointers is forbidden. -However, a pointer may be offset from this base (as a result of pointer -arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable -offset'. The former is used when an exactly-known value (e.g. an immediate -operand) is added to a pointer, while the latter is used for values which are -not exactly known. The variable offset is also used in SCALAR_VALUEs, to track -the range of possible values in the register. -The verifier's knowledge about the variable offset consists of: -* minimum and maximum values as unsigned -* minimum and maximum values as signed -* knowledge of the values of individual bits, in the form of a 'tnum': a u64 -'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; -1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both -mask and value; no bit should ever be 1 in both. 
For example, if a byte is read -into a register from memory, the register's top 56 bits are known zero, while -the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we -then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; -0x1ff), because of potential carries. - -Besides arithmetic, the register state can also be updated by conditional -branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch -it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' -branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or -BPF_JSGE) would instead update the signed minimum/maximum values. Information -from the signed and unsigned bounds can be combined; for instance if a value is -first tested < 8 and then tested s> 4, the verifier will conclude that the value -is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. - -PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all -pointers sharing that same variable offset. This is important for packet range -checks: after adding a variable to a packet pointer register A, if you then copy -it to another register B and then add a constant 4 to A, both registers will -share the same 'id' but the A will have a fixed offset of +4. Then if A is -bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is -now known to have a safe range of at least 4 bytes. See 'Direct packet access', -below, for more on PTR_TO_PACKET ranges. - -The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of -the pointer returned from a map lookup. This means that when one copy is -checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. -As well as range-checking, the tracked information is also used for enforcing -alignment of pointer accesses. For instance, on most systems the packet pointer -is 2 bytes after a 4-byte alignment. 
If a program adds 14 bytes to that to jump -over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting -pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 -bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through -that pointer are safe. -The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common -to all copies of the pointer returned from a socket lookup. This has similar -behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but -it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly -represents a reference to the corresponding 'struct sock'. To ensure that the -reference is not leaked, it is imperative to NULL-check the reference and, in -the non-NULL case, pass the valid reference to the socket release function. - -Direct packet access --------------------- -In cls_bpf and act_bpf programs the verifier allows direct access to the packet -data via skb->data and skb->data_end pointers. -Ex: -1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ -2: r3 = *(u32 *)(r1 +76) /* load skb->data */ -3: r5 = r3 -4: r5 += 14 -5: if r5 > r4 goto pc+16 -R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp -6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ - -this 2byte load from the packet is safe to do, since the program author -did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which -means that in the fall-through case the register R3 (which points to skb->data) -has at least 14 directly accessible bytes. The verifier marks it -as R3=pkt(id=0,off=0,r=14). -id=0 means that no additional variables were added to the register. -off=0 means that no additional constants were added. -r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. -Note that R5 is marked as R5=pkt(id=0,off=14,r=14).
It also points -to the packet data, but constant 14 was added to the register, so -it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) -which is zero bytes. - -More complex packet access may look like: - R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp - 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ - 7: r4 = *(u8 *)(r3 +12) - 8: r4 *= 14 - 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ -10: r3 += r4 -11: r2 = r1 -12: r2 <<= 48 -13: r2 >>= 48 -14: r3 += r2 -15: r2 = r3 -16: r2 += 8 -17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ -18: if r2 > r1 goto pc+2 - R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp -19: r1 = *(u8 *)(r3 +4) -The state of the register R3 is R3=pkt(id=2,off=0,r=8) -id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some -offset within a packet and since the program author did -'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). -The verifier only allows 'add'/'sub' operations on packet registers. Any other -operation will set the register state to 'SCALAR_VALUE' and it won't be -available for direct packet access. -Operation 'r3 += rX' may overflow and become less than original skb->data, -therefore the verifier has to prevent that. So when it sees 'r3 += rX' -instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 -against skb->data_end will not give us 'range' information, so attempts to read -through the pointer will give "invalid access to packet" error. -Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is -R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits -of the register are guaranteed to be zero, and nothing is known about the lower -8 bits. 
After insn 'r4 *= 14' the state becomes -R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit -value by constant 14 will keep upper 52 bits as zero, also the least significant -bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make -R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign -extending. This logic is implemented in adjust_reg_min_max_vals() function, -which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice -versa) and adjust_scalar_min_max_vals() for operations on two scalars. - -The end result is that bpf program author can access packet directly -using normal C code as: - void *data = (void *)(long)skb->data; - void *data_end = (void *)(long)skb->data_end; - struct eth_hdr *eth = data; - struct iphdr *iph = data + sizeof(*eth); - struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); - - if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) - return 0; - if (eth->h_proto != htons(ETH_P_IP)) - return 0; - if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) - return 0; - if (udp->dest == 53 || udp->source == 9) - ...; -which makes such programs easier to write comparing to LD_ABS insn -and significantly faster. - -eBPF maps ---------- -'maps' is a generic storage of different types for sharing data between kernel -and userspace. 
- -The maps are accessed from user space via BPF syscall, which has commands: -- create a map with given type and attributes - map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) - using attr->map_type, attr->key_size, attr->value_size, attr->max_entries - returns process-local file descriptor or negative error - -- lookup key in a given map - err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero and stores found elem into value or negative error - -- create or update key/value pair in a given map - err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero or negative error - -- find and delete element by key in a given map - err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key - -- to delete map: close(fd) - Exiting process will delete maps automatically - -userspace programs use this syscall to create/access maps that eBPF programs -are concurrently updating. - -maps can have different types: hash, array, bloom filter, radix-tree, etc. - -The map is defined by: - . type - . max number of elements - . key size in bytes - . value size in bytes - -Pruning -------- -The verifier does not actually walk all possible paths through the program. For -each new branch to analyse, the verifier looks at all the states it's previously -been in when at this instruction. If any of them contain the current state as a -subset, the branch is 'pruned' - that is, the fact that the previous state was -accepted implies the current state would be as well. For instance, if in the -previous state, r1 held a packet-pointer, and in the current state, r1 holds a -packet-pointer with a range as long or longer and at least as strict an -alignment, then r1 is safe. 
Similarly, if r2 was NOT_INIT before then it can't -have been used by any path from that point, so any value in r2 (including -another NOT_INIT) is safe. The implementation is in the function regsafe(). -Pruning considers not only the registers but also the stack (and any spilled -registers it may hold). They must all be safe for the branch to be pruned. -This is implemented in states_equal(). - -Understanding eBPF verifier messages ------------------------------------- - -The following are few examples of invalid eBPF programs and verifier error -messages as seen in the log: - -Program with unreachable instructions: -static struct bpf_insn prog[] = { - BPF_EXIT_INSN(), - BPF_EXIT_INSN(), -}; -Error: - unreachable insn 1 - -Program that reads uninitialized register: - BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), - BPF_EXIT_INSN(), -Error: - 0: (bf) r0 = r2 - R2 !read_ok - -Program that doesn't initialize R0 before exiting: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r1 - 1: (95) exit - R0 !read_ok - -Program that accesses stack out of bounds: - BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 +8) = 0 - invalid stack off=8 size=8 - -Program that doesn't initialize stack before passing its address into function: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r10 - 1: (07) r2 += -8 - 2: (b7) r1 = 0x0 - 3: (85) call 1 - invalid indirect read from stack off -8+0 size 8 - -Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 
1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - fd 0 is not pointing to valid bpf_map - -Program that doesn't check return value of map_lookup_elem() before accessing -map element: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - 5: (7a) *(u64 *)(r0 +0) = 0 - R0 invalid mem access 'map_value_or_null' - -Program that correctly checks map_lookup_elem() returned value for NULL, but -accesses the memory with incorrect alignment: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+1 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +4) = 0 - misaligned access off 4 size 8 - -Program that correctly checks map_lookup_elem() returned value for NULL and -accesses memory with correct alignment in one side of 'if' branch, but fails -to do so in the other side of 'if' branch: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: 
(07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+2 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +0) = 0 - 7: (95) exit - - from 5 to 8: R0=imm0 R10=fp - 8: (7a) *(u64 *)(r0 +0) = 1 - R0 invalid mem access 'imm' - -Program that performs a socket lookup then sets the pointer to NULL without -checking it: -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (b7) r0 = 0 - 9: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Program that performs a socket lookup but does not NULL-check the returned -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Testing -------- - -Next to the BPF toolchain, the kernel also ships a test module that contains -various test cases for classic and internal BPF that can be executed against -the BPF interpreter and JIT compiler. 
It can be found in lib/test_bpf.c and -enabled via Kconfig: - - CONFIG_TEST_BPF=m - -After the module has been built and installed, the test suite can be executed -via insmod or modprobe against 'test_bpf' module. Results of the test cases -including timings in nsec can be found in the kernel log (dmesg). - -Misc ----- - -Also trinity, the Linux syscall fuzzer, has built-in support for BPF and -SECCOMP-BPF kernel fuzzing. - -Written by ----------- - -The document was written in the hope that it is found useful and in order -to give potential BPF hackers or security auditors a better overview of -the underlying architecture. - -Jay Schulist -Daniel Borkmann -Alexei Starovoitov diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 807abe25ae4b..144ed838c1a9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -56,6 +56,7 @@ Contents: driver eql fib_trie + filter .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 999eb41da81d..494614573c67 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -1051,7 +1051,7 @@ for more information on hardware timestamps. 
------------------------------------------------------------------------------- - Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.txt + might want to have a look at Documentation/networking/filter.rst -------------------------------------------------------------------------------- + THANKS diff --git a/MAINTAINERS b/MAINTAINERS index 7323bfc1720f..4ec6d2741d36 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3192,7 +3192,7 @@ Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git F: Documentation/bpf/ -F: Documentation/networking/filter.txt +F: Documentation/networking/filter.rst F: arch/*/net/* F: include/linux/bpf* F: include/linux/filter.h diff --git a/tools/bpf/bpf_asm.c b/tools/bpf/bpf_asm.c index e5f95e3eede3..0063c3c029e7 100644 --- a/tools/bpf/bpf_asm.c +++ b/tools/bpf/bpf_asm.c @@ -11,7 +11,7 @@ * * How to get into it: * - * 1) read Documentation/networking/filter.txt + * 1) read Documentation/networking/filter.rst * 2) Run `bpf_asm [-c] ` to translate into binary * blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will * pretty print a C-like construct. diff --git a/tools/bpf/bpf_dbg.c b/tools/bpf/bpf_dbg.c index 9d3766e653a9..a0ebcdf59c31 100644 --- a/tools/bpf/bpf_dbg.c +++ b/tools/bpf/bpf_dbg.c @@ -13,7 +13,7 @@ * for making a verdict when multiple simple BPF programs are combined * into one in order to prevent parsing same headers multiple times. * - * More on how to debug BPF opcodes see Documentation/networking/filter.txt + * More on how to debug BPF opcodes see Documentation/networking/filter.rst * which is the main document on BPF. 
Mini howto for getting started: * * 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'): -- cgit From 62502dff2c5012c19727bd992b0101a816095f1e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:37 +0200 Subject: docs: networking: convert fore200e.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/fore200e.rst | 66 +++++++++++++++++++++++++++++++++++ Documentation/networking/fore200e.txt | 64 --------------------------------- Documentation/networking/index.rst | 1 + drivers/atm/Kconfig | 2 +- 4 files changed, 68 insertions(+), 65 deletions(-) create mode 100644 Documentation/networking/fore200e.rst delete mode 100644 Documentation/networking/fore200e.txt diff --git a/Documentation/networking/fore200e.rst b/Documentation/networking/fore200e.rst new file mode 100644 index 000000000000..55df9ec09ac8 --- /dev/null +++ b/Documentation/networking/fore200e.rst @@ -0,0 +1,66 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================= +FORE Systems PCA-200E/SBA-200E ATM NIC driver +============================================= + +This driver adds support for the FORE Systems 200E-series ATM adapters +to the Linux operating system. It is based on the earlier PCA-200E driver +written by Uwe Dannowski. + +The driver simultaneously supports PCA-200E and SBA-200E adapters on +i386, alpha (untested), powerpc, sparc and sparc64 archs. + +The intent is to enable the use of different models of FORE adapters at the +same time, by hosts that have several bus interfaces (such as PCI+SBUS, +or PCI+EISA). + +Only PCI and SBUS devices are currently supported by the driver, but support +for other bus interfaces such as EISA should not be too hard to add.
+ + +Firmware Copyright Notice +------------------------- + +Please read the fore200e_firmware_copyright file present +in the linux/drivers/atm directory for details and restrictions. + + +Firmware Updates +---------------- + +The FORE Systems 200E-series driver is shipped with firmware data being +uploaded to the ATM adapters at system boot time or at module loading time. +The supplied firmware images should work with all adapters. + +However, if you encounter problems (the firmware doesn't start or the driver +is unable to read the PROM data), you may consider trying another firmware +version. Alternative binary firmware images can be found somewhere on the +ForeThought CD-ROM supplied with your adapter by FORE Systems. + +You can also get the latest firmware images from FORE Systems at +https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to +the 'software updates' pages. The firmware binaries are part of +the various ForeThought software distributions. + +Notice that different versions of the PCA-200E firmware exist, depending +on the endianness of the host architecture. The driver is shipped with +both little and big endian PCA firmware images. + +Name and location of the new firmware images can be set at kernel +configuration time: + +1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix) + to some directory, such as linux/drivers/atm. + +2. Reconfigure your kernel to set the new firmware name and location. + Expected pathnames are absolute or relative to the drivers/atm directory. + +3. Rebuild and re-install your kernel or your module. + + +Feedback +-------- + +Feedback is welcome. Please send success stories/bug reports/ +patches/improvement/comments/flames to . 
diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt deleted file mode 100644 index 1f98f62b4370..000000000000 --- a/Documentation/networking/fore200e.txt +++ /dev/null @@ -1,64 +0,0 @@ - -FORE Systems PCA-200E/SBA-200E ATM NIC driver ---------------------------------------------- - -This driver adds support for the FORE Systems 200E-series ATM adapters -to the Linux operating system. It is based on the earlier PCA-200E driver -written by Uwe Dannowski. - -The driver simultaneously supports PCA-200E and SBA-200E adapters on -i386, alpha (untested), powerpc, sparc and sparc64 archs. - -The intent is to enable the use of different models of FORE adapters at the -same time, by hosts that have several bus interfaces (such as PCI+SBUS, -or PCI+EISA). - -Only PCI and SBUS devices are currently supported by the driver, but support -for other bus interfaces such as EISA should not be too hard to add. - - -Firmware Copyright Notice -------------------------- - -Please read the fore200e_firmware_copyright file present -in the linux/drivers/atm directory for details and restrictions. - - -Firmware Updates ----------------- - -The FORE Systems 200E-series driver is shipped with firmware data being -uploaded to the ATM adapters at system boot time or at module loading time. -The supplied firmware images should work with all adapters. - -However, if you encounter problems (the firmware doesn't start or the driver -is unable to read the PROM data), you may consider trying another firmware -version. Alternative binary firmware images can be found somewhere on the -ForeThought CD-ROM supplied with your adapter by FORE Systems. - -You can also get the latest firmware images from FORE Systems at -https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to -the 'software updates' pages. The firmware binaries are part of -the various ForeThought software distributions. 
- -Notice that different versions of the PCA-200E firmware exist, depending -on the endianness of the host architecture. The driver is shipped with -both little and big endian PCA firmware images. - -Name and location of the new firmware images can be set at kernel -configuration time: - -1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix) - to some directory, such as linux/drivers/atm. - -2. Reconfigure your kernel to set the new firmware name and location. - Expected pathnames are absolute or relative to the drivers/atm directory. - -3. Rebuild and re-install your kernel or your module. - - -Feedback --------- - -Feedback is welcome. Please send success stories/bug reports/ -patches/improvement/comments/flames to . diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 144ed838c1a9..b2fb8b907d68 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -57,6 +57,7 @@ Contents: eql fib_trie filter + fore200e .. only:: subproject and html diff --git a/drivers/atm/Kconfig b/drivers/atm/Kconfig index 8c37294f1d1e..4af7cbdcc349 100644 --- a/drivers/atm/Kconfig +++ b/drivers/atm/Kconfig @@ -336,7 +336,7 @@ config ATM_FORE200E on PCI and SBUS hosts. Say Y (or M to compile as a module named fore_200e) here if you have one of these ATM adapters. - See the file for + See the file for further details. config ATM_FORE200E_USE_TASKLET -- cgit From 5b0d74b54c7f1cb9c65955df78dffe112e1959c1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:38 +0200 Subject: docs: networking: convert framerelay.txt to ReST - add SPDX header; - add a document title; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S.
Miller --- Documentation/networking/framerelay.rst | 44 +++++++++++++++++++++++++++++++++ Documentation/networking/framerelay.txt | 39 ----------------------------- Documentation/networking/index.rst | 1 + drivers/net/wan/Kconfig | 4 +-- 4 files changed, 47 insertions(+), 41 deletions(-) create mode 100644 Documentation/networking/framerelay.rst delete mode 100644 Documentation/networking/framerelay.txt diff --git a/Documentation/networking/framerelay.rst b/Documentation/networking/framerelay.rst new file mode 100644 index 000000000000..6d904399ec6d --- /dev/null +++ b/Documentation/networking/framerelay.rst @@ -0,0 +1,44 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +Frame Relay (FR) +================ + +Frame Relay (FR) support for linux is built into a two tiered system of device +drivers. The upper layer implements RFC1490 FR specification, and uses the +Data Link Connection Identifier (DLCI) as its hardware address. Usually these +are assigned by your network supplier, they give you the number/numbers of +the Virtual Connections (VC) assigned to you. + +Each DLCI is a point-to-point link between your machine and a remote one. +As such, a separate device is needed to accommodate the routing. Within the +net-tools archives is 'dlcicfg'. This program will communicate with the +base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'... +The configuration script will ask you how many DLCIs you need, as well as +how many DLCIs you want to assign to each Frame Relay Access Device (FRAD). + +The DLCI uses a number of function calls to communicate with the FRAD, all +of which are stored in the FRAD's private data area. assoc/deassoc, +activate/deactivate and dlci_config. The DLCI supplies a receive function +to the FRAD to accept incoming packets. + +With this initial offering, only 1 FRAD driver is available. With many thanks +to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E & +S508 are supported. 
This driver is currently set up for only FR, but as +Sangoma makes more firmware modules available, it can be updated to provide +them as well. + +Configuration of the FRAD makes use of another net-tools program, 'fradcfg'. +This program makes use of a configuration file (which dlcicfg can also read) +to specify the types of boards to be configured as FRADs, as well as perform +any board specific configuration. The Sangoma module of fradcfg loads the +FR firmware into the card, sets the irq/port/memory information, and provides +an initial configuration. + +Additional FRAD device drivers can be added as hardware is available. + +At this time, the dlcicfg and fradcfg programs have not been incorporated into +the net-tools distribution. They can be found at ftp.invlogic.com, in +/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just +use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for +pre-2.0.4 and later. diff --git a/Documentation/networking/framerelay.txt b/Documentation/networking/framerelay.txt deleted file mode 100644 index 1a0b720440dd..000000000000 --- a/Documentation/networking/framerelay.txt +++ /dev/null @@ -1,39 +0,0 @@ -Frame Relay (FR) support for linux is built into a two tiered system of device -drivers. The upper layer implements RFC1490 FR specification, and uses the -Data Link Connection Identifier (DLCI) as its hardware address. Usually these -are assigned by your network supplier, they give you the number/numbers of -the Virtual Connections (VC) assigned to you. - -Each DLCI is a point-to-point link between your machine and a remote one. -As such, a separate device is needed to accommodate the routing. Within the -net-tools archives is 'dlcicfg'. This program will communicate with the -base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'... 
-The configuration script will ask you how many DLCIs you need, as well as -how many DLCIs you want to assign to each Frame Relay Access Device (FRAD). - -The DLCI uses a number of function calls to communicate with the FRAD, all -of which are stored in the FRAD's private data area. assoc/deassoc, -activate/deactivate and dlci_config. The DLCI supplies a receive function -to the FRAD to accept incoming packets. - -With this initial offering, only 1 FRAD driver is available. With many thanks -to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E & -S508 are supported. This driver is currently set up for only FR, but as -Sangoma makes more firmware modules available, it can be updated to provide -them as well. - -Configuration of the FRAD makes use of another net-tools program, 'fradcfg'. -This program makes use of a configuration file (which dlcicfg can also read) -to specify the types of boards to be configured as FRADs, as well as perform -any board specific configuration. The Sangoma module of fradcfg loads the -FR firmware into the card, sets the irq/port/memory information, and provides -an initial configuration. - -Additional FRAD device drivers can be added as hardware is available. - -At this time, the dlcicfg and fradcfg programs have not been incorporated into -the net-tools distribution. They can be found at ftp.invlogic.com, in -/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just -use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for -pre-2.0.4 and later. - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b2fb8b907d68..4e225f1f7039 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -58,6 +58,7 @@ Contents: fib_trie filter fore200e + framerelay .. 
only:: subproject and html diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig index dbc0e3f7a3e2..3e21726c36e8 100644 --- a/drivers/net/wan/Kconfig +++ b/drivers/net/wan/Kconfig @@ -336,7 +336,7 @@ config DLCI To use frame relay, you need supporting hardware (called FRAD) and certain programs from the net-tools package as explained in - . + . To compile this driver as a module, choose M here: the module will be called dlci. @@ -361,7 +361,7 @@ config SDLA These are multi-protocol cards, but only Frame Relay is supported by the driver at this time. Please read - . + . To compile this driver as a module, choose M here: the module will be called sdla. -- cgit From 16128ad8f927850a1121b7645c6381341d9c0b63 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:39 +0200 Subject: docs: networking: convert generic-hdlc.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/generic-hdlc.rst | 170 ++++++++++++++++++++++++++++++ Documentation/networking/generic-hdlc.txt | 132 ----------------------- Documentation/networking/index.rst | 1 + 3 files changed, 171 insertions(+), 132 deletions(-) create mode 100644 Documentation/networking/generic-hdlc.rst delete mode 100644 Documentation/networking/generic-hdlc.txt diff --git a/Documentation/networking/generic-hdlc.rst b/Documentation/networking/generic-hdlc.rst new file mode 100644 index 000000000000..1c3bb5cb98d4 --- /dev/null +++ b/Documentation/networking/generic-hdlc.rst @@ -0,0 +1,170 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Generic HDLC layer +================== + +Krzysztof Halasa + + +Generic HDLC layer currently supports: + +1. 
Frame Relay (ANSI, CCITT, Cisco and no LMI) + + - Normal (routed) and Ethernet-bridged (Ethernet device emulation) + interfaces can share a single PVC. + - ARP support (no InARP support in the kernel - there is an + experimental InARP user-space daemon available on: + http://www.kernel.org/pub/linux/utils/net/hdlc/). + +2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation +3. Cisco HDLC +4. PPP +5. X.25 (uses X.25 routines). + +Generic HDLC is a protocol driver only - it needs a low-level driver +for your particular hardware. + +Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible +with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). + + +Make sure the hdlc.o and the hardware driver are loaded. It should +create a number of "hdlc" (hdlc0 etc) network devices, one for each +WAN port. You'll need the "sethdlc" utility, get it from: + + http://www.kernel.org/pub/linux/utils/net/hdlc/ + +Compile sethdlc.c utility:: + + gcc -O2 -Wall -o sethdlc sethdlc.c + +Make sure you're using a correct version of sethdlc for your kernel. + +Use sethdlc to set physical interface, clock rate, HDLC mode used, +and add any required PVCs if using Frame Relay. +Usually you want something like:: + + sethdlc hdlc0 clock int rate 128000 + sethdlc hdlc0 cisco interval 10 timeout 25 + +or:: + + sethdlc hdlc0 rs232 clock ext + sethdlc hdlc0 fr lmi ansi + sethdlc hdlc0 create 99 + ifconfig hdlc0 up + ifconfig pvc0 localIP pointopoint remoteIP + +In Frame Relay mode, ifconfig master hdlc device up (without assigning +any IP address to it) before using pvc devices. 
+ + +Setting interface: + +* v35 | rs232 | x21 | t1 | e1 + - sets physical interface for a given port + if the card has software-selectable interfaces + loopback + - activate hardware loopback (for testing only) +* clock ext + - both RX clock and TX clock external +* clock int + - both RX clock and TX clock internal +* clock txint + - RX clock external, TX clock internal +* clock txfromrx + - RX clock external, TX clock derived from RX clock +* rate + - sets clock rate in bps (for "int" or "txint" clock only) + + +Setting protocol: + +* hdlc - sets raw HDLC (IP-only) mode + + nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code + + no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu + + crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity + +* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding + as above. + +* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported) + + interval - time in seconds between keepalive packets + + timeout - time in seconds after last received keepalive packet before + we assume the link is down + +* ppp - sets synchronous PPP mode + +* x25 - sets X.25 mode + +* fr - Frame Relay mode + + lmi ansi / ccitt / cisco / none - LMI (link management) type + + dce - Frame Relay DCE (network) side LMI instead of default DTE (user). + + It has nothing to do with clocks! + + - t391 - link integrity verification polling timer (in seconds) - user + - t392 - polling verification timer (in seconds) - network + - n391 - full status polling counter - user + - n392 - error threshold - both user and network + - n393 - monitored events count - both user and network + +Frame-Relay only: + +* create n | delete n - adds / deletes PVC interface with DLCI #n. + Newly created interface will be named pvc0, pvc1 etc. + +* create ether n | delete ether n - adds a device for Ethernet-bridged + frames. The device will be named pvceth0, pvceth1 etc. 
+ + + + +Board-specific issues +--------------------- + +n2.o and c101.o need parameters to work:: + + insmod n2 hw=io,irq,ram,ports[:io,irq,...] + +example:: + + insmod n2 hw=0x300,10,0xD0000,01 + +or:: + + insmod c101 hw=irq,ram[:irq,...] + +example:: + + insmod c101 hw=9,0xdc000 + +If built into the kernel, these drivers need kernel (command line) parameters:: + + n2.hw=io,irq,ram,ports:... + +or:: + + c101.hw=irq,ram:... + + + +If you have a problem with N2, C101 or PLX200SYN card, you can issue the +"private" command to see port's packet descriptor rings (in kernel logs):: + + sethdlc hdlc0 private + +The hardware driver has to be build with #define DEBUG_RINGS. +Attaching this info to bug reports would be helpful. Anyway, let me know +if you have problems using this. + +For patches and other info look at: +. diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.txt deleted file mode 100644 index 4eb3cc40b702..000000000000 --- a/Documentation/networking/generic-hdlc.txt +++ /dev/null @@ -1,132 +0,0 @@ -Generic HDLC layer -Krzysztof Halasa - - -Generic HDLC layer currently supports: -1. Frame Relay (ANSI, CCITT, Cisco and no LMI) - - Normal (routed) and Ethernet-bridged (Ethernet device emulation) - interfaces can share a single PVC. - - ARP support (no InARP support in the kernel - there is an - experimental InARP user-space daemon available on: - http://www.kernel.org/pub/linux/utils/net/hdlc/). -2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation -3. Cisco HDLC -4. PPP -5. X.25 (uses X.25 routines). - -Generic HDLC is a protocol driver only - it needs a low-level driver -for your particular hardware. - -Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible -with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). - - -Make sure the hdlc.o and the hardware driver are loaded. It should -create a number of "hdlc" (hdlc0 etc) network devices, one for each -WAN port. 
You'll need the "sethdlc" utility, get it from: - http://www.kernel.org/pub/linux/utils/net/hdlc/ - -Compile sethdlc.c utility: - gcc -O2 -Wall -o sethdlc sethdlc.c -Make sure you're using a correct version of sethdlc for your kernel. - -Use sethdlc to set physical interface, clock rate, HDLC mode used, -and add any required PVCs if using Frame Relay. -Usually you want something like: - - sethdlc hdlc0 clock int rate 128000 - sethdlc hdlc0 cisco interval 10 timeout 25 -or - sethdlc hdlc0 rs232 clock ext - sethdlc hdlc0 fr lmi ansi - sethdlc hdlc0 create 99 - ifconfig hdlc0 up - ifconfig pvc0 localIP pointopoint remoteIP - -In Frame Relay mode, ifconfig master hdlc device up (without assigning -any IP address to it) before using pvc devices. - - -Setting interface: - -* v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port - if the card has software-selectable interfaces - loopback - activate hardware loopback (for testing only) -* clock ext - both RX clock and TX clock external -* clock int - both RX clock and TX clock internal -* clock txint - RX clock external, TX clock internal -* clock txfromrx - RX clock external, TX clock derived from RX clock -* rate - sets clock rate in bps (for "int" or "txint" clock only) - - -Setting protocol: - -* hdlc - sets raw HDLC (IP-only) mode - nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code - no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu - crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity - -* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding - as above. 
- -* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported) - interval - time in seconds between keepalive packets - timeout - time in seconds after last received keepalive packet before - we assume the link is down - -* ppp - sets synchronous PPP mode - -* x25 - sets X.25 mode - -* fr - Frame Relay mode - lmi ansi / ccitt / cisco / none - LMI (link management) type - dce - Frame Relay DCE (network) side LMI instead of default DTE (user). - It has nothing to do with clocks! - t391 - link integrity verification polling timer (in seconds) - user - t392 - polling verification timer (in seconds) - network - n391 - full status polling counter - user - n392 - error threshold - both user and network - n393 - monitored events count - both user and network - -Frame-Relay only: -* create n | delete n - adds / deletes PVC interface with DLCI #n. - Newly created interface will be named pvc0, pvc1 etc. - -* create ether n | delete ether n - adds a device for Ethernet-bridged - frames. The device will be named pvceth0, pvceth1 etc. - - - - -Board-specific issues ---------------------- - -n2.o and c101.o need parameters to work: - - insmod n2 hw=io,irq,ram,ports[:io,irq,...] -example: - insmod n2 hw=0x300,10,0xD0000,01 - -or - insmod c101 hw=irq,ram[:irq,...] -example: - insmod c101 hw=9,0xdc000 - -If built into the kernel, these drivers need kernel (command line) parameters: - n2.hw=io,irq,ram,ports:... -or - c101.hw=irq,ram:... - - - -If you have a problem with N2, C101 or PLX200SYN card, you can issue the -"private" command to see port's packet descriptor rings (in kernel logs): - - sethdlc hdlc0 private - -The hardware driver has to be build with #define DEBUG_RINGS. -Attaching this info to bug reports would be helpful. Anyway, let me know -if you have problems using this. - -For patches and other info look at: -. 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4e225f1f7039..d34824b27264 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -59,6 +59,7 @@ Contents: filter fore200e framerelay + generic-hdlc .. only:: subproject and html -- cgit From 110662503de20f21ab22cf409753124d0977a339 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:40 +0200 Subject: docs: networking: convert generic_netlink.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/generic_netlink.rst | 9 +++++++++ Documentation/networking/generic_netlink.txt | 3 --- Documentation/networking/index.rst | 1 + 3 files changed, 10 insertions(+), 3 deletions(-) create mode 100644 Documentation/networking/generic_netlink.rst delete mode 100644 Documentation/networking/generic_netlink.txt diff --git a/Documentation/networking/generic_netlink.rst b/Documentation/networking/generic_netlink.rst new file mode 100644 index 000000000000..59e04ccf80c1 --- /dev/null +++ b/Documentation/networking/generic_netlink.rst @@ -0,0 +1,9 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +=============== +Generic Netlink +=============== + +A wiki document on how to use Generic Netlink can be found here: + + * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt deleted file mode 100644 index 3e071115ca90..000000000000 --- a/Documentation/networking/generic_netlink.txt +++ /dev/null @@ -1,3 +0,0 @@ -A wiki document on how to use Generic Netlink can be found here: - - * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d34824b27264..42e556509e22 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -60,6 +60,7 @@ Contents: fore200e framerelay generic-hdlc + generic_netlink .. only:: subproject and html -- cgit From 8c498935585680284e5f3e5294d3c901b7c89d57 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:41 +0200 Subject: docs: networking: convert gen_stats.txt to ReST - add SPDX header; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/gen_stats.rst | 129 +++++++++++++++++++++++++++++++++ Documentation/networking/gen_stats.txt | 119 ------------------------------ Documentation/networking/index.rst | 1 + net/core/gen_stats.c | 2 +- 4 files changed, 131 insertions(+), 120 deletions(-) create mode 100644 Documentation/networking/gen_stats.rst delete mode 100644 Documentation/networking/gen_stats.txt diff --git a/Documentation/networking/gen_stats.rst b/Documentation/networking/gen_stats.rst new file mode 100644 index 000000000000..595a83b9a61b --- /dev/null +++ b/Documentation/networking/gen_stats.rst @@ -0,0 +1,129 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +Generic networking statistics for netlink users +=============================================== + +Statistic counters are grouped into structs: + +==================== ===================== ===================== +Struct TLV type Description +==================== ===================== ===================== +gnet_stats_basic TCA_STATS_BASIC Basic statistics +gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator +gnet_stats_queue TCA_STATS_QUEUE Queue statistics +none TCA_STATS_APP Application specific +==================== ===================== ===================== + + +Collecting: +----------- + +Declare the statistic structs you need:: + + struct mystruct { + struct gnet_stats_basic bstats; + struct gnet_stats_queue qstats; + ... + }; + +Update statistics, in dequeue() methods only, (while owning qdisc->running):: + + mystruct->tstats.packet++; + mystruct->qstats.backlog += skb->pkt_len; + + +Export to userspace (Dump): +--------------------------- + +:: + + my_dumping_routine(struct sk_buff *skb, ...) 
+ { + struct gnet_dump dump; + + if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + + if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || + gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 || + gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0) + goto rtattr_failure; + + if (gnet_stats_finish_copy(&dump) < 0) + goto rtattr_failure; + ... + } + +TCA_STATS/TCA_XSTATS backward compatibility: +-------------------------------------------- + +Prior users of struct tc_stats and xstats can maintain backward +compatibility by calling the compat wrappers to keep providing the +existing TLV types:: + + my_dumping_routine(struct sk_buff *skb, ...) + { + if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, + TCA_XSTATS, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + ... + } + +A struct tc_stats will be filled out during gnet_stats_copy_* calls +and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app +was called. + + +Locking: +-------- + +Locks are taken before writing and released once all statistics have +been written. Locks are always released in case of an error. You +are responsible for making sure that the lock is initialized. + + +Rate Estimator: +--------------- + +0) Prepare an estimator attribute. Most likely this would be in user + space. The value of this TLV should contain a tc_estimator structure. + As usual, such a TLV needs to be 32 bit aligned and therefore the + length needs to be appropriately set, etc. The estimator interval + and ewma log need to be converted to the appropriate values. + tc_estimator.c::tc_setup_estimator() is advisable to be used as the + conversion routine. It does a few clever things. It takes a time + interval in microsecs, a time constant also in microsecs and a struct + tc_estimator to be populated. The returned tc_estimator can be + transported to the kernel. 
Transfer such a structure in a TLV of type + TCA_RATE to your code in the kernel. + +In the kernel when setting up: + +1) make sure you have basic stats and rate stats setup first. +2) make sure you have initialized stats lock that is used to setup such + stats. +3) Now initialize a new estimator:: + + int ret = gen_new_estimator(my_basicstats,my_rate_est_stats, + mystats_lock, attr_with_tcestimator_struct); + + if ret == 0 + success + else + failed + +From now on, every time you dump my_rate_est_stats it will contain +up-to-date info. + +Once you are done, call gen_kill_estimator(my_basicstats, +my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats +are still valid (i.e still exist) at the time of making this call. + + +Authors: +-------- +- Thomas Graf +- Jamal Hadi Salim diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt deleted file mode 100644 index 179b18ce45ff..000000000000 --- a/Documentation/networking/gen_stats.txt +++ /dev/null @@ -1,119 +0,0 @@ -Generic networking statistics for netlink users -====================================================================== - -Statistic counters are grouped into structs: - -Struct TLV type Description ----------------------------------------------------------------------- -gnet_stats_basic TCA_STATS_BASIC Basic statistics -gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator -gnet_stats_queue TCA_STATS_QUEUE Queue statistics -none TCA_STATS_APP Application specific - - -Collecting: ------------ - -Declare the statistic structs you need: -struct mystruct { - struct gnet_stats_basic bstats; - struct gnet_stats_queue qstats; - ... -}; - -Update statistics, in dequeue() methods only, (while owning qdisc->running) -mystruct->tstats.packet++; -mystruct->qstats.backlog += skb->pkt_len; - - -Export to userspace (Dump): ---------------------------- - -my_dumping_routine(struct sk_buff *skb, ...) 
-{ - struct gnet_dump dump; - - if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, - TCA_PAD) < 0) - goto rtattr_failure; - - if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || - gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 || - gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0) - goto rtattr_failure; - - if (gnet_stats_finish_copy(&dump) < 0) - goto rtattr_failure; - ... -} - -TCA_STATS/TCA_XSTATS backward compatibility: --------------------------------------------- - -Prior users of struct tc_stats and xstats can maintain backward -compatibility by calling the compat wrappers to keep providing the -existing TLV types. - -my_dumping_routine(struct sk_buff *skb, ...) -{ - if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, - TCA_XSTATS, &mystruct->lock, &dump, - TCA_PAD) < 0) - goto rtattr_failure; - ... -} - -A struct tc_stats will be filled out during gnet_stats_copy_* calls -and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app -was called. - - -Locking: --------- - -Locks are taken before writing and released once all statistics have -been written. Locks are always released in case of an error. You -are responsible for making sure that the lock is initialized. - - -Rate Estimator: --------------- - -0) Prepare an estimator attribute. Most likely this would be in user - space. The value of this TLV should contain a tc_estimator structure. - As usual, such a TLV needs to be 32 bit aligned and therefore the - length needs to be appropriately set, etc. The estimator interval - and ewma log need to be converted to the appropriate values. - tc_estimator.c::tc_setup_estimator() is advisable to be used as the - conversion routine. It does a few clever things. It takes a time - interval in microsecs, a time constant also in microsecs and a struct - tc_estimator to be populated. The returned tc_estimator can be - transported to the kernel. 
Transfer such a structure in a TLV of type - TCA_RATE to your code in the kernel. - -In the kernel when setting up: -1) make sure you have basic stats and rate stats setup first. -2) make sure you have initialized stats lock that is used to setup such - stats. -3) Now initialize a new estimator: - - int ret = gen_new_estimator(my_basicstats,my_rate_est_stats, - mystats_lock, attr_with_tcestimator_struct); - - if ret == 0 - success - else - failed - -From now on, every time you dump my_rate_est_stats it will contain -up-to-date info. - -Once you are done, call gen_kill_estimator(my_basicstats, -my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats -are still valid (i.e still exist) at the time of making this call. - - -Authors: --------- -Thomas Graf -Jamal Hadi Salim diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 42e556509e22..33afbb67f3fa 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -61,6 +61,7 @@ Contents: framerelay generic-hdlc generic_netlink + gen_stats .. only:: subproject and html diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c index 1d653fbfcf52..e491b083b348 100644 --- a/net/core/gen_stats.c +++ b/net/core/gen_stats.c @@ -6,7 +6,7 @@ * Jamal Hadi Salim * Alexey Kuznetsov, * - * See Documentation/networking/gen_stats.txt + * See Documentation/networking/gen_stats.rst */ #include -- cgit From 81baecb6f6dc507f1b565e711b5193cdbb3fa939 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:42 +0200 Subject: docs: networking: convert gtp.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/gtp.rst | 251 +++++++++++++++++++++++++++++++++++++ Documentation/networking/gtp.txt | 230 --------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 252 insertions(+), 230 deletions(-) create mode 100644 Documentation/networking/gtp.rst delete mode 100644 Documentation/networking/gtp.txt diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst new file mode 100644 index 000000000000..1563fb94b289 --- /dev/null +++ b/Documentation/networking/gtp.rst @@ -0,0 +1,251 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +The Linux kernel GTP tunneling module +===================================== + +Documentation by + Harald Welte and + Andreas Schultz + +In 'drivers/net/gtp.c' you will find a kernel-level implementation +of a GTP tunnel endpoint. + +What is GTP +=========== + +GTP is the GPRS Tunnelling Protocol, which is a 3GPP protocol used for +tunneling User-IP payload between a mobile station (phone, modem) +and the gateway to an external packet data network (such +as the internet). + +So when you start a 'data connection' from your mobile phone, the +phone will use the control plane to signal for the establishment of +such a tunnel between that external data network and the phone. The +tunnel endpoints thus reside on the phone and in the gateway. All +intermediate nodes just transport the encapsulated packet. + +The phone itself does not implement GTP but uses some other +technology-dependent protocol stack for transmitting the user IP +payload, such as LLC/SNDCP/RLC/MAC. + +At some network element inside the cellular operator infrastructure +(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G +femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking +is translated into GTP *without breaking the end-to-end tunnel*. So +intermediate nodes just perform some specific relay function. 
+ +At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS) +or P-GW (LTE), which terminates the tunnel, decapsulates the packet +and forwards it onto an external packet data network. This can be +public internet, but can also be any private IP network (or even +theoretically some non-IP network like X.25). + +You can find the protocol specification in 3GPP TS 29.060, available +publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm + +A direct PDF link to v13.6.0 is provided for convenience below: +http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf + +The Linux GTP tunnelling module +=============================== + +The module implements the function of a tunnel endpoint, i.e. it is +able to decapsulate tunneled IP packets in the uplink originated by +the phone, and encapsulate raw IP packets received from the external +packet network in downlink towards the phone. + +It *only* implements the so-called 'user plane', carrying the User-IP +payload, called GTP-U. It does not implement the 'control plane', +which is a signaling protocol used for establishment and teardown of +GTP tunnels (GTP-C). + +So in order to have a working GGSN/P-GW setup, you will need a +userspace program that implements the GTP-C protocol and which then +uses the netlink interface provided by the GTP-U module in the kernel +to configure the kernel module. + +This split architecture follows the tunneling modules of other +protocols, e.g. PPPoE or L2TP, where you also run a userspace daemon +to handle the tunnel establishment, authentication etc. and only the +data plane is accelerated inside the kernel. 
+ +Don't be confused by terminology: The GTP User Plane goes through +kernel accelerated path, while the GTP Control Plane goes to +Userspace :) + +The official homepage of the module is at +https://osmocom.org/projects/linux-kernel-gtp-u/wiki + +Userspace Programs with Linux Kernel GTP-U support +================================================== + +At the time of this writing, there are at least two Free Software +implementations that implement GTP-C and can use the netlink interface +to make use of the Linux kernel GTP-U support: + +* OpenGGSN (classic 2G/3G GGSN in C): + https://osmocom.org/projects/openggsn/wiki/OpenGGSN + +* ergw (GGSN + P-GW in Erlang): + https://github.com/travelping/ergw + +Userspace Library / Command Line Utilities +========================================== + +There is a userspace library called 'libgtpnl' which is based on +libmnl and which implements a C-language API towards the netlink +interface provided by the Kernel GTP module: + +http://git.osmocom.org/libgtpnl/ + +Protocol Versions +================= + +There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1 +[3GPP TS 29.281]. Both are implemented in the Kernel GTP module. +Version 0 is a legacy version, and deprecated from recent 3GPP +specifications. + +GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152 +for GTPv1-U and 3386 for GTPv0-U. + +There are three versions of GTP-C: v0, v1, and v2. As the kernel +doesn't implement GTP-C, we don't have to worry about this. It's the +responsibility of the control plane implementation in userspace to +implement that. + +IPv6 +==== + +The 3GPP specifications indicate either IPv4 or IPv6 can be used both +on the inner (user) IP layer, or on the outer (transport) layer. + +Unfortunately, the Kernel module currently supports IPv6 neither for +the User IP payload, nor for the outer IP layer. Patches or other +Contributions to fix this are most welcome! 
+ +Mailing List +============ + +If you have questions regarding how to use the Kernel GTP module from +your own software, or want to contribute to the code, please use the +osmocom-net-gprs mailing list for related discussion. The list can be +reached at osmocom-net-gprs@lists.osmocom.org and the mailman +interface for managing your subscription is at +https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs + +Issue Tracker +============= + +The Osmocom project maintains an issue tracker for the Kernel GTP-U +module at +https://osmocom.org/projects/linux-kernel-gtp-u/issues + +History / Acknowledgements +========================== + +The Module was originally created in 2012 by Harald Welte, but never +completed. Pablo came in to finish the mess Harald left behind. But +due to a lack of user interest, it never got merged. + +In 2015, Andreas Schultz came to the rescue and fixed lots more bugs, +extended it with new features and finally pushed all of us to get it +mainline, where it was merged in 4.7.0. + +Architectural Details +===================== + +Local GTP-U entity and tunnel identification +-------------------------------------------- + +GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152 +for GTPv1-U and 3386 for GTPv0-U. + +There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW +instance) per IP address. Tunnel Endpoint Identifiers (TEIDs) are unique +per GTP-U entity. + +A specific tunnel is only defined by the destination entity. Since the +destination port is constant, only the destination IP and TEID define +a tunnel. The source IP and Port have no meaning for the tunnel. + +Therefore: + + * when sending, the remote entity is defined by the remote IP and + the tunnel endpoint id. The source IP and port have no meaning and + can be changed at any time. + + * when receiving, the local entity is defined by the local + destination IP and the tunnel endpoint id.
The source IP and port + have no meaning and can change at any time. + +[3GPP TS 29.281] Section 4.3.0 defines this so:: + + The TEID in the GTP-U header is used to de-multiplex traffic + incoming from remote tunnel endpoints so that it is delivered to the + User plane entities in a way that allows multiplexing of different + users, different packet protocols and different QoS levels. + Therefore no two remote GTP-U endpoints shall send traffic to a + GTP-U protocol entity using the same TEID value except + for data forwarding as part of mobility procedures. + +The definition above only defines that two remote GTP-U endpoints +*should not* send to the same TEID; it *does not* forbid or exclude +such a scenario. In fact, the mentioned mobility procedures make it +necessary that the GTP-U entity accepts traffic for TEIDs from +multiple or unknown peers. + +Therefore, the receiving side identifies tunnels exclusively based on +TEIDs, not based on the source IP! + +APN vs. Network Device +====================== + +The GTP-U driver creates a Linux network device for each Gi/SGi +interface. + +[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This +may lead to the impression that the GGSN/P-GW can have only one such +interface. + +In fact, the Gi/SGi reference point defines the interworking +between the 3GPP packet domain (PDN) based on GTP-U tunnel and IP +based networks. + +There is no provision in any of the 3GPP documents that limits the +number of Gi/SGi interfaces implemented by a GGSN/P-GW. + +[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a +specific Gi/SGi interface is made through the Access Point Name +(APN):: + + 2. each private network manages its own addressing. In general this + will result in different private networks having overlapping + address ranges. A logically separate connection (e.g. an IP in IP + tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW + and each private network.
+ + In this case the IP address alone is not necessarily unique. The + pair of values, Access Point Name (APN) and IPv4 address and/or + IPv6 prefixes, is unique. + +In order to support the overlapping address range use case, each APN +is mapped to a separate Gi/SGi interface (network device). + +.. note:: + + The Access Point Name is purely a control plane (GTP-C) concept. + At the GTP-U level, only Tunnel Endpoint Identifiers are present in + GTP-U packets and network devices are known + +Therefore for a given UE the mapping in IP to PDN network is: + + * network device + MS IP -> Peer IP + Peer TEID, + +and from PDN to IP network: + + * local GTP-U IP + TEID -> network device + +Furthermore, before a received T-PDU is injected into the network +device the MS IP is checked against the IP recorded in PDP context. diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt deleted file mode 100644 index 6966bbec1ecb..000000000000 --- a/Documentation/networking/gtp.txt +++ /dev/null @@ -1,230 +0,0 @@ -The Linux kernel GTP tunneling module -====================================================================== -Documentation by Harald Welte and - Andreas Schultz - -In 'drivers/net/gtp.c' you are finding a kernel-level implementation -of a GTP tunnel endpoint. - -== What is GTP == - -GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for -tunneling User-IP payload between a mobile station (phone, modem) -and the interconnection between an external packet data network (such -as the internet). - -So when you start a 'data connection' from your mobile phone, the -phone will use the control plane to signal for the establishment of -such a tunnel between that external data network and the phone. The -tunnel endpoints thus reside on the phone and in the gateway. All -intermediate nodes just transport the encapsulated packet. 
- -The phone itself does not implement GTP but uses some other -technology-dependent protocol stack for transmitting the user IP -payload, such as LLC/SNDCP/RLC/MAC. - -At some network element inside the cellular operator infrastructure -(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G -femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking -is translated into GTP *without breaking the end-to-end tunnel*. So -intermediate nodes just perform some specific relay function. - -At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS) -or P-GW (LTE), which terminates the tunnel, decapsulates the packet -and forwards it onto an external packet data network. This can be -public internet, but can also be any private IP network (or even -theoretically some non-IP network like X.25). - -You can find the protocol specification in 3GPP TS 29.060, available -publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm - -A direct PDF link to v13.6.0 is provided for convenience below: -http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf - -== The Linux GTP tunnelling module == - -The module implements the function of a tunnel endpoint, i.e. it is -able to decapsulate tunneled IP packets in the uplink originated by -the phone, and encapsulate raw IP packets received from the external -packet network in downlink towards the phone. - -It *only* implements the so-called 'user plane', carrying the User-IP -payload, called GTP-U. It does not implement the 'control plane', -which is a signaling protocol used for establishment and teardown of -GTP tunnels (GTP-C). - -So in order to have a working GGSN/P-GW setup, you will need a -userspace program that implements the GTP-C protocol and which then -uses the netlink interface provided by the GTP-U module in the kernel -to configure the kernel module. - -This split architecture follows the tunneling modules of other -protocols, e.g. 
PPPoE or L2TP, where you also run a userspace daemon -to handle the tunnel establishment, authentication etc. and only the -data plane is accelerated inside the kernel. - -Don't be confused by terminology: The GTP User Plane goes through -kernel accelerated path, while the GTP Control Plane goes to -Userspace :) - -The official homepage of the module is at -https://osmocom.org/projects/linux-kernel-gtp-u/wiki - -== Userspace Programs with Linux Kernel GTP-U support == - -At the time of this writing, there are at least two Free Software -implementations that implement GTP-C and can use the netlink interface -to make use of the Linux kernel GTP-U support: - -* OpenGGSN (classic 2G/3G GGSN in C): - https://osmocom.org/projects/openggsn/wiki/OpenGGSN - -* ergw (GGSN + P-GW in Erlang): - https://github.com/travelping/ergw - -== Userspace Library / Command Line Utilities == - -There is a userspace library called 'libgtpnl' which is based on -libmnl and which implements a C-language API towards the netlink -interface provided by the Kernel GTP module: - -http://git.osmocom.org/libgtpnl/ - -== Protocol Versions == - -There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1 -[3GPP TS 29.281]. Both are implemented in the Kernel GTP module. -Version 0 is a legacy version, and deprecated from recent 3GPP -specifications. - -GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2151 -for GTPv1-U and 3386 for GTPv0-U. - -There are three versions of GTP-C: v0, v1, and v2. As the kernel -doesn't implement GTP-C, we don't have to worry about this. It's the -responsibility of the control plane implementation in userspace to -implement that. - -== IPv6 == - -The 3GPP specifications indicate either IPv4 or IPv6 can be used both -on the inner (user) IP layer, or on the outer (transport) layer. - -Unfortunately, the Kernel module currently supports IPv6 neither for -the User IP payload, nor for the outer IP layer. 
Patches or other -Contributions to fix this are most welcome! - -== Mailing List == - -If yo have questions regarding how to use the Kernel GTP module from -your own software, or want to contribute to the code, please use the -osmocom-net-grps mailing list for related discussion. The list can be -reached at osmocom-net-gprs@lists.osmocom.org and the mailman -interface for managing your subscription is at -https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs - -== Issue Tracker == - -The Osmocom project maintains an issue tracker for the Kernel GTP-U -module at -https://osmocom.org/projects/linux-kernel-gtp-u/issues - -== History / Acknowledgements == - -The Module was originally created in 2012 by Harald Welte, but never -completed. Pablo came in to finish the mess Harald left behind. But -doe to a lack of user interest, it never got merged. - -In 2015, Andreas Schultz came to the rescue and fixed lots more bugs, -extended it with new features and finally pushed all of us to get it -mainline, where it was merged in 4.7.0. - -== Architectural Details == - -=== Local GTP-U entity and tunnel identification === - -GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152 -for GTPv1-U and 3386 for GTPv0-U. - -There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW -instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique -per GTP-U entity. - -A specific tunnel is only defined by the destination entity. Since the -destination port is constant, only the destination IP and TEID define -a tunnel. The source IP and Port have no meaning for the tunnel. - -Therefore: - - * when sending, the remote entity is defined by the remote IP and - the tunnel endpoint id. The source IP and port have no meaning and - can be changed at any time. - - * when receiving the local entity is defined by the local - destination IP and the tunnel endpoint id. The source IP and port - have no meaning and can change at any time. 
- -[3GPP TS 29.281] Section 4.3.0 defines this so: - -> The TEID in the GTP-U header is used to de-multiplex traffic -> incoming from remote tunnel endpoints so that it is delivered to the -> User plane entities in a way that allows multiplexing of different -> users, different packet protocols and different QoS levels. -> Therefore no two remote GTP-U endpoints shall send traffic to a -> GTP-U protocol entity using the same TEID value except -> for data forwarding as part of mobility procedures. - -The definition above only defines that two remote GTP-U endpoints -*should not* send to the same TEID, it *does not* forbid or exclude -such a scenario. In fact, the mentioned mobility procedures make it -necessary that the GTP-U entity accepts traffic for TEIDs from -multiple or unknown peers. - -Therefore, the receiving side identifies tunnels exclusively based on -TEIDs, not based on the source IP! - -== APN vs. Network Device == - -The GTP-U driver creates a Linux network device for each Gi/SGi -interface. - -[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This -may lead to the impression that the GGSN/P-GW can have only one such -interface. - -Correct is that the Gi/SGi reference point defines the interworking -between +the 3GPP packet domain (PDN) based on GTP-U tunnel and IP -based networks. - -There is no provision in any of the 3GPP documents that limits the -number of Gi/SGi interfaces implemented by a GGSN/P-GW. - -[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a -specific Gi/SGi interfaces is made through the Access Point Name -(APN): - -> 2. each private network manages its own addressing. In general this -> will result in different private networks having overlapping -> address ranges. A logically separate connection (e.g. an IP in IP -> tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW -> and each private network. -> -> In this case the IP address alone is not necessarily unique. 
The -> pair of values, Access Point Name (APN) and IPv4 address and/or -> IPv6 prefixes, is unique. - -In order to support the overlapping address range use case, each APN -is mapped to a separate Gi/SGi interface (network device). - -NOTE: The Access Point Name is purely a control plane (GTP-C) concept. -At the GTP-U level, only Tunnel Endpoint Identifiers are present in -GTP-U packets and network devices are known - -Therefore for a given UE the mapping in IP to PDN network is: - * network device + MS IP -> Peer IP + Peer TEID, - -and from PDN to IP network: - * local GTP-U IP + TEID -> network device - -Furthermore, before a received T-PDU is injected into the network -device the MS IP is checked against the IP recorded in PDP context. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 33afbb67f3fa..b29a08d1f941 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -62,6 +62,7 @@ Contents: generic-hdlc generic_netlink gen_stats + gtp .. only:: subproject and html -- cgit From 3c3a2fde4d88bb3d6c0592b4b7754f26dab9f697 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:43 +0200 Subject: docs: networking: convert hinic.txt to ReST Not much to be done here: - add SPDX header; - adjust titles and chapters, adding proper markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/hinic.rst | 128 +++++++++++++++++++++++++++++++++++++ Documentation/networking/hinic.txt | 125 ------------------------------------ Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 130 insertions(+), 126 deletions(-) create mode 100644 Documentation/networking/hinic.rst delete mode 100644 Documentation/networking/hinic.txt diff --git a/Documentation/networking/hinic.rst b/Documentation/networking/hinic.rst new file mode 100644 index 000000000000..867ac8f4e04a --- /dev/null +++ b/Documentation/networking/hinic.rst @@ -0,0 +1,128 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family +============================================================ + +Overview: +========= +HiNIC is a network interface card for the Data Center Area. + +The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). +The driver supports also a negotiated and extendable feature set. + +Some HiNIC devices support SR-IOV. This driver is used for Physical Function +(PF). + +HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and +adaptive interrupt moderation. + +HiNIC devices support also various offload features such as checksum offload, +TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and +LRO(Large Receive Offload). + + +Supported PCI vendor ID/device IDs: +=================================== + +19e5:1822 - HiNIC PF + + +Driver Architecture and Source Code: +==================================== + +hinic_dev - Implement a Logical Network device that is independent from +specific HW details about HW data structure formats. + +hinic_hwdev - Implement the HW details of the device and include the components +for accessing the PCI NIC. 
+ +hinic_hwdev contains the following components: +=============================================== + +HW Interface: +============= + +The interface for accessing the pci device (DMA memory and PCI BARs). +(hinic_hw_if.c, hinic_hw_if.h) + +Configuration Status Registers Area that describes the HW Registers on the +configuration and status BAR0. (hinic_hw_csr.h) + +MGMT components: +================ + +Asynchronous Event Queues(AEQs) - The event queues for receiving messages from +the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Application Programmable Interface commands(API CMD) - Interface for sending +MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) + +Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT +commands to the card and receives notifications from the MGMT modules on the +card by AEQs. Also set the addresses of the IO CMDQs in HW. +(hinic_hw_mgmt.c, hinic_hw_mgmt.h) + +IO components: +============== + +Completion Event Queues(CEQs) - The completion Event Queues that describe IO +tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Work Queues(WQ) - Contain the memory and operations for use by CMD queues and +the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains +pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). +(hinic_hw_wq.c, hinic_hw_wq.h) + +Command Queues(CMDQ) - The queues for sending commands for IO management and is +used to set the QPs addresses in HW. The commands completion events are +accumulated on the CEQ that is configured to receive the CMDQ completion events. +(hinic_hw_cmdq.c, hinic_hw_cmdq.h) + +Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting +Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) + +IO - de/constructs all the IO components. 
(hinic_hw_io.c, hinic_hw_io.h) + +HW device: +========== + +HW device - de/constructs the HW Interface, the MGMT components on the +initialization of the driver and the IO components on the case of Interface +UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) + + +hinic_dev contains the following components: +=============================================== + +PCI ID table - Contains the supported PCI Vendor/Device IDs. +(hinic_pci_tbl.h) + +Port Commands - Send commands to the HW device for port management +(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) + +Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. +The Logical Tx queue is not dependent on the format of the HW Send Queue. +(hinic_tx.c, hinic_tx.h) + +Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. +The Logical Rx queue is not dependent on the format of the HW Receive Queue. +(hinic_rx.c, hinic_rx.h) + +hinic_dev - de/constructs the Logical Tx and Rx Queues. +(hinic_main.c, hinic_dev.h) + + +Miscellaneous +============= + +Common functions that are used by HW and Logical Device. +(hinic_common.c, hinic_common.h) + + +Support +======= + +If an issue is identified with the released source code on the supported kernel +with a supported adapter, email the specific information related to the issue to +aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt deleted file mode 100644 index 989366a4039c..000000000000 --- a/Documentation/networking/hinic.txt +++ /dev/null @@ -1,125 +0,0 @@ -Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family -============================================================ - -Overview: -========= -HiNIC is a network interface card for the Data Center Area. - -The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). -The driver supports also a negotiated and extendable feature set. - -Some HiNIC devices support SR-IOV. 
This driver is used for Physical Function -(PF). - -HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and -adaptive interrupt moderation. - -HiNIC devices support also various offload features such as checksum offload, -TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and -LRO(Large Receive Offload). - - -Supported PCI vendor ID/device IDs: -=================================== - -19e5:1822 - HiNIC PF - - -Driver Architecture and Source Code: -==================================== - -hinic_dev - Implement a Logical Network device that is independent from -specific HW details about HW data structure formats. - -hinic_hwdev - Implement the HW details of the device and include the components -for accessing the PCI NIC. - -hinic_hwdev contains the following components: -=============================================== - -HW Interface: -============= - -The interface for accessing the pci device (DMA memory and PCI BARs). -(hinic_hw_if.c, hinic_hw_if.h) - -Configuration Status Registers Area that describes the HW Registers on the -configuration and status BAR0. (hinic_hw_csr.h) - -MGMT components: -================ - -Asynchronous Event Queues(AEQs) - The event queues for receiving messages from -the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) - -Application Programmable Interface commands(API CMD) - Interface for sending -MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) - -Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT -commands to the card and receives notifications from the MGMT modules on the -card by AEQs. Also set the addresses of the IO CMDQs in HW. -(hinic_hw_mgmt.c, hinic_hw_mgmt.h) - -IO components: -============== - -Completion Event Queues(CEQs) - The completion Event Queues that describe IO -tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) - -Work Queues(WQ) - Contain the memory and operations for use by CMD queues and -the Queue Pairs. 
The WQ is a Memory Block in a Page. The Block contains -pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). -(hinic_hw_wq.c, hinic_hw_wq.h) - -Command Queues(CMDQ) - The queues for sending commands for IO management and is -used to set the QPs addresses in HW. The commands completion events are -accumulated on the CEQ that is configured to receive the CMDQ completion events. -(hinic_hw_cmdq.c, hinic_hw_cmdq.h) - -Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting -Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) - -IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) - -HW device: -========== - -HW device - de/constructs the HW Interface, the MGMT components on the -initialization of the driver and the IO components on the case of Interface -UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) - - -hinic_dev contains the following components: -=============================================== - -PCI ID table - Contains the supported PCI Vendor/Device IDs. -(hinic_pci_tbl.h) - -Port Commands - Send commands to the HW device for port management -(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) - -Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. -The Logical Tx queue is not dependent on the format of the HW Send Queue. -(hinic_tx.c, hinic_tx.h) - -Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. -The Logical Rx queue is not dependent on the format of the HW Receive Queue. -(hinic_rx.c, hinic_rx.h) - -hinic_dev - de/constructs the Logical Tx and Rx Queues. -(hinic_main.c, hinic_dev.h) - - -Miscellaneous: -============= - -Common functions that are used by HW and Logical Device. -(hinic_common.c, hinic_common.h) - - -Support -======= - -If an issue is identified with the released source code on the supported kernel -with a supported adapter, email the specific information related to the issue to -aviad.krawczyk@huawei.com. 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b29a08d1f941..5a7889df1375 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -63,6 +63,7 @@ Contents: generic_netlink gen_stats gtp + hinic .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 4ec6d2741d36..df5e4ccc1ccb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7815,7 +7815,7 @@ HUAWEI ETHERNET DRIVER M: Aviad Krawczyk L: netdev@vger.kernel.org S: Supported -F: Documentation/networking/hinic.txt +F: Documentation/networking/hinic.rst F: drivers/net/ethernet/huawei/hinic/ HUGETLB FILESYSTEM -- cgit From 1d2698fa05f57ba2900e1ff50ac33ec85d2087d3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:44 +0200 Subject: docs: networking: convert ila.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/ila.rst | 296 +++++++++++++++++++++++++++++++++++++ Documentation/networking/ila.txt | 285 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 297 insertions(+), 285 deletions(-) create mode 100644 Documentation/networking/ila.rst delete mode 100644 Documentation/networking/ila.txt diff --git a/Documentation/networking/ila.rst b/Documentation/networking/ila.rst new file mode 100644 index 000000000000..5ac0a6270b17 --- /dev/null +++ b/Documentation/networking/ila.rst @@ -0,0 +1,296 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Identifier Locator Addressing (ILA) +=================================== + + +Introduction +============ + +Identifier-locator addressing (ILA) is a technique used with IPv6 that +differentiates between location and identity of a network node. 
Part of an +address expresses the immutable identity of the node, and another part +indicates the location of the node which can be dynamic. Identifier-locator +addressing can be used to efficiently implement overlay networks for +network virtualization as well as solutions for use cases in mobility. + +ILA can be thought of as means to implement an overlay network without +encapsulation. This is accomplished by performing network address +translation on destination addresses as a packet traverses a network. To +the network, an ILA translated packet appears to be no different than any +other IPv6 packet. For instance, if the transport protocol is TCP then an +ILA translated packet looks like just another TCP/IPv6 packet. The +advantage of this is that ILA is transparent to the network so that +optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work. + +The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila. + + +ILA terminology +=============== + + - Identifier + A number that identifies an addressable node in the network + independent of its location. ILA identifiers are sixty-four + bit values. + + - Locator + A network prefix that routes to a physical host. Locators + provide the topological location of an addressed node. ILA + locators are sixty-four bit prefixes. + + - ILA mapping + A mapping of an ILA identifier to a locator (or to a + locator and meta data). An ILA domain maintains a database + that contains mappings for all destinations in the domain. + + - SIR address + An IPv6 address composed of a SIR prefix (upper sixty- + four bits) and an identifier (lower sixty-four bits). + SIR addresses are visible to applications and provide a + means for them to address nodes independent of their + location. + + - ILA address + An IPv6 address composed of a locator (upper sixty-four + bits) and an identifier (low order sixty-four bits). ILA + addresses are never visible to an application. 
+ + - ILA host + An end host that is capable of performing ILA translations + on transmit or receive. + + - ILA router + A network node that performs ILA translation and forwarding + of translated packets. + + - ILA forwarding cache + A type of ILA router that only maintains a working set + cache of mappings. + + - ILA node + A network node capable of performing ILA translations. This + can be an ILA router, ILA forwarding cache, or ILA host. + + +Operation +========= + +There are two fundamental operations with ILA: + + - Translate a SIR address to an ILA address. This is performed on ingress + to an ILA overlay. + + - Translate an ILA address to a SIR address. This is performed on egress + from the ILA overlay. + +ILA can be deployed either on end hosts or intermediate devices in the +network; these are provided by "ILA hosts" and "ILA routers" respectively. +Configuration and datapath for these two points of deployment is somewhat +different. + +The diagram below illustrates the flow of packets through ILA as well +as showing ILA hosts and routers:: + + +--------+ +--------+ + | Host A +-+ +--->| Host B | + | | | (2) ILA (') | | + +--------+ | ...addressed.... ( ) +--------+ + V +---+--+ . packet . +---+--+ (_) + (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR + addressed +->|router| . . |router|->-+ addressed + packet +---+--+ . IPv6 . +---+--+ packet + / . Network . + / . . +--+-++--------+ + +--------+ / . . |ILA || Host | + | Host +--+ . .- -|host|| | + | | . . +--+-++--------+ + +--------+ ................ + + +Transport checksum handling +=========================== + +When an address is translated by ILA, an encapsulated transport checksum +that includes the translated address in a pseudo header may be rendered +incorrect on the wire. This is a problem for intermediate devices, +including checksum offload in NICs, that process the checksum. There are +three options to deal with this: + +- no action Allow the checksum to be incorrect on the wire. 
Before + a receiver verifies a checksum the ILA to SIR address + translation must be done. + +- adjust transport checksum + When ILA translation is performed the packet is parsed + and if a transport layer checksum is found then it is + adjusted to reflect the correct checksum per the + translated address. + +- checksum neutral mapping + When an address is translated the difference can be offset + elsewhere in a part of the packet that is covered by + the checksum. The low order sixteen bits of the identifier + are used. This method is preferred since it doesn't require + parsing a packet beyond the IP header and in most cases the + adjustment can be precomputed and saved with the mapping. + +Note that the checksum neutral adjustment affects the low order sixteen +bits of the identifier. When ILA to SIR address translation is done on +egress the low order bits are restored to the original value which +restores the identifier as it was originally sent. + + +Identifier types +================ + +ILA defines different types of identifiers for different use cases. + +The defined types are: + + 0: interface identifier + + 1: locally unique identifier + + 2: virtual networking identifier for IPv4 address + + 3: virtual networking identifier for IPv6 unicast address + + 4: virtual networking identifier for IPv6 multicast address + + 5: non-local address identifier + +In the current implementation of kernel ILA only locally unique identifiers +(LUID) are supported. LUID allows for a generic, unformatted 64 bit +identifier. + + +Identifier formats +================== + +Kernel ILA supports two optional fields in an identifier for formatting: +"C-bit" and "identifier type". The presence of these fields is determined +by configuration as demonstrated below. + +If the identifier type is present it occupies the three highest order +bits of an identifier. The possible values are given in the above list. 
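The identifier type field and the checksum-neutral mapping described above can be sketched in a few lines. This is an illustrative model only, not the kernel's implementation: the helper names and the example addresses are assumptions, the 64-bit halves of an address are handled as plain integers, and C-bit handling is left out. The adjustment is computed with ones' complement arithmetic over 16-bit words, the same arithmetic the Internet checksum uses.

```python
def words16(value: int, nbits: int = 64):
    """Split an integer into big-endian 16-bit words."""
    return [(value >> shift) & 0xFFFF for shift in range(nbits - 16, -1, -16)]


def oc_add(a: int, b: int) -> int:
    """Ones' complement addition of two 16-bit values (end-around carry)."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def oc_sum(words) -> int:
    """Ones' complement sum of a sequence of 16-bit words."""
    total = 0
    for w in words:
        total = oc_add(total, w)
    return total


def ident_type(identifier: int) -> int:
    """Identifier type: the three highest-order bits of the 64-bit identifier."""
    return identifier >> 61


def sir_to_ila(sir_prefix: int, locator: int, identifier: int):
    """Replace the SIR prefix with a locator and fold the checksum difference
    into the low-order sixteen bits of the identifier (hypothetical helper)."""
    # delta = sum(SIR prefix) - sum(locator), in ones' complement arithmetic
    delta = oc_add(oc_sum(words16(sir_prefix)), ~oc_sum(words16(locator)) & 0xFFFF)
    low = oc_add(identifier & 0xFFFF, delta)
    return locator, (identifier & 0xFFFFFFFFFFFF0000) | low
```

With these definitions, the ones' complement sum over the translated (ILA) address equals the sum over the original (SIR) address, so a transport checksum computed against the SIR address remains valid on the wire. The reverse translation subtracts the same difference to restore the low-order bits.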
+ +If the C-bit is present, this is used as an indication that checksum +neutral mapping has been done. The C-bit can only be set in an +ILA address, never a SIR address. + +In the simplest format the identifier types, C-bit, and checksum +adjustment value are not present so an identifier is considered an +unstructured sixty-four bit value:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + + + + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The checksum neutral adjustment may be configured to always be +present using neutral-map-auto. In this case there is no C-bit, but the +checksum adjustment is in the low order 16 bits. The identifier is +still sixty-four bits:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The C-bit may used to explicitly indicate that checksum neutral +mapping has been applied to an ILA address. The format is:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | |C| Identifier | + | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The identifier type field may be present to indicate the identifier +type. If it is not present then the type is inferred based on mapping +configuration. 
The checksum neutral adjustment may automatically +used with the identifier type as illustrated below:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type| Identifier | + +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +If the identifier type and the C-bit can be present simultaneously so +the identifier format would be:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type|C| Identifier | + +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + +Configuration +============= + +There are two methods to configure ILA mappings. One is by using LWT routes +and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat +is intended to be used in the receive path for ILA hosts . + +An ILA router has also been implemented in XDP. Description of that is +outside the scope of this document. + +The usage of for ILA LWT routes is: + +ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR + +Destination (DEST) can either be a SIR address (for an ILA host or ingress +ILA router) or an ILA address (egress ILA router). LOC is the sixty-four +bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four +bits of the destination address. Checksum MODE is one of "no-action", +"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is +set then the C-bit will be present. Identifier TYPE one of "luid" or +"use-format." In the case of use-format, the identifier type field is +present and the effective type is taken from that. + +The usage of ila_xlat is: + +ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE + +MATCH indicates the incoming locator that must be matched to apply +a the translaiton. 
LOC is the locator that overwrites the upper +sixty-four bits of the destination address. MODE and TYPE have the +same meanings as described above. + + +Some examples +============= + +:: + + # Configure an ILA route that uses checksum neutral mapping as well + # as type field. Note that the type field is set in the SIR address + # (the 2000 implies type is 1 which is LUID). + ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \ + csum-mode neutral-map ident-type use-format + + # Configure an ILA LWT route that uses auto checksum neutral mapping + # (no C-bit) and configure identifier type to be LUID so that the + # identifier type field will not be present. + ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \ + csum-mode neutral-map-auto ident-type luid + + ila_xlat configuration + + # Configure an ILA to SIR mapping that matches a locator and overwrites + # it with a SIR address (3333:0:0:1 in this example). The C-bit and + # identifier field are used. + ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format + + # Configure an ILA to SIR mapping where checksum neutral is automatically + # set without the C-bit and the identifier type is configured to be LUID + # so that the identifier type field is not present. + ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt deleted file mode 100644 index a17dac9dc915..000000000000 --- a/Documentation/networking/ila.txt +++ /dev/null @@ -1,285 +0,0 @@ -Identifier Locator Addressing (ILA) - - -Introduction -============ - -Identifier-locator addressing (ILA) is a technique used with IPv6 that -differentiates between location and identity of a network node. Part of an -address expresses the immutable identity of the node, and another part -indicates the location of the node which can be dynamic. 
Identifier-locator -addressing can be used to efficiently implement overlay networks for -network virtualization as well as solutions for use cases in mobility. - -ILA can be thought of as means to implement an overlay network without -encapsulation. This is accomplished by performing network address -translation on destination addresses as a packet traverses a network. To -the network, an ILA translated packet appears to be no different than any -other IPv6 packet. For instance, if the transport protocol is TCP then an -ILA translated packet looks like just another TCP/IPv6 packet. The -advantage of this is that ILA is transparent to the network so that -optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work. - -The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila. - - -ILA terminology -=============== - - - Identifier A number that identifies an addressable node in the network - independent of its location. ILA identifiers are sixty-four - bit values. - - - Locator A network prefix that routes to a physical host. Locators - provide the topological location of an addressed node. ILA - locators are sixty-four bit prefixes. - - - ILA mapping - A mapping of an ILA identifier to a locator (or to a - locator and meta data). An ILA domain maintains a database - that contains mappings for all destinations in the domain. - - - SIR address - An IPv6 address composed of a SIR prefix (upper sixty- - four bits) and an identifier (lower sixty-four bits). - SIR addresses are visible to applications and provide a - means for them to address nodes independent of their - location. - - - ILA address - An IPv6 address composed of a locator (upper sixty-four - bits) and an identifier (low order sixty-four bits). ILA - addresses are never visible to an application. - - - ILA host An end host that is capable of performing ILA translations - on transmit or receive. 
- - - ILA router A network node that performs ILA translation and forwarding - of translated packets. - - - ILA forwarding cache - A type of ILA router that only maintains a working set - cache of mappings. - - - ILA node A network node capable of performing ILA translations. This - can be an ILA router, ILA forwarding cache, or ILA host. - - -Operation -========= - -There are two fundamental operations with ILA: - - - Translate a SIR address to an ILA address. This is performed on ingress - to an ILA overlay. - - - Translate an ILA address to a SIR address. This is performed on egress - from the ILA overlay. - -ILA can be deployed either on end hosts or intermediate devices in the -network; these are provided by "ILA hosts" and "ILA routers" respectively. -Configuration and datapath for these two points of deployment is somewhat -different. - -The diagram below illustrates the flow of packets through ILA as well -as showing ILA hosts and routers. - - +--------+ +--------+ - | Host A +-+ +--->| Host B | - | | | (2) ILA (') | | - +--------+ | ...addressed.... ( ) +--------+ - V +---+--+ . packet . +---+--+ (_) - (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR - addressed +->|router| . . |router|->-+ addressed - packet +---+--+ . IPv6 . +---+--+ packet - / . Network . - / . . +--+-++--------+ - +--------+ / . . |ILA || Host | - | Host +--+ . .- -|host|| | - | | . . +--+-++--------+ - +--------+ ................ - - -Transport checksum handling -=========================== - -When an address is translated by ILA, an encapsulated transport checksum -that includes the translated address in a pseudo header may be rendered -incorrect on the wire. This is a problem for intermediate devices, -including checksum offload in NICs, that process the checksum. There are -three options to deal with this: - -- no action Allow the checksum to be incorrect on the wire. Before - a receiver verifies a checksum the ILA to SIR address - translation must be done. 
- -- adjust transport checksum - When ILA translation is performed the packet is parsed - and if a transport layer checksum is found then it is - adjusted to reflect the correct checksum per the - translated address. - -- checksum neutral mapping - When an address is translated the difference can be offset - elsewhere in a part of the packet that is covered by - the checksum. The low order sixteen bits of the identifier - are used. This method is preferred since it doesn't require - parsing a packet beyond the IP header and in most cases the - adjustment can be precomputed and saved with the mapping. - -Note that the checksum neutral adjustment affects the low order sixteen -bits of the identifier. When ILA to SIR address translation is done on -egress the low order bits are restored to the original value which -restores the identifier as it was originally sent. - - -Identifier types -================ - -ILA defines different types of identifiers for different use cases. - -The defined types are: - - 0: interface identifier - - 1: locally unique identifier - - 2: virtual networking identifier for IPv4 address - - 3: virtual networking identifier for IPv6 unicast address - - 4: virtual networking identifier for IPv6 multicast address - - 5: non-local address identifier - -In the current implementation of kernel ILA only locally unique identifiers -(LUID) are supported. LUID allows for a generic, unformatted 64 bit -identifier. - - -Identifier formats -================== - -Kernel ILA supports two optional fields in an identifier for formatting: -"C-bit" and "identifier type". The presence of these fields is determined -by configuration as demonstrated below. - -If the identifier type is present it occupies the three highest order -bits of an identifier. The possible values are given in the above list. - -If the C-bit is present, this is used as an indication that checksum -neutral mapping has been done. 
The C-bit can only be set in an -ILA address, never a SIR address. - -In the simplest format the identifier types, C-bit, and checksum -adjustment value are not present so an identifier is considered an -unstructured sixty-four bit value. - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Identifier | - + + - | | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The checksum neutral adjustment may be configured to always be -present using neutral-map-auto. In this case there is no C-bit, but the -checksum adjustment is in the low order 16 bits. The identifier is -still sixty-four bits. - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Identifier | - | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The C-bit may used to explicitly indicate that checksum neutral -mapping has been applied to an ILA address. The format is: - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | |C| Identifier | - | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The identifier type field may be present to indicate the identifier -type. If it is not present then the type is inferred based on mapping -configuration. The checksum neutral adjustment may automatically -used with the identifier type as illustrated below. 
- - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Type| Identifier | - +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -If the identifier type and the C-bit can be present simultaneously so -the identifier format would be: - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Type|C| Identifier | - +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - - -Configuration -============= - -There are two methods to configure ILA mappings. One is by using LWT routes -and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat -is intended to be used in the receive path for ILA hosts . - -An ILA router has also been implemented in XDP. Description of that is -outside the scope of this document. - -The usage of for ILA LWT routes is: - -ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR - -Destination (DEST) can either be a SIR address (for an ILA host or ingress -ILA router) or an ILA address (egress ILA router). LOC is the sixty-four -bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four -bits of the destination address. Checksum MODE is one of "no-action", -"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is -set then the C-bit will be present. Identifier TYPE one of "luid" or -"use-format." In the case of use-format, the identifier type field is -present and the effective type is taken from that. - -The usage of ila_xlat is: - -ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE - -MATCH indicates the incoming locator that must be matched to apply -a the translaiton. LOC is the locator that overwrites the upper -sixty-four bits of the destination address. MODE and TYPE have the -same meanings as described above. 
- - -Some examples -============= - -# Configure an ILA route that uses checksum neutral mapping as well -# as type field. Note that the type field is set in the SIR address -# (the 2000 implies type is 1 which is LUID). -ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \ - csum-mode neutral-map ident-type use-format - -# Configure an ILA LWT route that uses auto checksum neutral mapping -# (no C-bit) and configure identifier type to be LUID so that the -# identifier type field will not be present. -ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \ - csum-mode neutral-map-auto ident-type luid - -ila_xlat configuration - -# Configure an ILA to SIR mapping that matches a locator and overwrites -# it with a SIR address (3333:0:0:1 in this example). The C-bit and -# identifier field are used. -ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ - csum-mode neutral-map-auto ident-type use-format - -# Configure an ILA to SIR mapping where checksum neutral is automatically -# set without the C-bit and the identifier type is configured to be LUID -# so that the identifier type field is not present. -ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ - csum-mode neutral-map-auto ident-type use-format diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5a7889df1375..488971f6b650 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -64,6 +64,7 @@ Contents: gen_stats gtp hinic + ila .. only:: subproject and html -- cgit From 7cdb25400f7e8624414260d1b0fa70da280b2303 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:45 +0200 Subject: docs: networking: convert ipddp.txt to ReST Not much to be done here: - add SPDX header; - use a document title from existing text; - adjust a chapter markup; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipddp.rst | 78 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/ipddp.txt | 73 ----------------------------------- Documentation/networking/ltpc.txt | 2 +- drivers/net/appletalk/Kconfig | 4 +- 5 files changed, 82 insertions(+), 76 deletions(-) create mode 100644 Documentation/networking/ipddp.rst delete mode 100644 Documentation/networking/ipddp.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 488971f6b650..cf85d0a73144 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -65,6 +65,7 @@ Contents: gtp hinic ila + ipddp .. only:: subproject and html diff --git a/Documentation/networking/ipddp.rst b/Documentation/networking/ipddp.rst new file mode 100644 index 000000000000..be7091b77927 --- /dev/null +++ b/Documentation/networking/ipddp.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================================= +AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation +========================================================= + +Documentation ipddp.c + +This file is written by Jay Schulist + +Introduction +------------ + +AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk +networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams +inside AppleTalk packets. + +Through this driver you can either allow your Linux box to communicate +IP over an AppleTalk network or you can provide IP gatewaying functions +for your AppleTalk users. + +You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk, +EtherTalk and PPPTalk. The only limit on the protocol is that of what +kernel AppleTalk layer and drivers are available. + +Each mode requires its own user space software. 
+ +Compiling AppleTalk-IP Decapsulation/Encapsulation +================================================== + +AppleTalk-IP decapsulation needs to be compiled into your kernel. You +will need to turn on AppleTalk-IP driver support. Then you will need to +select ONE of the two options; IP to AppleTalk-IP encapsulation support or +AppleTalk-IP to IP decapsulation support. If you compile the driver +statically you will only be able to use the driver for the function you have +enabled in the kernel. If you compile the driver as a module you can +select what mode you want it to run in via a module loading param. +ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for +AppleTalk-IP to IP decapsulation. + +Basic instructions for user space tools +======================================= + +I will briefly describe the operation of the tools, but you will +need to consult the supporting documentation for each set of tools. + +Decapsulation - You will need to download a software package called +MacGate. In this distribution there will be a tool called MacRoute +which enables you to add routes to the kernel for your Macs by hand. +Also the tool MacRegGateWay is included to register the +proper IP Gateway and IP addresses for your machine. Included in this +distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from +ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional +but it allows automatic adding and deleting of routes for Macs. (Handy +for locations with large Mac installations) + +Encapsulation - You will need to download a software daemon called ipddpd. +This software expects there to be an AppleTalk-IP gateway on the network. +You will also need to add the proper routes to route your Linux box's IP +traffic out the ipddp interface. 
+ +Common Uses of ipddp.c +---------------------- +Of course AppleTalk-IP decapsulation and encapsulation, but specifically +decapsulation is being used most for connecting LocalTalk networks to +IP networks. Although it has been used on EtherTalk networks to allow +Macs that are only able to tunnel IP over EtherTalk. + +Encapsulation has been used to allow a Linux box stuck on a LocalTalk +network to use IP. It should work equally well if you are stuck on an +EtherTalk only network. + +Further Assistance +------------------- +You can contact me (Jay Schulist ) with any +questions regarding decapsulation or encapsulation. Bradford W. Johnson + originally wrote the ipddp.c driver for IP +encapsulation in AppleTalk. diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.txt deleted file mode 100644 index ba5c217fffe0..000000000000 --- a/Documentation/networking/ipddp.txt +++ /dev/null @@ -1,73 +0,0 @@ -Text file for ipddp.c: - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation - -This text file is written by Jay Schulist - -Introduction ------------- - -AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk -networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams -inside AppleTalk packets. - -Through this driver you can either allow your Linux box to communicate -IP over an AppleTalk network or you can provide IP gatewaying functions -for your AppleTalk users. - -You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk, -EtherTalk and PPPTalk. The only limit on the protocol is that of what -kernel AppleTalk layer and drivers are available. - -Each mode requires its own user space software. - -Compiling AppleTalk-IP Decapsulation/Encapsulation -================================================= - -AppleTalk-IP decapsulation needs to be compiled into your kernel. You -will need to turn on AppleTalk-IP driver support. 
Then you will need to -select ONE of the two options; IP to AppleTalk-IP encapsulation support or -AppleTalk-IP to IP decapsulation support. If you compile the driver -statically you will only be able to use the driver for the function you have -enabled in the kernel. If you compile the driver as a module you can -select what mode you want it to run in via a module loading param. -ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for -AppleTalk-IP to IP decapsulation. - -Basic instructions for user space tools -======================================= - -I will briefly describe the operation of the tools, but you will -need to consult the supporting documentation for each set of tools. - -Decapsulation - You will need to download a software package called -MacGate. In this distribution there will be a tool called MacRoute -which enables you to add routes to the kernel for your Macs by hand. -Also the tool MacRegGateWay is included to register the -proper IP Gateway and IP addresses for your machine. Included in this -distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from -ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional -but it allows automatic adding and deleting of routes for Macs. (Handy -for locations with large Mac installations) - -Encapsulation - You will need to download a software daemon called ipddpd. -This software expects there to be an AppleTalk-IP gateway on the network. -You will also need to add the proper routes to route your Linux box's IP -traffic out the ipddp interface. - -Common Uses of ipddp.c ----------------------- -Of course AppleTalk-IP decapsulation and encapsulation, but specifically -decapsulation is being used most for connecting LocalTalk networks to -IP networks. Although it has been used on EtherTalk networks to allow -Macs that are only able to tunnel IP over EtherTalk. - -Encapsulation has been used to allow a Linux box stuck on a LocalTalk -network to use IP. 
It should work equally well if you are stuck on an -EtherTalk only network. - -Further Assistance -------------------- -You can contact me (Jay Schulist ) with any -questions regarding decapsulation or encapsulation. Bradford W. Johnson - originally wrote the ipddp.c driver for IP -encapsulation in AppleTalk. diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt index 0bf3220c715b..a005a73b76d0 100644 --- a/Documentation/networking/ltpc.txt +++ b/Documentation/networking/ltpc.txt @@ -99,7 +99,7 @@ treat the LocalTalk device like an ordinary Ethernet device, even if that's what it looks like to Netatalk. Instead, you follow the same procedure as for doing IP in EtherTalk. -See Documentation/networking/ipddp.txt for more information about the +See Documentation/networking/ipddp.rst for more information about the kernel driver and userspace tools needed. -------------------------------------- diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig index d4e51c048f62..ccde6479050c 100644 --- a/drivers/net/appletalk/Kconfig +++ b/drivers/net/appletalk/Kconfig @@ -86,7 +86,7 @@ config IPDDP box is stuck on an AppleTalk only network) or decapsulate (e.g. if you want your Linux box to act as an Internet gateway for a zoo of AppleTalk connected Macs). Please see the file - for more information. + for more information. If you say Y here, the AppleTalk-IP support will be compiled into the kernel. In this case, you can either use encapsulation or @@ -107,4 +107,4 @@ config IPDDP_ENCAP IP packets inside AppleTalk frames; this is useful if your Linux box is stuck on an AppleTalk network (which hopefully contains a decapsulator somewhere). Please see - for more information. + for more information. 
-- cgit From 9de1fcdf36e7e00693a260865a5f2a58af1c7040 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:46 +0200 Subject: docs: networking: convert ip_dynaddr.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ip_dynaddr.rst | 40 +++++++++++++++++++++++++++++++++ Documentation/networking/ip_dynaddr.txt | 29 ------------------------ 3 files changed, 41 insertions(+), 29 deletions(-) create mode 100644 Documentation/networking/ip_dynaddr.rst delete mode 100644 Documentation/networking/ip_dynaddr.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index cf85d0a73144..f81aeb87aa28 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -66,6 +66,7 @@ Contents: hinic ila ipddp + ip_dynaddr .. only:: subproject and html diff --git a/Documentation/networking/ip_dynaddr.rst b/Documentation/networking/ip_dynaddr.rst new file mode 100644 index 000000000000..eacc0c780c7f --- /dev/null +++ b/Documentation/networking/ip_dynaddr.rst @@ -0,0 +1,40 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +IP dynamic address hack-port v0.03 +================================== + +This stuff allows diald ONESHOT connections to get established by +dynamically changing packet source address (and socket's if local procs). +It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2). + +If enabled\ [#]_ and forwarding interface has changed: + + 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS + while in SYN_SENT state (diald-box processes). 
+ 2) Out-bounded MASQueraded source address changes ON OUTPUT (when + internal host does retransmission) until a packet from outside is + received by the tunnel. + +This is specially helpful for auto dialup links (diald), where the +``actual`` outgoing address is unknown at the moment the link is +going up. So, the *same* (local AND masqueraded) connections requests that +bring the link up will be able to get established. + +.. [#] At boot, by default no address rewriting is attempted. + + To enable:: + + # echo 1 > /proc/sys/net/ipv4/ip_dynaddr + + To enable verbose mode:: + + # echo 2 > /proc/sys/net/ipv4/ip_dynaddr + + To disable (default):: + + # echo 0 > /proc/sys/net/ipv4/ip_dynaddr + +Enjoy! + +Juanjo diff --git a/Documentation/networking/ip_dynaddr.txt b/Documentation/networking/ip_dynaddr.txt deleted file mode 100644 index 45f3c1268e86..000000000000 --- a/Documentation/networking/ip_dynaddr.txt +++ /dev/null @@ -1,29 +0,0 @@ -IP dynamic address hack-port v0.03 -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This stuff allows diald ONESHOT connections to get established by -dynamically changing packet source address (and socket's if local procs). -It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2). - -If enabled[*] and forwarding interface has changed: - 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS - while in SYN_SENT state (diald-box processes). - 2) Out-bounded MASQueraded source address changes ON OUTPUT (when - internal host does retransmission) until a packet from outside is - received by the tunnel. - -This is specially helpful for auto dialup links (diald), where the -``actual'' outgoing address is unknown at the moment the link is -going up. So, the *same* (local AND masqueraded) connections requests that -bring the link up will be able to get established. - -[*] At boot, by default no address rewriting is attempted. 
- To enable: - # echo 1 > /proc/sys/net/ipv4/ip_dynaddr - To enable verbose mode: - # echo 2 > /proc/sys/net/ipv4/ip_dynaddr - To disable (default) - # echo 0 > /proc/sys/net/ipv4/ip_dynaddr - -Enjoy! - --- Juanjo -- cgit From aac86c887ed66ac4f467821ebf75373124a148d7 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:47 +0200 Subject: docs: networking: convert iphase.txt to ReST - add SPDX header; - adjust title using the proper markup; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/iphase.rst | 193 ++++++++++++++++++++++++++++++++++++ Documentation/networking/iphase.txt | 158 ----------------------------- drivers/atm/Kconfig | 2 +- 4 files changed, 195 insertions(+), 159 deletions(-) create mode 100644 Documentation/networking/iphase.rst delete mode 100644 Documentation/networking/iphase.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index f81aeb87aa28..505eaa41ca2b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -67,6 +67,7 @@ Contents: ila ipddp ip_dynaddr + iphase .. only:: subproject and html diff --git a/Documentation/networking/iphase.rst b/Documentation/networking/iphase.rst new file mode 100644 index 000000000000..92d9b757d75a --- /dev/null +++ b/Documentation/networking/iphase.rst @@ -0,0 +1,193 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +ATM (i)Chip IA Linux Driver Source +================================== + + READ ME FISRT + +-------------------------------------------------------------------------------- + + Read This Before You Begin! 
+ +-------------------------------------------------------------------------------- + +Description +=========== + +This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver +source release. + +The features and limitations of this driver are as follows: + + - A single VPI (VPI value of 0) is supported. + - Supports 4K VCs for the server board (with 512K control memory) and 1K + VCs for the client board (with 128K control memory). + - UBR, ABR and CBR service categories are supported. + - Only AAL5 is supported. + - Supports setting of PCR on the VCs. + - Multiple adapters in a system are supported. + - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, + including x575 (OC3, control memory 128K , 512K and packet memory 128K, + 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See + http://www.iphase.com/ + for details. + - Only x86 platforms are supported. + - SMP is supported. + + +Before You Start +================ + + +Installation +------------ + +1. Installing the adapters in the system + + To install the ATM adapters in the system, follow the steps below. + + a. Login as root. + b. Shut down the system and power off the system. + c. Install one or more ATM adapters in the system. + d. Connect each adapter to a port on an ATM switch. The green 'Link' + LED on the front panel of the adapter will be on if the adapter is + connected to the switch properly when the system is powered up. + e. Power on and boot the system. + +2. [ Removed ] + +3. Rebuild kernel with ABR support + + [ a. and b. removed ] + + c. Reconfigure the kernel, choose the Interphase ia driver through "make + menuconfig" or "make xconfig". + d. Rebuild the kernel, loadable modules and the atm tools. + e. Install the new built kernel and modules and reboot. + +4. Load the adapter hardware driver (ia driver) if it is built as a module + + a. Login as root. + b. Change directory to /lib/modules//atm. + c. 
Run "insmod suni.o;insmod iphase.o" + The yellow 'status' LED on the front panel of the adapter will blink + while the driver is loaded in the system. + d. To verify that the 'ia' driver is loaded successfully, run the + following command:: + + cat /proc/atm/devices + + If the driver is loaded successfully, the output of the command will + be similar to the following lines:: + + Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ... + 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 ) + + You can also check the system log file /var/log/messages for messages + related to the ATM driver. + +5. Ia Driver Configuration + +5.1 Configuration of adapter buffers + The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and + 1M. The RAM size decides the number of buffers and buffer size. The default + size and number of buffers are set as following: + + ========= ======= ====== ====== ====== ====== ====== + Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf + RAM size size size size size cnt cnt + ========= ======= ====== ====== ====== ====== ====== + 128K 64K 64K 10K 10K 6 6 + 512K 256K 256K 10K 10K 25 25 + 1M 512K 512K 10K 10K 51 51 + ========= ======= ====== ====== ====== ====== ====== + + These setting should work well in most environments, but can be + changed by typing the following command:: + + insmod /ia.o IA_RX_BUF= IA_RX_BUF_SZ= \ + IA_TX_BUF= IA_TX_BUF_SZ= + + Where: + + - RX_CNT = number of receive buffers in the range (1-128) + - RX_SIZE = size of receive buffers in the range (48-64K) + - TX_CNT = number of transmit buffers in the range (1-128) + - TX_SIZE = size of transmit buffers in the range (48-64K) + + 1. Transmit and receive buffer size must be a multiple of 4. + 2. Care should be taken so that the memory required for the + transmit and receive buffers is less than or equal to the + total adapter packet memory. 
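The two rules that close section 5.1 (buffer sizes must be a multiple of 4, and the buffers must fit in the adapter packet memory) can be checked before any module options are passed to insmod. The sketch below is an illustration only and is not part of the driver or of this patch. The helper name `check_ia_buffers` is hypothetical, and it assumes that sizes are given in bytes, that "K" in the table means 1024 bytes, and that the documented ranges (1-128 and 48-64K) are inclusive.

```python
KIB = 1024  # assumption: "K" in the table means 1024 bytes


def check_ia_buffers(packet_ram, rx_cnt, rx_size, tx_cnt, tx_size):
    """Return a list of problems with a proposed buffer layout (empty if none).

    Hypothetical helper: it only mirrors the rules stated in section 5.1
    and does not talk to the driver.
    """
    problems = []

    # Buffer counts: documented range is 1-128.
    for name, cnt in (("RX_CNT", rx_cnt), ("TX_CNT", tx_cnt)):
        if not 1 <= cnt <= 128:
            problems.append(f"{name}={cnt} is outside 1-128")

    # Buffer sizes: documented range is 48-64K, and a multiple of 4.
    for name, size in (("RX_SIZE", rx_size), ("TX_SIZE", tx_size)):
        if not 48 <= size <= 64 * KIB:
            problems.append(f"{name}={size} is outside 48-{64 * KIB}")
        if size % 4 != 0:
            problems.append(f"{name}={size} is not a multiple of 4")

    # Receive and transmit buffers together must fit in packet memory.
    needed = rx_cnt * rx_size + tx_cnt * tx_size
    if needed > packet_ram:
        problems.append(f"buffers need {needed} bytes, adapter has {packet_ram}")

    return problems


if __name__ == "__main__":
    # The three default rows of the table in section 5.1.
    for ram_kib, count in ((128, 6), (512, 25), (1024, 51)):
        issues = check_ia_buffers(ram_kib * KIB, count, 10 * KIB, count, 10 * KIB)
        print(ram_kib, "K:", issues or "ok")
```

Under these assumptions all three default rows of the table pass, because the default count is the largest number of 10K buffers that fits in each half of the packet RAM.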
+ +5.2 Turn on ia debug trace + + When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver + can provide more debug trace if needed. There is a bit mask variable, + IADebugFlag, which controls the output of the traces. You can find the bit + map of the IADebugFlag in iphase.h. + The debug trace can be turn on through the insmod command line option, for + example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug + traces together with loading the driver. + +6. Ia Driver Test Using ttcp_atm and PVC + + For the PVC setup, the test machines can either be connected back-to-back or + through a switch. If connected through the switch, the switch must be + configured for the PVC(s). + + a. For UBR test: + + At the test machine intended to receive data, type:: + + ttcp_atm -r -a -s 0.100 + + At the other test machine, type:: + + ttcp_atm -t -a -s 0.100 -n 10000 + + Run "ttcp_atm -h" to display more options of the ttcp_atm tool. + b. For ABR test: + + It is the same as the UBR testing, but with an extra command option:: + + -Pabr:max_pcr= + + where: + + xxx = the maximum peak cell rate, from 170 - 353207. + + This option must be set on both the machines. + + c. For CBR test: + + It is the same as the UBR testing, but with an extra command option:: + + -Pcbr:max_pcr= + + where: + + xxx = the maximum peak cell rate, from 170 - 353207. + + This option may only be set on the transmit machine. 
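The max_pcr range quoted for the ABR and CBR tests (170 - 353207) is easier to judge when converted to a bit rate. The following sketch is an illustration and not part of the patch. It assumes the peak cell rate is expressed in cells per second and uses the standard ATM cell layout of 53 bytes, of which 48 are payload. The function name `pcr_to_bps` is hypothetical.

```python
ATM_CELL_BYTES = 53     # 5-byte header + 48-byte payload
ATM_PAYLOAD_BYTES = 48


def pcr_to_bps(cells_per_second, payload_only=False):
    """Convert a peak cell rate to bits per second.

    Assumes the rate is in cells per second. With payload_only=True the
    5-byte cell header is excluded from the result.
    """
    cell_bytes = ATM_PAYLOAD_BYTES if payload_only else ATM_CELL_BYTES
    return cells_per_second * cell_bytes * 8


if __name__ == "__main__":
    for pcr in (170, 353207):
        print(pcr, "cells/s =", pcr_to_bps(pcr), "bit/s on the wire,",
              pcr_to_bps(pcr, payload_only=True), "bit/s of cell payload")
```

Under that assumption the upper bound of 353207 works out to roughly 149.76 Mbit/s including cell headers, and the lower bound of 170 to about 72 kbit/s.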
+ + +Outstanding Issues +================== + + + +Contact Information +------------------- + +:: + + Customer Support: + United States: Telephone: (214) 654-5555 + Fax: (214) 654-5500 + E-Mail: intouch@iphase.com + Europe: Telephone: 33 (0)1 41 15 44 00 + Fax: 33 (0)1 41 15 12 13 + World Wide Web: http://www.iphase.com + Anonymous FTP: ftp.iphase.com diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt deleted file mode 100644 index 670b72f16585..000000000000 --- a/Documentation/networking/iphase.txt +++ /dev/null @@ -1,158 +0,0 @@ - - READ ME FISRT - ATM (i)Chip IA Linux Driver Source --------------------------------------------------------------------------------- - Read This Before You Begin! --------------------------------------------------------------------------------- - -Description ------------ - -This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver -source release. - -The features and limitations of this driver are as follows: - - A single VPI (VPI value of 0) is supported. - - Supports 4K VCs for the server board (with 512K control memory) and 1K - VCs for the client board (with 128K control memory). - - UBR, ABR and CBR service categories are supported. - - Only AAL5 is supported. - - Supports setting of PCR on the VCs. - - Multiple adapters in a system are supported. - - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, - including x575 (OC3, control memory 128K , 512K and packet memory 128K, - 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See - http://www.iphase.com/ - for details. - - Only x86 platforms are supported. - - SMP is supported. - - -Before You Start ----------------- - - -Installation ------------- - -1. Installing the adapters in the system - To install the ATM adapters in the system, follow the steps below. - a. Login as root. - b. Shut down the system and power off the system. - c. Install one or more ATM adapters in the system. - d. 
Connect each adapter to a port on an ATM switch. The green 'Link' - LED on the front panel of the adapter will be on if the adapter is - connected to the switch properly when the system is powered up. - e. Power on and boot the system. - -2. [ Removed ] - -3. Rebuild kernel with ABR support - [ a. and b. removed ] - c. Reconfigure the kernel, choose the Interphase ia driver through "make - menuconfig" or "make xconfig". - d. Rebuild the kernel, loadable modules and the atm tools. - e. Install the new built kernel and modules and reboot. - -4. Load the adapter hardware driver (ia driver) if it is built as a module - a. Login as root. - b. Change directory to /lib/modules//atm. - c. Run "insmod suni.o;insmod iphase.o" - The yellow 'status' LED on the front panel of the adapter will blink - while the driver is loaded in the system. - d. To verify that the 'ia' driver is loaded successfully, run the - following command: - - cat /proc/atm/devices - - If the driver is loaded successfully, the output of the command will - be similar to the following lines: - - Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ... - 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 ) - - You can also check the system log file /var/log/messages for messages - related to the ATM driver. - -5. Ia Driver Configuration - -5.1 Configuration of adapter buffers - The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and - 1M. The RAM size decides the number of buffers and buffer size. 
The default - size and number of buffers are set as following: - - Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf - RAM size size size size size cnt cnt - -------- ------ ------ ------ ------ ------ ------ - 128K 64K 64K 10K 10K 6 6 - 512K 256K 256K 10K 10K 25 25 - 1M 512K 512K 10K 10K 51 51 - - These setting should work well in most environments, but can be - changed by typing the following command: - - insmod /ia.o IA_RX_BUF= IA_RX_BUF_SZ= \ - IA_TX_BUF= IA_TX_BUF_SZ= - Where: - RX_CNT = number of receive buffers in the range (1-128) - RX_SIZE = size of receive buffers in the range (48-64K) - TX_CNT = number of transmit buffers in the range (1-128) - TX_SIZE = size of transmit buffers in the range (48-64K) - - 1. Transmit and receive buffer size must be a multiple of 4. - 2. Care should be taken so that the memory required for the - transmit and receive buffers is less than or equal to the - total adapter packet memory. - -5.2 Turn on ia debug trace - - When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver - can provide more debug trace if needed. There is a bit mask variable, - IADebugFlag, which controls the output of the traces. You can find the bit - map of the IADebugFlag in iphase.h. - The debug trace can be turn on through the insmod command line option, for - example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug - traces together with loading the driver. - -6. Ia Driver Test Using ttcp_atm and PVC - - For the PVC setup, the test machines can either be connected back-to-back or - through a switch. If connected through the switch, the switch must be - configured for the PVC(s). - - a. For UBR test: - At the test machine intended to receive data, type: - ttcp_atm -r -a -s 0.100 - At the other test machine, type: - ttcp_atm -t -a -s 0.100 -n 10000 - Run "ttcp_atm -h" to display more options of the ttcp_atm tool. - b. 
For ABR test: - It is the same as the UBR testing, but with an extra command option: - -Pabr:max_pcr= - where: - xxx = the maximum peak cell rate, from 170 - 353207. - This option must be set on both the machines. - c. For CBR test: - It is the same as the UBR testing, but with an extra command option: - -Pcbr:max_pcr= - where: - xxx = the maximum peak cell rate, from 170 - 353207. - This option may only be set on the transmit machine. - - -OUTSTANDING ISSUES ------------------- - - - -Contact Information -------------------- - - Customer Support: - United States: Telephone: (214) 654-5555 - Fax: (214) 654-5500 - E-Mail: intouch@iphase.com - Europe: Telephone: 33 (0)1 41 15 44 00 - Fax: 33 (0)1 41 15 12 13 - World Wide Web: http://www.iphase.com - Anonymous FTP: ftp.iphase.com diff --git a/drivers/atm/Kconfig b/drivers/atm/Kconfig index 4af7cbdcc349..cfb0d16b60ad 100644 --- a/drivers/atm/Kconfig +++ b/drivers/atm/Kconfig @@ -306,7 +306,7 @@ config ATM_IA for more info about the cards. Say Y (or M to compile as a module named iphase) here if you have one of these cards. - See the file for further + See the file for further details. config ATM_IA_DEBUG -- cgit From 355e656e017c3b42deb57d125d86c4cbd277d6db Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:48 +0200 Subject: docs: networking: convert ipsec.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipsec.rst | 46 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/ipsec.txt | 38 ------------------------------- 3 files changed, 47 insertions(+), 38 deletions(-) create mode 100644 Documentation/networking/ipsec.rst delete mode 100644 Documentation/networking/ipsec.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 505eaa41ca2b..3efb4608649a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -68,6 +68,7 @@ Contents: ipddp ip_dynaddr iphase + ipsec .. only:: subproject and html diff --git a/Documentation/networking/ipsec.rst b/Documentation/networking/ipsec.rst new file mode 100644 index 000000000000..afe9d7b48be3 --- /dev/null +++ b/Documentation/networking/ipsec.rst @@ -0,0 +1,46 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +IPsec +===== + + +Here documents known IPsec corner cases which need to be keep in mind when +deploy various IPsec configuration in real world production environment. + +1. IPcomp: + Small IP packet won't get compressed at sender, and failed on + policy check on receiver. + +Quote from RFC3173:: + + 2.2. Non-Expansion Policy + + If the total size of a compressed payload and the IPComp header, as + defined in section 3, is not smaller than the size of the original + payload, the IP datagram MUST be sent in the original non-compressed + form. To clarify: If an IP datagram is sent non-compressed, no + + IPComp header is added to the datagram. This policy ensures saving + the decompression processing cycles and avoiding incurring IP + datagram fragmentation when the expanded datagram is larger than the + MTU. + + Small IP datagrams are likely to expand as a result of compression. + Therefore, a numeric threshold should be applied before compression, + where IP datagrams of size smaller than the threshold are sent in the + original form without attempting compression. 
The numeric threshold + is implementation dependent. + +Current IPComp implementation is indeed by the book, while as in practice +when sending non-compressed packet to the peer (whether or not packet len +is smaller than the threshold or the compressed len is larger than original +packet len), the packet is dropped when checking the policy as this packet +matches the selector but not coming from any XFRM layer, i.e., with no +security path. Such naked packet will not eventually make it to upper layer. +The result is much more wired to the user when ping peer with different +payload length. + +One workaround is try to set "level use" for each policy if user observed +above scenario. The consequence of doing so is small packet(uncompressed) +will skip policy checking on receiver side. diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt deleted file mode 100644 index ba794b7e51be..000000000000 --- a/Documentation/networking/ipsec.txt +++ /dev/null @@ -1,38 +0,0 @@ - -Here documents known IPsec corner cases which need to be keep in mind when -deploy various IPsec configuration in real world production environment. - -1. IPcomp: Small IP packet won't get compressed at sender, and failed on - policy check on receiver. - -Quote from RFC3173: -2.2. Non-Expansion Policy - - If the total size of a compressed payload and the IPComp header, as - defined in section 3, is not smaller than the size of the original - payload, the IP datagram MUST be sent in the original non-compressed - form. To clarify: If an IP datagram is sent non-compressed, no - - IPComp header is added to the datagram. This policy ensures saving - the decompression processing cycles and avoiding incurring IP - datagram fragmentation when the expanded datagram is larger than the - MTU. - - Small IP datagrams are likely to expand as a result of compression. 
- Therefore, a numeric threshold should be applied before compression, - where IP datagrams of size smaller than the threshold are sent in the - original form without attempting compression. The numeric threshold - is implementation dependent. - -Current IPComp implementation is indeed by the book, while as in practice -when sending non-compressed packet to the peer (whether or not packet len -is smaller than the threshold or the compressed len is larger than original -packet len), the packet is dropped when checking the policy as this packet -matches the selector but not coming from any XFRM layer, i.e., with no -security path. Such naked packet will not eventually make it to upper layer. -The result is much more wired to the user when ping peer with different -payload length. - -One workaround is try to set "level use" for each policy if user observed -above scenario. The consequence of doing so is small packet(uncompressed) -will skip policy checking on receiver side. -- cgit From 1cec2cacaaec5d53adc04dd3ecfdb687b26c0e89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:49 +0200 Subject: docs: networking: convert ip-sysctl.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark lists as such; - mark tables as such; - use footnote markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/sysctl/net.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/ip-sysctl.rst | 2649 +++++++++++++++++++++++ Documentation/networking/ip-sysctl.txt | 2374 -------------------- Documentation/networking/snmp_counter.rst | 2 +- net/Kconfig | 2 +- net/ipv4/Kconfig | 2 +- net/ipv4/icmp.c | 2 +- 9 files changed, 2656 insertions(+), 2380 deletions(-) create mode 100644 Documentation/networking/ip-sysctl.rst delete mode 100644 Documentation/networking/ip-sysctl.txt diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b23ab11587a6..e37db6f1be64 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4910,7 +4910,7 @@ Set the number of tcp_metrics_hash slots. Default value is 8192 or 16384 depending on total ram pages. This is used to specify the TCP metrics - cache size. See Documentation/networking/ip-sysctl.txt + cache size. See Documentation/networking/ip-sysctl.rst "tcp_no_metrics_save" section for more details. tdfx= [HW,DRM] diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index e043c9213388..84e3348a9543 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -353,7 +353,7 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings ------------------------------------- -Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +Please see: Documentation/networking/ip-sysctl.rst and ipvs-sysctl.txt for descriptions of these entries. 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3efb4608649a..7d133d8dbe2a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -69,6 +69,7 @@ Contents: ip_dynaddr iphase ipsec + ip-sysctl .. only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst new file mode 100644 index 000000000000..38f811d4b2f0 --- /dev/null +++ b/Documentation/networking/ip-sysctl.rst @@ -0,0 +1,2649 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +IP Sysctl +========= + +/proc/sys/net/ipv4/* Variables +============================== + +ip_forward - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Forward Packets between interfaces. + + This variable is special, its change resets all configuration + parameters to their default state (RFC1122 for hosts, RFC1812 + for routers) + +ip_default_ttl - INTEGER + Default value of TTL field (Time To Live) for outgoing (but not + forwarded) IP packets. Should be between 1 and 255 inclusive. + Default: 64 (as recommended by RFC1700) + +ip_no_pmtu_disc - INTEGER + Disable Path MTU Discovery. If enabled in mode 1 and a + fragmentation-required ICMP is received, the PMTU to this + destination will be set to min_pmtu (see below). You will need + to raise min_pmtu to the smallest interface MTU on your system + manually if you want to avoid locally generated fragments. + + In mode 2 incoming Path MTU Discovery messages will be + discarded. Outgoing frames are handled the same as in mode 1, + implicitly setting IP_PMTUDISC_DONT on every created socket. + + Mode 3 is a hardened pmtu discover mode. The kernel will only + accept fragmentation-needed errors if the underlying protocol + can verify them besides a plain socket lookup. Current + protocols for which pmtu events will be honored are TCP, SCTP + and DCCP as they verify e.g. the sequence number or the + association. 
This mode should not be enabled globally but is + only intended to secure e.g. name servers in namespaces where + TCP path mtu must still work but path MTU information of other + protocols should be discarded. If enabled globally this mode + could break other protocols. + + Possible values: 0-3 + + Default: FALSE + +min_pmtu - INTEGER + default 552 - minimum discovered Path MTU + +ip_forward_use_pmtu - BOOLEAN + By default we don't trust protocol path MTUs while forwarding + because they could be easily forged and can lead to unwanted + fragmentation by the router. + You only need to enable this if you have user-space software + which tries to discover path mtus by itself and depends on the + kernel honoring this information. This is normally not the + case. + + Default: 0 (disabled) + + Possible values: + + - 0 - disabled + - 1 - enabled + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv4 reply packets that are not + associated with a socket for example, TCP RSTs or ICMP echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + + Default: 0 + +fib_multipath_use_neigh - BOOLEAN + Use status of existing neighbor entry when determining nexthop for + multipath routes. If disabled, neighbor information is not used and + packets could be directed to a failed nexthop. Only valid for kernels + built with CONFIG_IP_ROUTE_MULTIPATH enabled. + + Default: 0 (disabled) + + Possible values: + + - 0 - disabled + - 1 - enabled + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. Only valid + for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. + + Default: 0 (Layer 3) + + Possible values: + + - 0 - Layer 3 + - 1 - Layer 4 + - 2 - Layer 3 or inner Layer 3 if present + +fib_sync_mem - UNSIGNED INTEGER + Amount of dirty memory from fib entries that can be backlogged before + synchronize_rcu is forced. 
+ + Default: 512kB Minimum: 64kB Maximum: 64MB + +ip_forward_update_priority - INTEGER + Whether to update SKB priority from "TOS" field in IPv4 header after it + is forwarded. The new SKB priority is mapped from TOS field value + according to an rt_tos2priority table (see e.g. man tc-prio). + + Default: 1 (Update priority.) + + Possible values: + + - 0 - Do not update priority. + - 1 - Update priority. + +route/max_size - INTEGER + Maximum number of routes allowed in the kernel. Increase + this when using large numbers of interfaces and/or routes. + + From linux kernel 3.6 onwards, this is deprecated for ipv4 + as route cache is no longer used. + +neigh/default/gc_thresh1 - INTEGER + Minimum number of entries to keep. Garbage collector will not + purge entries if there are fewer than this number. + + Default: 128 + +neigh/default/gc_thresh2 - INTEGER + Threshold when garbage collector becomes more aggressive about + purging entries. Entries older than 5 seconds will be cleared + when over this number. + + Default: 512 + +neigh/default/gc_thresh3 - INTEGER + Maximum number of non-PERMANENT neighbor entries allowed. Increase + this when using large numbers of interfaces and when communicating + with large numbers of directly-connected peers. + + Default: 1024 + +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + + Setting negative value is meaningless and will return error. + + Default: SK_WMEM_MAX, (same as net.core.wmem_default). + + Exact value depends on architecture and kernel options, + but should be enough to allow queuing 256 packets + of medium size. + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + + (deprecated in linux 3.3) : use unres_qlen_bytes instead. 
+ + Prior to linux 3.3, the default value is 3 which may cause + unexpected packet loss. The current default value is calculated + according to default value of unres_qlen_bytes and true size of + packet. + + Default: 101 + +mtu_expires - INTEGER + Time, in seconds, that cached PMTU information is kept. + +min_adv_mss - INTEGER + The advertised MSS depends on the first hop route MTU, but will + never be lower than this setting. + +IP Fragmentation: + +ipfrag_high_thresh - LONG INTEGER + Maximum memory used to reassemble IP fragments. + +ipfrag_low_thresh - LONG INTEGER + (Obsolete since linux-4.17) + Maximum memory used to reassemble IP fragments before the kernel + begins to remove incomplete fragment queues to free up resources. + The kernel still accepts new fragments for defragmentation. + +ipfrag_time - INTEGER + Time in seconds to keep an IP fragment in memory. + +ipfrag_max_dist - INTEGER + ipfrag_max_dist is a non-negative integer value which defines the + maximum "disorder" which is allowed among fragments which share a + common IP source address. Note that reordering of packets is + not unusual, but if a large number of fragments arrive from a source + IP address while a particular fragment queue remains incomplete, it + probably indicates that one or more fragments belonging to that queue + have been lost. When ipfrag_max_dist is positive, an additional check + is done on fragments before they are added to a reassembly queue - if + ipfrag_max_dist (or more) fragments have arrived from a particular IP + address between additions to any IP fragment queue using that source + address, it's presumed that one or more fragments in the queue are + lost. The existing fragment queue will be dropped, and a new one + started. An ipfrag_max_dist value of zero disables this check. + + Using a very small value, e.g. 
1 or 2, for ipfrag_max_dist can + result in unnecessarily dropping fragment queues when normal + reordering of packets occurs, which could lead to poor application + performance. Using a very large value, e.g. 50000, increases the + likelihood of incorrectly reassembling IP fragments that originate + from different IP datagrams, which could result in data corruption. + Default: 64 + +INET peer storage +================= + +inet_peer_threshold - INTEGER + The approximate size of the storage. Starting from this threshold + entries will be thrown aggressively. This threshold also determines + entries' time-to-live and time intervals between garbage collection + passes. More entries, less time-to-live, less GC interval. + +inet_peer_minttl - INTEGER + Minimum time-to-live of entries. Should be enough to cover fragment + time-to-live on the reassembling side. This minimum time-to-live is + guaranteed if the pool size is less than inet_peer_threshold. + Measured in seconds. + +inet_peer_maxttl - INTEGER + Maximum time-to-live of entries. Unused entries will expire after + this period of time if there is no memory pressure on the pool (i.e. + when the number of entries in the pool is very small). + Measured in seconds. + +TCP variables +============= + +somaxconn - INTEGER + Limit of socket listen() backlog, known in userspace as SOMAXCONN. + Defaults to 4096. (Was 128 before linux-5.4) + See also tcp_max_syn_backlog for additional tuning for TCP sockets. + +tcp_abort_on_overflow - BOOLEAN + If listening service is too slow to accept new connections, + reset them. Default state is FALSE. It means that if overflow + occurred due to a burst, connection will recover. Enable this + option _only_ if you are really sure that listening daemon + cannot be tuned to accept connections faster. Enabling this + option can harm clients of your server. 
+ +tcp_adv_win_scale - INTEGER + Count buffering overhead as bytes/2^tcp_adv_win_scale + (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), + if it is <= 0. + + Possible values are [-31, 31], inclusive. + + Default: 1 + +tcp_allowed_congestion_control - STRING + Show/set the congestion control choices available to non-privileged + processes. The list is a subset of those listed in + tcp_available_congestion_control. + + Default is "reno" and the default setting (tcp_congestion_control). + +tcp_app_win - INTEGER + Reserve max(window/2^tcp_app_win, mss) of window for application + buffer. Value 0 is special, it means that nothing is reserved. + + Default: 31 + +tcp_autocorking - BOOLEAN + Enable TCP auto corking : + When applications do consecutive small write()/sendmsg() system calls, + we try to coalesce these small writes as much as possible, to lower + total amount of sent packets. This is done if at least one prior + packet for the flow is waiting in Qdisc queues or device transmit + queue. Applications can still use TCP_CORK for optimal behavior + when they know how/when to uncork their sockets. + + Default : 1 + +tcp_available_congestion_control - STRING + Shows the available congestion control choices that are registered. + More congestion control algorithms may be available as modules, + but not loaded. + +tcp_base_mss - INTEGER + The initial value of search_low to be used by the packetization layer + Path MTU discovery (MTU probing). If MTU probing is enabled, + this is the initial MSS used by the connection. + +tcp_mtu_probe_floor - INTEGER + If MTU probing is enabled this caps the minimum MSS used for search_low + for the connection. + + Default : 48 + +tcp_min_snd_mss - INTEGER + TCP SYN and SYNACK messages usually advertise an ADVMSS option, + as described in RFC 1122 and RFC 6691. + + If this ADVMSS option is smaller than tcp_min_snd_mss, + it is silently capped to tcp_min_snd_mss. 
+ + Default : 48 (at least 8 bytes of payload per segment) + +tcp_congestion_control - STRING + Set the congestion control algorithm to be used for new + connections. The algorithm "reno" is always available, but + additional choices may be available based on kernel configuration. + Default is set as part of kernel configuration. + For passive connections, the listener congestion control choice + is inherited. + + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] + +tcp_dsack - BOOLEAN + Allows TCP to send "duplicate" SACKs. + +tcp_early_retrans - INTEGER + Tail loss probe (TLP) converts RTOs occurring due to tail + losses into fast recovery (draft-ietf-tcpm-rack). Note that + TLP requires RACK to function properly (see tcp_recovery below) + + Possible values: + + - 0 disables TLP + - 3 or 4 enables TLP + + Default: 3 + +tcp_ecn - INTEGER + Control use of Explicit Congestion Notification (ECN) by TCP. + ECN is used only when both ends of the TCP connection indicate + support for it. This feature is useful in avoiding losses due + to congestion by allowing supporting routers to signal + congestion before having to drop packets. + + Possible values are: + + = ===================================================== + 0 Disable ECN. Neither initiate nor accept ECN. + 1 Enable ECN when requested by incoming connections and + also request ECN on outgoing connection attempts. + 2 Enable ECN when requested by incoming connections + but do not request ECN on outgoing connections. + = ===================================================== + + Default: 2 + +tcp_ecn_fallback - BOOLEAN + If the kernel detects that ECN connection misbehaves, enable fall + back to non-ECN. Currently, this knob implements the fallback + from RFC3168, section 6.1.1.1., but we reserve that in future, + additional detection mechanisms could be implemented under this + knob. The value is not used, if tcp_ecn or per route (or congestion + control) ECN settings are disabled. 
+ + Default: 1 (fallback enabled) + +tcp_fack - BOOLEAN + This is a legacy option, it has no effect anymore. + +tcp_fin_timeout - INTEGER + The length of time an orphaned (no longer referenced by any + application) connection will remain in the FIN_WAIT_2 state + before it is aborted at the local end. While a perfectly + valid "receive only" state for an un-orphaned connection, an + orphaned connection in FIN_WAIT_2 state could otherwise wait + forever for the remote to close its end of the connection. + + Cf. tcp_max_orphans + + Default: 60 seconds + +tcp_frto - INTEGER + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. + F-RTO is an enhanced recovery algorithm for TCP retransmission + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. + +tcp_fwmark_accept - BOOLEAN + If set, incoming connections to listening sockets that do not have a + socket mark will set the mark of the accepting socket to the fwmark of + the incoming SYN packet. This will cause all packets on that connection + (starting from the first SYNACK) to be sent with that fwmark. The + listening socket's mark is unchanged. Listening sockets that already + have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are + unaffected. 
+ + Default: 0 + +tcp_invalid_ratelimit - INTEGER + Limit the maximal rate for sending duplicate acknowledgments + in response to incoming TCP packets that are for an existing + connection but that are invalid due to any of these reasons: + + (a) out-of-window sequence number, + (b) out-of-window acknowledgment number, or + (c) PAWS (Protection Against Wrapped Sequence numbers) check failure + + This can help mitigate simple "ack loop" DoS attacks, wherein + a buggy or malicious middlebox or man-in-the-middle can + rewrite TCP header fields in a manner that causes each endpoint + to think that the other is sending invalid TCP segments, thus + causing each side to send an unterminating stream of duplicate + acknowledgments for invalid segments. + + Using 0 disables rate-limiting of dupacks in response to + invalid segments; otherwise this value specifies the minimal + space between sending such dupacks, in milliseconds. + + Default: 500 (milliseconds). + +tcp_keepalive_time - INTEGER + How often TCP sends out keepalive messages when keepalive is enabled. + Default: 2 hours. + +tcp_keepalive_probes - INTEGER + How many keepalive probes TCP sends out, until it decides that the + connection is broken. Default value: 9. + +tcp_keepalive_intvl - INTEGER + How frequently the probes are sent out. Multiplied by + tcp_keepalive_probes, it gives the time after which an unresponsive + connection is killed once probing has started. Default value: 75 sec, + i.e. the connection will be aborted after ~11 minutes of retries. + +tcp_l3mdev_accept - BOOLEAN + Enables child sockets to inherit the L3 master device index. + Enabling this option allows a "global" listen socket to work + across L3 master domains (e.g., VRFs) with connected sockets + derived from the listen socket to be bound to the L3 domain in + which the packets originated. Only valid when the kernel was + compiled with CONFIG_NET_L3_MASTER_DEV. + + Default: 0 (disabled) + +tcp_low_latency - BOOLEAN + This is a legacy option, it has no effect anymore.
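
The keepalive arithmetic quoted for tcp_keepalive_intvl ("~11 minutes of retries") can be checked with a small sketch. This is illustrative Python, not kernel code; the function name is hypothetical, and it assumes the simple model that the first probe goes out after tcp_keepalive_time of idleness and the connection is dropped once tcp_keepalive_probes probes, spaced tcp_keepalive_intvl apart, have gone unanswered.

```python
def keepalive_detection_time(time_s=7200, probes=9, intvl_s=75):
    """Return (seconds spent probing, total idle seconds until abort).

    Defaults mirror the documented values: 2 hours, 9 probes, 75 seconds.
    Assumed model: idle for time_s, then `probes` unanswered probes sent
    intvl_s apart before the connection is aborted.
    """
    probing = probes * intvl_s
    return probing, time_s + probing


probing, total = keepalive_detection_time()
print(probing / 60)  # minutes spent probing with the default values
print(total)         # total seconds from last activity to abort
```

With the defaults the probing phase is 9 * 75 = 675 seconds, which is the "~11 minutes" figure; counted from the last activity on the connection, detection takes roughly 2 hours and 11 minutes.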
+ +tcp_max_orphans - INTEGER + Maximal number of TCP sockets not attached to any user file handle, + held by system. If this number is exceeded, orphaned connections are + reset immediately and a warning is printed. This limit exists + only to prevent simple DoS attacks, you _must_ not rely on this + or lower the limit artificially, but rather increase it + (probably, after increasing installed memory), + if network conditions require more than default value, + and tune network services to linger and kill such states + more aggressively. Remember: each orphan eats + up to ~64K of unswappable memory. + +tcp_max_syn_backlog - INTEGER + Maximal number of remembered connection requests (SYN_RECV), + which have not received an acknowledgment from connecting client. + + This is a per-listener limit. + + The minimal value is 128 for low memory machines, and it will + increase in proportion to the memory of machine. + + If server suffers from overload, try increasing this number. + + Remember to also check /proc/sys/net/core/somaxconn + A SYN_RECV request socket consumes about 304 bytes of memory. + +tcp_max_tw_buckets - INTEGER + Maximal number of timewait sockets held by system simultaneously. + If this number is exceeded, the time-wait socket is immediately destroyed + and a warning is printed. This limit exists only to prevent + simple DoS attacks, you _must_ not lower the limit artificially, + but rather increase it (probably, after increasing installed memory), + if network conditions require more than default value. + +tcp_mem - vector of 3 INTEGERs: min, pressure, max + min: below this number of pages TCP is not bothered about its + memory appetite. + + pressure: when amount of memory allocated by TCP exceeds this number + of pages, TCP moderates its memory consumption and enters memory + pressure mode, which is exited when memory consumption falls + under "min". + + max: number of pages allowed for queueing by all TCP sockets.
+ + Defaults are calculated at boot time from amount of available + memory. + +tcp_min_rtt_wlen - INTEGER + The window length of the windowed min filter to track the minimum RTT. + A shorter window lets a flow more quickly pick up new (higher) + minimum RTT when it is moved to a longer path (e.g., due to traffic + engineering). A longer window makes the filter more resistant to RTT + inflations such as transient congestion. The unit is seconds. + + Possible values: 0 - 86400 (1 day) + + Default: 300 + +tcp_moderate_rcvbuf - BOOLEAN + If set, TCP performs receive buffer auto-tuning, attempting to + automatically size the buffer (no greater than tcp_rmem[2]) to + match the size required by the path for full throughput. Enabled by + default. + +tcp_mtu_probing - INTEGER + Controls TCP Packetization-Layer Path MTU Discovery. Takes three + values: + + - 0 - Disabled + - 1 - Disabled by default, enabled when an ICMP black hole detected + - 2 - Always enabled, use initial MSS of tcp_base_mss. + +tcp_probe_interval - UNSIGNED INTEGER + Controls how often to start TCP Packetization-Layer Path MTU + Discovery reprobe. The default is reprobing every 10 minutes as + per RFC4821. + +tcp_probe_threshold - INTEGER + Controls when TCP Packetization-Layer Path MTU Discovery probing + will stop in respect to the width of search range in bytes. Default + is 8 bytes. + +tcp_no_metrics_save - BOOLEAN + By default, TCP saves various connection metrics in the route cache + when the connection closes, so that connections established in the + near future can use these to set initial conditions. Usually, this + increases overall performance, but may sometimes cause performance + degradation. If set, TCP will not cache metrics on closing + connections. + +tcp_no_ssthresh_metrics_save - BOOLEAN + Controls whether TCP saves ssthresh metrics in the route cache. + + Default is 1, which disables ssthresh metrics. 
+ +tcp_orphan_retries - INTEGER + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. + + The default value is 8. + + If your machine is a loaded WEB server, + you should think about lowering this value, such sockets + may consume significant resources. Cf. tcp_max_orphans. + +tcp_recovery - INTEGER + This value is a bitmap to enable various experimental loss recovery + features. + + ========= ============================================================= + RACK: 0x1 enables the RACK loss detection for fast detection of lost + retransmissions and tail drops. It also subsumes and disables + RFC6675 recovery for SACK connections. + + RACK: 0x2 makes RACK's reordering window static (min_rtt/4). + + RACK: 0x4 disables RACK's DUPACK threshold heuristic + ========= ============================================================= + + Default: 0x1 + +tcp_reordering - INTEGER + Initial reordering level of packets in a TCP stream. + TCP stack can then dynamically adjust flow reordering level + between this initial value and tcp_max_reordering + + Default: 3 + +tcp_max_reordering - INTEGER + Maximal reordering level of packets in a TCP stream. + 300 is a fairly conservative value, but you might increase it + if paths are using per packet load balancing (like bonding rr mode) + + Default: 300 + +tcp_retrans_collapse - BOOLEAN + Bug-to-bug compatibility with some broken printers. + On retransmit try to send bigger packets to work around bugs in + certain TCP stacks. + +tcp_retries1 - INTEGER + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. + + RFC 1122 recommends at least 3 retransmissions, which is the + default. 
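
The "hypothetical timeout" that the tcp_retries2 entry describes (and that tcp_retries1 refers to) can be reproduced with a short calculation. The following is a minimal Python sketch, not kernel code. It assumes TCP_RTO_MIN is 200 ms and that each RTO is capped at 120 s (TCP_RTO_MAX); neither value is stated in this document, and the function name is hypothetical.

```python
TCP_RTO_MIN = 0.2    # seconds; assumed value of the kernel constant
TCP_RTO_MAX = 120.0  # seconds; assumed cap applied to each backed-off RTO


def hypothetical_timeout(retries, rto_min=TCP_RTO_MIN, rto_max=TCP_RTO_MAX):
    """Sum the RTOs of a connection that retransmits `retries` times.

    The connection is considered dead at the (retries + 1)th RTO, so
    retries + 1 exponentially backed-off intervals are added up.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


print(round(hypothetical_timeout(15), 1))  # 924.6
print(round(hypothetical_timeout(8), 1))   # 102.2
```

Under these assumptions the default of 15 gives the documented 924.6 seconds, and 8 is the smallest value whose hypothetical timeout reaches the 100 seconds recommended by RFC 1122. Real connections start from a measured RTO, so effective timeouts are usually longer.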
+ +tcp_retries2 - INTEGER + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. + + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. + +tcp_rfc1337 - BOOLEAN + If set, the TCP stack behaves conforming to RFC1337. If unset, + we are not conforming to RFC, but prevent TCP TIME_WAIT + assassination. + + Default: 0 + +tcp_rmem - vector of 3 INTEGERs: min, default, max + min: Minimal size of receive buffer used by TCP sockets. + It is guaranteed to each TCP socket, even under moderate memory + pressure. + + Default: 4K + + default: initial size of receive buffer used by TCP sockets. + This value overrides net.core.rmem_default used by other protocols. + Default: 87380 bytes. This value results in window of 65535 with + default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit + less for default tcp_app_win. See below about these variables. + + max: maximal size of receive buffer allowed for automatically + selected receiver buffers for TCP socket. This value does not override + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 6MB, depending on RAM size. + +tcp_sack - BOOLEAN + Enable select acknowledgments (SACKS). + +tcp_comp_sack_delay_ns - LONG INTEGER + TCP tries to reduce number of SACK sent, using a timer + based on 5% of SRTT, capped by this sysctl, in nano seconds. 
+ The default is 1ms, based on TSO autosizing period. + + Default : 1,000,000 ns (1 ms) + +tcp_comp_sack_nr - INTEGER + Max number of SACK that can be compressed. + Using 0 disables SACK compression. + + Default : 44 + +tcp_slow_start_after_idle - BOOLEAN + If set, provide RFC2861 behavior and time out the congestion + window after an idle period. An idle period is defined at + the current RTO. If unset, the congestion window will not + be timed out after an idle period. + + Default: 1 + +tcp_stdurg - BOOLEAN + Use the Host requirements interpretation of the TCP urgent pointer field. + Most hosts use the older BSD interpretation, so if you turn this on + Linux might not communicate correctly with them. + + Default: FALSE + +tcp_synack_retries - INTEGER + Number of times SYNACKs for a passive TCP connection attempt will + be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to 31seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for a passive TCP connection will happen after 63seconds. + +tcp_syncookies - INTEGER + Only valid when the kernel was compiled with CONFIG_SYN_COOKIES + Send out syncookies when the syn backlog queue of a socket + overflows. This is to prevent against the common 'SYN flood attack' + Default: 1 + + Note, that syncookies is fallback facility. + It MUST NOT be used to help highly loaded servers to stand + against legal connection rate. If you see SYN flood warnings + in your logs, but investigation shows that they occur + because of overload with legal connections, you should tune + another parameters until this warning disappear. + See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. + + syncookies seriously violate TCP protocol, do not allow + to use TCP extensions, can result in serious degradation + of some services (f.e. SMTP relaying), visible not by you, + but your clients and relays, contacting you. 
If you see + SYN flood warnings in logs while not really being flooded, your server + is seriously misconfigured. + + If you want to test what effects syncookies have on your + network connections you can set this knob to 2 to enable + unconditional generation of syncookies. + +tcp_fastopen - INTEGER + Enable TCP Fast Open (RFC7413) to send and accept data in the opening + SYN packet. + + The client support is enabled by flag 0x1 (on by default). The client + then must use sendmsg() or sendto() with the MSG_FASTOPEN flag, + rather than connect() to send data in SYN. + + The server support is enabled by flag 0x2 (off by default). Then + either enable for all listeners with another flag (0x400) or + enable individual listeners via TCP_FASTOPEN socket option with + the option value being the length of the syn-data backlog. + + The values (bitmap) are + + ===== ======== ====================================================== + 0x1 (client) enables sending data in the opening SYN on the client. + 0x2 (server) enables the server support, i.e., allowing data in + a SYN packet to be accepted and passed to the + application before 3-way handshake finishes. + 0x4 (client) send data in the opening SYN regardless of cookie + availability and without a cookie option. + 0x200 (server) accept data-in-SYN w/o any cookie option present. + 0x400 (server) enable all listeners to support Fast Open by + default without explicit TCP_FASTOPEN socket option. + ===== ======== ====================================================== + + Default: 0x1 + + Note that additional client or server features are only + effective if the basic support (0x1 and 0x2) is enabled respectively. + +tcp_fastopen_blackhole_timeout_sec - INTEGER + Initial time period in seconds to disable Fastopen on active TCP sockets + when a TFO firewall blackhole issue happens.
+ This time period will grow exponentially when more blackhole issues + get detected right after Fastopen is re-enabled and will reset to + initial value when the blackhole issue goes away. + 0 to disable the blackhole detection. + + By default, it is set to 1hr. + +tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs + The list consists of a primary key and an optional backup key. The + primary key is used for both creating and validating cookies, while the + optional backup key is only used for validating cookies. The purpose of + the backup key is to maximize TFO validation when keys are rotated. + + A randomly chosen primary key may be configured by the kernel if + the tcp_fastopen sysctl is set to 0x400 (see above), or if the + TCP_FASTOPEN setsockopt() optname is set and a key has not been + previously configured via sysctl. If keys are configured via + setsockopt() by using the TCP_FASTOPEN_KEY optname, then those + per-socket keys will be used instead of any keys that are specified via + sysctl. + + A key is specified as 4 8-digit hexadecimal integers which are separated + by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be + omitted. A primary and a backup key may be specified by separating them + by a comma. If only one key is specified, it becomes the primary key and + any previously configured backup keys are removed. + +tcp_syn_retries - INTEGER + Number of times initial SYNs for an active TCP connection attempt + will be retransmitted. Should not be higher than 127. Default value + is 6, which corresponds to 63seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for an active TCP connection attempt will happen after 127seconds. + +tcp_timestamps - INTEGER + Enable timestamps as defined in RFC1323. + + - 0: Disabled. + - 1: Enable timestamps as defined in RFC1323 and use random offset for + each connection rather than only using the current time. 
+ - 2: Like 1, but without random offsets. + + Default: 1 + +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. + + Default: 2 + +tcp_pacing_ss_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in slow start, tcp_pacing_ss_ratio is applied + to let TCP probe for bigger speeds, assuming cwnd can be + doubled every other RTT. + + Default: 200 + +tcp_pacing_ca_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio + is applied to conservatively probe for bigger throughput. + + Default: 120 + +tcp_tso_win_divisor - INTEGER + This allows control over what percentage of the congestion window + can be consumed by a single TSO frame. + The setting of this parameter is a choice between burstiness and + building larger TSO frames. + + Default: 3 + +tcp_tw_reuse - INTEGER + Enable reuse of TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. + + - 0 - disable + - 1 - global enable + - 2 - enable for loopback traffic only + + It should not be changed without advice/request of technical + experts. + + Default: 2 + +tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + +tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP sockets. + Each TCP socket has rights to use it due to fact of its birth. + + Default: 4K + + default: initial size of send buffer used by TCP sockets. 
This + value overrides net.core.wmem_default used by other protocols. + + It is usually lower than net.core.wmem_default. + + Default: 16K + + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + + Default: between 64K and 4MB, depending on RAM size. + +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + +tcp_workaround_signed_windows - BOOLEAN + If set, assume no receipt of a window scaling option means the + remote TCP is broken and treats the window as a signed quantity. + If unset, assume the remote TCP is not broken even if we do + not receive a window scaling option from them. + + Default: 0 + +tcp_thin_linear_timeouts - BOOLEAN + Enable dynamic triggering of linear timeouts for thin streams. + If set, a check is performed upon retransmission by timeout to + determine if the stream is thin (less than 4 packets in flight). + As long as the stream is found to be thin, up to 6 linear + timeouts may be performed before exponential backoff mode is + initiated. This improves retransmission latency for + non-aggressive thin streams, often found to be time-dependent. 
+ For more information on thin streams, see + Documentation/networking/tcp-thin.txt + + Default: 0 + +tcp_limit_output_bytes - INTEGER + Controls TCP Small Queue limit per tcp socket. + TCP bulk sender tends to increase packets in flight until it + gets losses notifications. With SNDBUF autotuning, this can + result in a large amount of packets queued on the local machine + (e.g.: qdiscs, CPU backlog, or device) hurting latency of other + flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes + limits the number of bytes on qdisc or device to reduce artificial + RTT/cwnd and reduce bufferbloat. + + Default: 1048576 (16 * 65536) + +tcp_challenge_ack_limit - INTEGER + Limits number of Challenge ACK sent per second, as recommended + in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) + Default: 1000 + +tcp_rx_skb_cache - BOOLEAN + Controls a per TCP socket cache of one skb, that might help + performance of some workloads. This might be dangerous + on systems with a lot of TCP sockets, since it increases + memory usage. + + Default: 0 (disabled) + +UDP variables +============= + +udp_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + + Default: 0 (disabled) + +udp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all UDP sockets. + + min: Below this number of pages UDP is not bothered about its + memory appetite. When amount of memory allocated by UDP exceeds + this number, UDP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all UDP sockets. + + Default is calculated at boot time from amount of available memory. 
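
Because tcp_mem and udp_mem are expressed in pages rather than bytes, it can help to convert a "min pressure max" triple before comparing it with byte-based limits such as tcp_rmem or udp_rmem_min. The snippet below is a minimal Python sketch; the function name is hypothetical and the 4096-byte page size is an assumption that does not hold on every architecture.

```python
def mem_pages_to_bytes(triple, page_size=4096):
    """Convert a 'min pressure max' sysctl string from pages to bytes.

    page_size defaults to 4096, which is an assumption; pass the real
    page size of the target machine when it differs.
    """
    fields = triple.split()
    if len(fields) != 3:
        raise ValueError("expected three integers: min, pressure, max")
    names = ("min", "pressure", "max")
    return {name: int(value) * page_size for name, value in zip(names, fields)}


print(mem_pages_to_bytes("1 2 3"))
```

On Linux the actual page size can be obtained with os.sysconf("SC_PAGE_SIZE"), and the input string would typically be the content of /proc/sys/net/ipv4/udp_mem or /proc/sys/net/ipv4/tcp_mem.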
+ +udp_rmem_min - INTEGER + Minimal size of receive buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for receiving data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + + Default: 4K + +udp_wmem_min - INTEGER + Minimal size of send buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for sending data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + + Default: 4K + +RAW variables +============= + +raw_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + + Default: 1 (enabled) + +CIPSOv4 Variables +================= + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required when means you can safely toggle this on and + off and the cache will always be "safe". + + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). 
+ This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught else + where in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking. + + Default: 0 + +IP Variables +============ + +ip_local_port_range - 2 INTEGERS + Defines the local port range that is used by TCP and UDP to + choose the local port. The first number is the first, the + second the last local port number. + If possible, it is better these numbers have different parity + (one even and one odd value). + Must be greater than or equal to ip_unprivileged_port_start. + The default values are 32768 and 60999 respectively. + +ip_local_reserved_ports - list of comma separated ranges + Specify the ports which are reserved for known third-party + applications. These ports will not be used by automatic port + assignments (e.g. when calling connect() or bind() with port + number 0). Explicit port allocation behavior is unchanged. + + The format used for both input and output is a comma separated + list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and + 10). Writing to the file will clear all previously reserved + ports and update the current list with the one given in the + input. + + Note that ip_local_port_range and ip_local_reserved_ports + settings are independent and both are considered by the kernel + when determining which ports are available for automatic port + assignments. 
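
The range-list format of ip_local_reserved_ports, and its interaction with ip_local_port_range, can be illustrated with a short parser. This is a minimal Python sketch for reasoning about configurations, not the kernel's implementation; both function names are hypothetical.

```python
def parse_port_ranges(text):
    """Expand a list such as '1,2-4,10-10' into a sorted list of ports."""
    ports = set()
    for item in text.strip().split(","):
        if not item:
            continue  # an empty file means no reserved ports
        low, _, high = item.partition("-")
        first = int(low)
        last = int(high) if high else first
        if not 0 <= first <= last <= 65535:
            raise ValueError("invalid port range: " + item)
        ports.update(range(first, last + 1))
    return sorted(ports)


def automatic_ports(port_range, reserved):
    """Ports eligible for automatic assignment: inside the local port
    range and not reserved (the two settings are independent)."""
    first, last = port_range
    blocked = set(reserved)
    return [p for p in range(first, last + 1) if p not in blocked]


print(parse_port_ranges("1,2-4,10-10"))  # [1, 2, 3, 4, 10]
print(len(automatic_ports((32768, 60999), parse_port_ranges("8080,9148"))))  # 28232
```

The second call shows why reserving ports outside the current range has no immediate effect: 8080 and 9148 are not in 32768-60999, so all 28232 ports of the default range remain available.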
+ + You can reserve ports which are not in the current + ip_local_port_range, e.g.:: + + $ cat /proc/sys/net/ipv4/ip_local_port_range + 32000 60999 + $ cat /proc/sys/net/ipv4/ip_local_reserved_ports + 8080,9148 + + although this is redundant. However such a setting is useful + if later the port range is changed to a value that will + include the reserved ports. + + Default: Empty + +ip_unprivileged_port_start - INTEGER + This is a per-namespace sysctl. It defines the first + unprivileged port in the network namespace. Privileged ports + require root or CAP_NET_BIND_SERVICE in order to bind to them. + To disable all privileged ports, set this to 0. They must not + overlap with the ip_local_port_range. + + Default: 1024 + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IP addresses, + which can be quite useful - but may break some applications. + + Default: 0 + +ip_autobind_reuse - BOOLEAN + By default, bind() does not select the ports automatically even if + the new socket and all sockets bound to the port have SO_REUSEADDR. + ip_autobind_reuse allows bind() to reuse the port and this is useful + when you use bind()+connect(), but may break some applications. + The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this + option should only be set by experts. + Default: 0 + +ip_dynaddr - BOOLEAN + If set non-zero, enables support for dynamic addresses. + If set to a non-zero value larger than 1, a kernel log + message will be printed when dynamic address rewriting + occurs. + + Default: 0 + +ip_early_demux - BOOLEAN + Optimize input packet processing down to one demux for + certain kinds of local sockets. Currently we only do this + for established TCP and connected UDP sockets. + + It may add an additional cost for pure routing workloads that + reduces overall throughput, in such case you should disable it. + + Default: 1 + +ping_group_range - 2 INTEGERS + Restrict ICMP_PROTO datagram sockets to users in the group range. 
+ The default is "1 0", meaning, that nobody (not even root) may + create ping sockets. Setting it to "100 100" would grant permissions + to the single group. "0 4294967295" would enable it for the world, "100 + 4294967295" would enable it for the users, but not daemons. + +tcp_early_demux - BOOLEAN + Enable early demux for established TCP sockets. + + Default: 1 + +udp_early_demux - BOOLEAN + Enable early demux for connected UDP sockets. Disable this if + your system could experience more unconnected load. + + Default: 1 + +icmp_echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it. + + Default: 0 + +icmp_echo_ignore_broadcasts - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO and + TIMESTAMP requests sent to it via broadcast/multicast. + + Default: 1 + +icmp_ratelimit - INTEGER + Limit the maximal rates for sending ICMP packets whose type matches + icmp_ratemask (see below) to specific targets. + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Note that another sysctl, icmp_msgs_per_sec limits the number + of ICMP packets sent on all targets. + + Default: 1000 + +icmp_msgs_per_sec - INTEGER + Limit maximal number of ICMP packets sent per second from this host. + Only messages whose type matches icmp_ratemask (see below) are + controlled by this limit. + + Default: 1000 + +icmp_msgs_burst - INTEGER + icmp_msgs_per_sec controls number of ICMP packets sent per second, + while icmp_msgs_burst controls the burst size of these packets. + + Default: 50 + +icmp_ratemask - INTEGER + Mask made of ICMP types for which rates are being limited. 
+ + Significant bits: IHGFEDCBA9876543210 + + Default mask: 0000001100000011000 (6168) + + Bit definitions (see include/linux/icmp.h): + + = ========================= + 0 Echo Reply + 3 Destination Unreachable [1]_ + 4 Source Quench [1]_ + 5 Redirect + 8 Echo Request + B Time Exceeded [1]_ + C Parameter Problem [1]_ + D Timestamp Request + E Timestamp Reply + F Info Request + G Info Reply + H Address Mask Request + I Address Mask Reply + = ========================= + + .. [1] These are rate limited by default (see default mask above) + +icmp_ignore_bogus_error_responses - BOOLEAN + Some routers violate RFC1122 by sending bogus responses to broadcast + frames. Such violations are normally logged via a kernel warning. + If this is set to TRUE, the kernel will not give such warnings, which + will avoid log file clutter. + + Default: 1 + +icmp_errors_use_inbound_ifaddr - BOOLEAN + + If zero, icmp error messages are sent with the primary address of + the exiting interface. + + If non-zero, the message will be sent with the primary address of + the interface that received the packet that caused the icmp error. + This is the behaviour many network administrators will expect from + a router. And it can make debugging complicated network layouts + much easier. + + Note that if no primary address exists for the interface selected, + then the primary address of the first non-loopback interface that + has one will be used regardless of this setting. + + Default: 0 + +igmp_max_memberships - INTEGER + Change the maximum number of multicast groups we can subscribe to. + Default: 20 + + Theoretical maximum value is bounded by having to send a membership + report in a single datagram (i.e. the report can't span multiple + datagrams, or risk confusing the switch and leaving groups you don't + intend to). + + The number of supported groups 'M' is bounded by the number of group + report entries you can fit into a single datagram of 65535 bytes.
+ + M = (65536 - sizeof(ip header)) / sizeof(Group record) + + Group records are variable length, with a minimum of 12 bytes. + So net.ipv4.igmp_max_memberships should not be set higher than: + + (65536-24) / 12 = 5459 + + The value 5459 assumes no IP header options, so in practice + this number may be lower. + +igmp_max_msf - INTEGER + Maximum number of addresses allowed in the source filter list for a + multicast group. + + Default: 10 + +igmp_qrv - INTEGER + Controls the IGMP query robustness variable (see RFC2236 8.1). + + Default: 2 (as specified by RFC2236 8.1) + + Minimum: 1 (as specified by RFC6636 4.5) + +force_igmp_version - INTEGER + - 0 - (default) No enforcement of an IGMP version, IGMPv1/v2 fallback + allowed. Will go back to IGMPv3 mode again once all IGMPv1/v2 Querier + Present timers expire. + - 1 - Enforce to use IGMP version 1. Will also reply with an IGMPv1 report + if an IGMPv2/v3 query is received. + - 2 - Enforce to use IGMP version 2. Will fall back to IGMPv1 if an + IGMPv1 query message is received. Will reply with a report if an IGMPv3 + query is received. + - 3 - Enforce to use IGMP version 3. Reacts the same as the default 0. + + .. note:: + + this is not the same as force_mld_version because the IGMPv3 RFC3376 + Security Considerations do not clearly state that we could + ignore other version messages completely, as MLDv2 RFC3810 does. So + keeping this value at the default 0 is recommended. + +``conf/interface/*`` + changes special settings per interface (where + "interface" is the name of your network interface) + +``conf/all/*`` + is special, changes the settings for all interfaces + +log_martians - BOOLEAN + Log packets with impossible addresses to kernel log. + log_martians for the interface will be enabled if at least one of + conf/{all,interface}/log_martians is set to TRUE, + it will be disabled otherwise + +accept_redirects - BOOLEAN + Accept ICMP redirect messages.
+ accept_redirects for the interface will be enabled if: + + - both conf/{all,interface}/accept_redirects are TRUE in the case + forwarding for the interface is enabled + + or + + - at least one of conf/{all,interface}/accept_redirects is TRUE in the + case forwarding for the interface is disabled + + accept_redirects for the interface will be disabled otherwise + + default: + + - TRUE (host) + - FALSE (router) + +forwarding - BOOLEAN + Enable IP forwarding on this interface. This controls whether packets + received _on_ this interface can be forwarded. + +mc_forwarding - BOOLEAN + Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE + and a multicast routing daemon is required. + conf/all/mc_forwarding must also be set to TRUE to enable multicast + routing for the interface + +medium_id - INTEGER + Integer value used to differentiate the devices by the medium they + are attached to. Two devices can have different id values when + the broadcast packets are received only on one of them. + The default value 0 means that the device is the only interface + to its medium, value of -1 means that medium is not known. + + Currently, it is used to change the proxy_arp behavior: + the proxy_arp feature is enabled for packets forwarded between + two devices attached to different media. + +proxy_arp - BOOLEAN + Do proxy arp. + + proxy_arp for the interface will be enabled if at least one of + conf/{all,interface}/proxy_arp is set to TRUE, + it will be disabled otherwise + +proxy_arp_pvlan - BOOLEAN + Private VLAN proxy arp. + + Basically allow proxy arp replies back to the same interface + (from which the ARP request/solicitation was received). + + This is done to support (ethernet) switch features, like RFC + 3069, where the individual ports are NOT allowed to + communicate with each other, but they are allowed to talk to + the upstream router. 
As described in RFC 3069, it is possible + to allow these hosts to communicate through the upstream + router by proxy_arp'ing. Don't need to be used together with + proxy_arp. + + This technology is known by different names: + + In RFC 3069 it is called VLAN Aggregation. + Cisco and Allied Telesyn call it Private VLAN. + Hewlett-Packard call it Source-Port filtering or port-isolation. + Ericsson call it MAC-Forced Forwarding (RFC Draft). + +shared_media - BOOLEAN + Send(router) or accept(host) RFC1620 shared media redirects. + Overrides secure_redirects. + + shared_media for the interface will be enabled if at least one of + conf/{all,interface}/shared_media is set to TRUE, + it will be disabled otherwise + + default TRUE + +secure_redirects - BOOLEAN + Accept ICMP redirect messages only to gateways listed in the + interface's current gateway list. Even if disabled, RFC1122 redirect + rules still apply. + + Overridden by shared_media. + + secure_redirects for the interface will be enabled if at least one of + conf/{all,interface}/secure_redirects is set to TRUE, + it will be disabled otherwise + + default TRUE + +send_redirects - BOOLEAN + Send redirects, if router. + + send_redirects for the interface will be enabled if at least one of + conf/{all,interface}/send_redirects is set to TRUE, + it will be disabled otherwise + + Default: TRUE + +bootp_relay - BOOLEAN + Accept packets with source address 0.b.c.d destined + not to this host as local ones. It is supposed, that + BOOTP relay daemon will catch and forward such packets. + conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay + for the interface + + default FALSE + + Not Implemented Yet. + +accept_source_route - BOOLEAN + Accept packets with SRR option. + conf/all/accept_source_route must also be set to TRUE to accept packets + with SRR option on the interface + + default + + - TRUE (router) + - FALSE (host) + +accept_local - BOOLEAN + Accept packets with local source addresses. 
In combination with + suitable routing, this can be used to direct packets between two + local interfaces over the wire and have them accepted properly. + default FALSE + +route_localnet - BOOLEAN + Do not consider loopback addresses as martian source or destination + while routing. This enables the use of 127/8 for local routing purposes. + + default FALSE + +rp_filter - INTEGER + - 0 - No source validation. + - 1 - Strict mode as defined in RFC3704 Strict Reverse Path + Each incoming packet is tested against the FIB and if the interface + is not the best reverse path the packet check will fail. + By default failed packets are discarded. + - 2 - Loose mode as defined in RFC3704 Loose Reverse Path + Each incoming packet's source address is also tested against the FIB + and if the source address is not reachable via any interface + the packet check will fail. + + Current recommended practice in RFC3704 is to enable strict mode + to prevent IP spoofing from DDos attacks. If using asymmetric routing + or other complicated routing, then loose mode is recommended. + + The max value from conf/{all,interface}/rp_filter is used + when doing source validation on the {interface}. + + Default value is 0. Note that some distributions enable it + in startup scripts. + +arp_filter - BOOLEAN + - 1 - Allows you to have multiple network interfaces on the same + subnet, and have the ARPs for each interface be answered + based on whether or not the kernel would route a packet from + the ARP'd IP out that interface (therefore you must use source + based routing for this to work). In other words it allows control + of which cards (usually 1) will respond to an arp request. + + - 0 - (default) The kernel can respond to arp requests with addresses + from other interfaces. This may seem wrong but it usually makes + sense, because it increases the chance of successful communication. + IP addresses are owned by the complete host on Linux, not by + particular interfaces. 
Only for more complex setups like load- + balancing, does this behaviour cause problems. + + arp_filter for the interface will be enabled if at least one of + conf/{all,interface}/arp_filter is set to TRUE, + it will be disabled otherwise + +arp_announce - INTEGER + Define different restriction levels for announcing the local + source IP address from IP packets in ARP requests sent on + interface: + + - 0 - (default) Use any local address, configured on any interface + - 1 - Try to avoid local addresses that are not in the target's + subnet for this interface. This mode is useful when target + hosts reachable via this interface require the source IP + address in ARP requests to be part of their logical network + configured on the receiving interface. When we generate the + request we will check all our subnets that include the + target IP and will preserve the source address if it is from + such subnet. If there is no such subnet we select source + address according to the rules for level 2. + - 2 - Always use the best local address for this target. + In this mode we ignore the source address in the IP packet + and try to select local address that we prefer for talks with + the target host. Such local address is selected by looking + for primary IP addresses on all our subnets on the outgoing + interface that include the target IP address. If no suitable + local address is found we select the first local address + we have on the outgoing interface or on all other interfaces, + with the hope we will receive reply for our request and + even sometimes no matter the source IP address we announce. + + The max value from conf/{all,interface}/arp_announce is used. + + Increasing the restriction level gives more chance for + receiving answer from the resolved target while decreasing + the level announces more valid sender's information. 
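As a rough illustration of the "max value" rule stated for arp_announce above, the following Python sketch reads the two sysctl files and returns the larger value. The function names and the ``root`` parameter are hypothetical and exist only for this example; the directory layout is the conf/{all,interface} layout described in this document.

```python
from pathlib import Path


def effective_level(all_value: int, iface_value: int) -> int:
    """Return the level that is applied: the max of the two settings."""
    return max(all_value, iface_value)


def read_effective_arp_announce(iface: str,
                                root: str = "/proc/sys/net/ipv4/conf") -> int:
    """Combine conf/all/arp_announce and conf/<iface>/arp_announce.

    `root` is a parameter only so that the sketch can be tried against a
    scratch directory; it is not part of any kernel interface.
    """
    base = Path(root)
    all_value = int((base / "all" / "arp_announce").read_text().strip())
    iface_value = int((base / iface / "arp_announce").read_text().strip())
    return effective_level(all_value, iface_value)
```

With conf/all set to 0 and the interface set to 2, the sketch reports 2, so the more restrictive of the two settings wins.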
+ +arp_ignore - INTEGER + Define different modes for sending replies in response to + received ARP requests that resolve local target IP addresses: + + - 0 - (default): reply for any local target IP address, configured + on any interface + - 1 - reply only if the target IP address is a local address + configured on the incoming interface + - 2 - reply only if the target IP address is a local address + configured on the incoming interface and both it and the + sender's IP address are part of the same subnet on this interface + - 3 - do not reply for local addresses configured with scope host, + only resolutions for global and link addresses are replied + - 4-7 - reserved + - 8 - do not reply for any local address + + The max value from conf/{all,interface}/arp_ignore is used + when ARP request is received on the {interface} + +arp_notify - BOOLEAN + Define mode for notification of address and device changes. + + == ========================================================== + 0 (default): do nothing + 1 Generate gratuitous arp requests when device is brought up + or hardware address changes. + == ========================================================== + +arp_accept - BOOLEAN + Define behavior for gratuitous ARP frames whose IP is not + already present in the ARP table: + + - 0 - don't create new entries in the ARP table + - 1 - create new entries in the ARP table + + Both reply-type and request-type gratuitous ARP frames will trigger the + ARP table to be updated, if this setting is on. + + If the ARP table already contains the IP address of the + gratuitous arp frame, the arp table will be updated regardless + of whether this setting is on or off. + +mcast_solicit - INTEGER + The maximum number of multicast probes in INCOMPLETE state, + when the associated hardware address is unknown. Defaults + to 3. + +ucast_solicit - INTEGER + The maximum number of unicast probes in PROBE state, when + the hardware address is being reconfirmed. Defaults to 3.
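The arp_accept rules described earlier can be summarised as a small decision function. This is only a Python sketch of the documented behaviour, not kernel code; the function and parameter names are made up for the example.

```python
def gratuitous_arp_updates_table(ip_in_table: bool, arp_accept: int) -> bool:
    """Decide whether a gratuitous ARP frame changes the ARP table.

    ip_in_table: True if the frame's IP already has an ARP table entry.
    arp_accept:  value of the arp_accept sysctl (0 or 1).
    """
    if ip_in_table:
        # Existing entries are updated regardless of the setting.
        return True
    # Unknown IPs only get a new entry when arp_accept is on.
    return arp_accept == 1
```

The only case where the setting matters is a gratuitous ARP frame for an address that is not yet in the table.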
+ +app_solicit - INTEGER + The maximum number of probes to send to the user space ARP daemon + via netlink before dropping back to multicast probes (see + mcast_resolicit). Defaults to 0. + +mcast_resolicit - INTEGER + The maximum number of multicast probes after unicast and + app probes in PROBE state. Defaults to 0. + +disable_policy - BOOLEAN + Disable IPSEC policy (SPD) for this interface + +disable_xfrm - BOOLEAN + Disable IPSEC encryption on this interface, whatever the policy + +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + + Default: 10000 (10 seconds) + +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + + Default: 1000 (1 seconds) + +promote_secondaries - BOOLEAN + When a primary IP address is removed from this interface + promote a corresponding secondary IP address instead of + removing all the corresponding secondary IP addresses. + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IP packets that are received in link-layer + multicast (or broadcast) frames. + + This behavior (for multicast) is actually a SHOULD in RFC + 1122, but is disabled by default for compatibility reasons. + + Default: off (0) + +drop_gratuitous_arp - BOOLEAN + Drop all gratuitous ARP frames, for example if there's a known + good ARP proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + Default: off (0) + + +tag - INTEGER + Allows you to write a number, which can be used as required. + + Default value is 0. + +xfrm4_gc_thresh - INTEGER + (Obsolete since linux-4.14) + The threshold at which we will start garbage collecting for IPv4 + destination cache entries. At twice this value the system will + refuse new allocations. 
+ +igmp_link_local_mcast_reports - BOOLEAN + Enable IGMP reports for link local multicast groups in the + 224.0.0.X range. + + Default TRUE + +Alexey Kuznetsov. +kuznet@ms2.inr.ac.ru + +Updated by: + +- Andi Kleen + ak@muc.de +- Nicolas Delon + delon.nicolas@wanadoo.fr + + + + +/proc/sys/net/ipv6/* Variables +============================== + +IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also +apply to IPv6 [XXX?]. + +bindv6only - BOOLEAN + Default value for IPV6_V6ONLY socket option, + which restricts use of the IPv6 socket to IPv6 communication + only. + + - TRUE: disable IPv4-mapped address feature + - FALSE: enable IPv4-mapped address feature + + Default: FALSE (as specified in RFC3493) + +flowlabel_consistency - BOOLEAN + Protect the consistency (and unicity) of flow label. + You have to disable it to use IPV6_FL_F_REFLECT flag on the + flow label manager. + + - TRUE: enabled + - FALSE: disabled + + Default: TRUE + +auto_flowlabels - INTEGER + Automatically generate flow labels based on a flow hash of the + packet. This allows intermediate devices, such as routers, to + identify packet flows for mechanisms like Equal Cost Multipath + Routing (see RFC 6438). + + = =========================================================== + 0 automatic flow labels are completely disabled + 1 automatic flow labels are enabled by default, they can be + disabled on a per socket basis using the IPV6_AUTOFLOWLABEL + socket option + 2 automatic flow labels are allowed, they may be enabled on a + per socket basis using the IPV6_AUTOFLOWLABEL socket option + 3 automatic flow labels are enabled and enforced, they cannot + be disabled by the socket option + = =========================================================== + + Default: 1 + +flowlabel_state_ranges - BOOLEAN + Split the flow label number space into two ranges. 0-0x7FFFF is + reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF + is reserved for stateless flow labels as described in RFC6437. 
+ + - TRUE: enabled + - FALSE: disabled + + Default: true + +flowlabel_reflect - INTEGER + Control flow label reflection. Needed for Path MTU + Discovery to work with Equal Cost Multipath Routing in anycast + environments. See RFC 7690 and: + https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 + + This is a bitmask. + + - 1: enabled for established flows + + Note that this prevents automatic flowlabel changes, as done + in "tcp: change IPv6 flow-label upon receiving spurious retransmission" + and "tcp: Change txhash on every SYN and RTO retransmit" + + - 2: enabled for TCP RESET packets (no active listener) + If set, a RST packet sent in response to a SYN packet on a closed + port will reflect the incoming flow label. + + - 4: enabled for ICMPv6 echo reply messages. + + Default: 0 + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. + + Default: 0 (Layer 3) + + Possible values: + + - 0 - Layer 3 (source and destination addresses plus flow label) + - 1 - Layer 4 (standard 5-tuple) + - 2 - Layer 3 or inner Layer 3 if present + +anycast_src_echo_reply - BOOLEAN + Controls the use of anycast addresses as source addresses for ICMPv6 + echo reply + + - TRUE: enabled + - FALSE: disabled + + Default: FALSE + +idgen_delay - INTEGER + Controls the delay in seconds after which time to retry + privacy stable address generation if a DAD conflict is + detected. + + Default: 1 (as specified in RFC7217) + +idgen_retries - INTEGER + Controls the number of retries to generate a stable privacy + address if a DAD conflict is detected. + + Default: 3 (as specified in RFC7217) + +mld_qrv - INTEGER + Controls the MLD query robustness variable (see RFC3810 9.1). + + Default: 2 (as specified by RFC3810 9.1) + + Minimum: 1 (as specified by RFC6636 4.5) + +max_dst_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Destination + options extension header. 
If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + + Default: 8 + +max_hbh_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Hop-by-Hop + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + + Default: 8 + +max_dst_opts_length - INTEGER + Maximum length allowed for a Destination options extension + header. + + Default: INT_MAX (unlimited) + +max_hbh_length - INTEGER + Maximum length allowed for a Hop-by-Hop options extension + header. + + Default: INT_MAX (unlimited) + +skip_notify_on_dev_down - BOOLEAN + Controls whether an RTM_DELROUTE message is generated for routes + removed when a device is taken down or deleted. IPv4 does not + generate this message; IPv6 does by default. Setting this sysctl + to true skips the message, making IPv4 and IPv6 on par in relying + on userspace caches to track link events and evict routes. + + Default: false (generate message) + +nexthop_compat_mode - BOOLEAN + New nexthop API provides a means for managing nexthops independent of + prefixes. Backwards compatibility with old route format is enabled by + default which means route dumps and notifications contain the new + nexthop attribute but also the full, expanded nexthop definition. + Further, updates or deletes of a nexthop configuration generate route + notifications for each fib entry using the nexthop. Once a system + understands the new API, this sysctl can be disabled to achieve full + performance benefits of the new API by disabling the nexthop expansion + and extraneous notifications. + Default: true (backward compat mode) + +IPv6 Fragmentation: + +ip6frag_high_thresh - INTEGER + Maximum memory used to reassemble IPv6 fragments.
When + ip6frag_high_thresh bytes of memory is allocated for this purpose, + the fragment handler will toss packets until ip6frag_low_thresh + is reached. + +ip6frag_low_thresh - INTEGER + See ip6frag_high_thresh + +ip6frag_time - INTEGER + Time in seconds to keep an IPv6 fragment in memory. + +IPv6 Segment Routing: + +seg6_flowlabel - INTEGER + Controls the behaviour of computing the flowlabel of outer + IPv6 header in case of SR T.encaps + + == ======================================================= + -1 set flowlabel to zero. + 0 copy flowlabel from Inner packet in case of Inner IPv6 + (Set flowlabel to 0 in case IPv4/L2) + 1 Compute the flowlabel using seg6_make_flowlabel() + == ======================================================= + + Default is 0. + +``conf/default/*``: + Change the interface-specific default settings. + + +``conf/all/*``: + Change all the interface-specific settings. + + [XXX: Other special features than forwarding?] + +conf/all/forwarding - BOOLEAN + Enable global IPv6 forwarding between all interfaces. + + IPv4 and IPv6 work differently here; e.g. netfilter must be used + to control which interfaces may forward packets and which not. + + This also sets all interfaces' Host/Router setting + 'forwarding' to the specified value. See below for details. + + This is referred to as global forwarding. + +proxy_ndp - BOOLEAN + Do proxy ndp. + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv6 reply packets that are not + associated with a socket (for example, TCP RSTs or ICMPv6 echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + + Default: 0 + +``conf/interface/*``: + Change special settings per interface. + + The functional behaviour for certain settings is different + depending on whether local forwarding is enabled or not. + +accept_ra - INTEGER + Accept Router Advertisements; autoconfigure using them.
+ + It also determines whether or not to transmit Router + Solicitations. If and only if the functional setting is to + accept Router Advertisements, Router Solicitations will be + transmitted. + + Possible values are: + + == =========================================================== + 0 Do not accept Router Advertisements. + 1 Accept Router Advertisements if forwarding is disabled. + 2 Overrule forwarding behaviour. Accept Router Advertisements + even if forwarding is enabled. + == =========================================================== + + Functional default: + + - enabled if local forwarding is disabled. + - disabled if local forwarding is enabled. + +accept_ra_defrtr - BOOLEAN + Learn default router in Router Advertisement. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_from_local - BOOLEAN + Accept RA with source-address that is found on local machine + if the RA is otherwise proper and able to be accepted. + + Default is to NOT accept these as it may be an un-intended + network loop. + + Functional default: + + - enabled if accept_ra_from_local is enabled + on a specific interface. + - disabled if accept_ra_from_local is disabled + on a specific interface. + +accept_ra_min_hop_limit - INTEGER + Minimum hop limit Information in Router Advertisement. + + Hop limit Information in Router Advertisement less than this + variable shall be ignored. + + Default: 1 + +accept_ra_pinfo - BOOLEAN + Learn Prefix Information in Router Advertisement. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_rt_info_min_plen - INTEGER + Minimum prefix length of Route Information in RA. + + Route Information w/ prefix smaller than this variable shall + be ignored. + + Functional default: + + * 0 if accept_ra_rtr_pref is enabled. + * -1 if accept_ra_rtr_pref is disabled. 
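The interaction between accept_ra and forwarding described above can be written down as a short Python sketch. The function name is hypothetical and the code only restates the table of accept_ra values; it is not taken from the kernel sources.

```python
def accepts_router_advertisements(accept_ra: int, forwarding: bool) -> bool:
    """Return True if Router Advertisements are accepted on an interface.

    accept_ra:  value of conf/<interface>/accept_ra (0, 1 or 2).
    forwarding: True if forwarding is enabled on the interface.
    """
    if accept_ra == 0:
        return False            # never accept
    if accept_ra == 1:
        return not forwarding   # accept only while acting as a host
    if accept_ra == 2:
        return True             # overrule the forwarding behaviour
    raise ValueError(f"unexpected accept_ra value: {accept_ra}")
```

Because Router Solicitations are transmitted if and only if Router Advertisements are accepted, the same result also tells whether solicitations are sent.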
+ +accept_ra_rt_info_max_plen - INTEGER + Maximum prefix length of Route Information in RA. + + Route Information w/ prefix larger than this variable shall + be ignored. + + Functional default: + + * 0 if accept_ra_rtr_pref is enabled. + * -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rtr_pref - BOOLEAN + Accept Router Preference in RA. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_mtu - BOOLEAN + Apply the MTU value specified in RA option 5 (RFC4861). If + disabled, the MTU specified in the RA will be ignored. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_redirects - BOOLEAN + Accept Redirects. + + Functional default: + + - enabled if local forwarding is disabled. + - disabled if local forwarding is enabled. + +accept_source_route - INTEGER + Accept source routing (routing extension header). + + - >= 0: Accept only routing header type 2. + - < 0: Do not accept routing header. + + Default: 0 + +autoconf - BOOLEAN + Autoconfigure addresses using Prefix Information in Router + Advertisements. + + Functional default: + + - enabled if accept_ra_pinfo is enabled. + - disabled if accept_ra_pinfo is disabled. + +dad_transmits - INTEGER + The amount of Duplicate Address Detection probes to send. + + Default: 1 + +forwarding - INTEGER + Configure interface-specific Host/Router behaviour. + + .. note:: + + It is recommended to have the same setting on all + interfaces; mixed router/host scenarios are rather uncommon. + + Possible values are: + + - 0 Forwarding disabled + - 1 Forwarding enabled + + **FALSE (0)**: + + By default, Host behaviour is assumed. This means: + + 1. IsRouter flag is not set in Neighbour Advertisements. + 2. If accept_ra is TRUE (default), transmit Router + Solicitations. + 3. If accept_ra is TRUE (default), accept Router + Advertisements (and do autoconfiguration). + 4. 
If accept_redirects is TRUE (default), accept Redirects. + + **TRUE (1)**: + + If local forwarding is enabled, Router behaviour is assumed. + This means exactly the reverse from the above: + + 1. IsRouter flag is set in Neighbour Advertisements. + 2. Router Solicitations are not sent unless accept_ra is 2. + 3. Router Advertisements are ignored unless accept_ra is 2. + 4. Redirects are ignored. + + Default: 0 (disabled) if global forwarding is disabled (default), + otherwise 1 (enabled). + +hop_limit - INTEGER + Default Hop Limit to set. + + Default: 64 + +mtu - INTEGER + Default Maximum Transfer Unit + + Default: 1280 (IPv6 required minimum) + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IPv6 addresses, + which can be quite useful - but may break some applications. + + Default: 0 + +router_probe_interval - INTEGER + Minimum interval (in seconds) between Router Probing described + in RFC4191. + + Default: 60 + +router_solicitation_delay - INTEGER + Number of seconds to wait after interface is brought up + before sending Router Solicitations. + + Default: 1 + +router_solicitation_interval - INTEGER + Number of seconds to wait between Router Solicitations. + + Default: 4 + +router_solicitations - INTEGER + Number of Router Solicitations to send until assuming no + routers are present. + + Default: 3 + +use_oif_addrs_only - BOOLEAN + When enabled, the candidate source addresses for destinations + routed via this interface are restricted to the set of addresses + configured on this interface (vis. RFC 6724, section 4). + + Default: false + +use_tempaddr - INTEGER + Preference for Privacy Extensions (RFC3041). + + * <= 0 : disable Privacy Extensions + * == 1 : enable Privacy Extensions, but prefer public + addresses over temporary addresses. + * > 1 : enable Privacy Extensions and prefer temporary + addresses over public addresses. 
+ + Default: + + * 0 (for most devices) + * -1 (for point-to-point devices and loopback devices) + +temp_valid_lft - INTEGER + Valid lifetime (in seconds) for temporary addresses. + + Default: 604800 (7 days) + +temp_prefered_lft - INTEGER + Preferred lifetime (in seconds) for temporary addresses. + + Default: 86400 (1 day) + +keep_addr_on_down - INTEGER + Keep all IPv6 addresses on an interface down event. If set, static + global addresses with no expiration time are not flushed. + + * >0 : enabled + * 0 : system default + * <0 : disabled + + Default: 0 (addresses are removed) + +max_desync_factor - INTEGER + Maximum value for DESYNC_FACTOR, which is a random value + that ensures that clients don't synchronize with each + other and generate new addresses at exactly the same time. + Value is in seconds. + + Default: 600 + +regen_max_retry - INTEGER + Number of attempts before giving up on generating + valid temporary addresses. + + Default: 5 + +max_addresses - INTEGER + Maximum number of autoconfigured addresses per interface. Setting + to zero disables the limitation. It is not recommended to set this + value too large (or to zero) because it would be an easy way to + crash the kernel by allowing too many addresses to be created. + + Default: 16 + +disable_ipv6 - BOOLEAN + Disable IPv6 operation. If accept_dad is set to 2, this value + will be dynamically set to TRUE if DAD fails for the link-local + address. + + Default: FALSE (enable IPv6 operation) + + When this value is changed from 1 to 0 (IPv6 is being enabled), + it will dynamically create a link-local address on the given + interface and start Duplicate Address Detection, if necessary. + + When this value is changed from 0 to 1 (IPv6 is being disabled), + it will dynamically delete all addresses and routes on the given + interface. From now on it will not be possible to add addresses/routes + to the selected interface. + +accept_dad - INTEGER + Whether to accept DAD (Duplicate Address Detection).
+ + == ============================================================== + 0 Disable DAD + 1 Enable DAD (default) + 2 Enable DAD, and disable IPv6 operation if MAC-based duplicate + link-local address has been found. + == ============================================================== + + DAD operation and mode on a given interface will be selected according + to the maximum value of conf/{all,interface}/accept_dad. + +force_tllao - BOOLEAN + Enable sending the target link-layer address option even when + responding to a unicast neighbor solicitation. + + Default: FALSE + + Quoting from RFC 2461, section 4.4, Target link-layer address: + + "The option MUST be included for multicast solicitations in order to + avoid infinite Neighbor Solicitation "recursion" when the peer node + does not have a cache entry to return a Neighbor Advertisements + message. When responding to unicast solicitations, the option can be + omitted since the sender of the solicitation has the correct link- + layer address; otherwise it would not have be able to send the unicast + solicitation in the first place. However, including the link-layer + address in this case adds little overhead and eliminates a potential + race condition where the sender deletes the cached link-layer address + prior to receiving a response to a previous solicitation." + +ndisc_notify - BOOLEAN + Define mode for notification of address and device changes. + + * 0 - (default): do nothing + * 1 - Generate unsolicited neighbour advertisements when device is brought + up or hardware address changes. + +ndisc_tclass - INTEGER + The IPv6 Traffic Class to use by default when sending IPv6 Neighbor + Discovery (Router Solicitation, Router Advertisement, Neighbor + Solicitation, Neighbor Advertisement, Redirect) messages. + These 8 bits can be interpreted as 6 high order bits holding the DSCP + value and 2 low order bits representing ECN (which you probably want + to leave cleared). 
+ + * 0 - (default) + +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + + Default: 1000 (1 second) + +force_mld_version - INTEGER + * 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + * 1 - Enforce to use MLD version 1 + * 2 - Enforce to use MLD version 2 + +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + + * 1 - (default) discard fragmented neighbor discovery packets + * 0 - allow fragmented neighbor discovery packets + +optimistic_dad - BOOLEAN + Whether to perform Optimistic Duplicate Address Detection (RFC 4429). + + * 0: disabled (default) + * 1: enabled + + Optimistic Duplicate Address Detection for the interface will be enabled + if at least one of conf/{all,interface}/optimistic_dad is set to 1, + it will be disabled otherwise. + +use_optimistic - BOOLEAN + If enabled, do not classify optimistic addresses as deprecated during + source address selection. Preferred addresses will still be chosen + before optimistic addresses, subject to other ranking in the source + address selection algorithm. + + * 0: disabled (default) + * 1: enabled + + This will be enabled if at least one of + conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. + +stable_secret - IPv6 address + This IPv6 address will be used as a secret to generate IPv6 + addresses for link-local addresses and autoconfigured + ones. All addresses generated after setting this secret will + be stable privacy ones by default. This can be changed via the + addrgenmode ip-link. conf/default/stable_secret is used as the + secret for the namespace, the interface specific ones can + overwrite that. 
Writes to conf/all/stable_secret are refused. + + It is recommended to generate this secret during installation + of a system and keep it stable after that. + + By default the stable secret is unset. + +addr_gen_mode - INTEGER + Defines how link-local and autoconf addresses are generated. + + = ================================================================= + 0 generate address based on EUI64 (default) + 1 do not generate a link-local address, use EUI64 for addresses + generated from autoconf + 2 generate stable privacy addresses, using the secret from + stable_secret (RFC7217) + 3 generate stable privacy addresses, using a random secret if unset + = ================================================================= + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IPv6 packets that are received in link-layer + multicast (or broadcast) frames. + + By default this is turned off. + +drop_unsolicited_na - BOOLEAN + Drop all unsolicited neighbor advertisements, for example if there's + a known good NA proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + By default this is turned off. + +enhanced_dad - BOOLEAN + Include a nonce option in the IPv6 neighbor solicitation messages used for + duplicate address detection per RFC7527. A received DAD NS will only signal + a duplicate address if the nonce is different. This avoids any false + detection of duplicates due to loopback of the NS messages that we send. + The nonce option will be sent on an interface unless both of + conf/{all,interface}/enhanced_dad are set to FALSE. + + Default: TRUE + +``icmp/*``: +=========== + +ratelimit - INTEGER + Limit the maximal rates for sending ICMPv6 messages. + + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds.
+ + Default: 1000 + +ratemask - list of comma separated ranges + For ICMPv6 message types matching the ranges in the ratemask, limit + the sending of the message according to ratelimit parameter. + + The format used for both input and output is a comma separated + list of ranges (e.g. "0-127,129" for ICMPv6 message type 0 to 127 and + 129). Writing to the file will clear all previous ranges of ICMPv6 + message types and update the current list with the input. + + Refer to: https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml + for numerical values of ICMPv6 message types, e.g. echo request is 128 + and echo reply is 129. + + Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big) + +echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol. + + Default: 0 + +echo_ignore_multicast - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol via multicast. + + Default: 0 + +echo_ignore_anycast - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol destined to anycast address. + + Default: 0 + +xfrm6_gc_thresh - INTEGER + (Obsolete since linux-4.14) + The threshold at which we will start garbage collecting for IPv6 + destination cache entries. At twice this value the system will + refuse new allocations. + + +IPv6 Update by: +Pekka Savola +YOSHIFUJI Hideaki / USAGI Project + + +/proc/sys/net/bridge/* Variables: +================================= + +bridge-nf-call-arptables - BOOLEAN + - 1 : pass bridged ARP traffic to arptables' FORWARD chain. + - 0 : disable this. + + Default: 1 + +bridge-nf-call-iptables - BOOLEAN + - 1 : pass bridged IPv4 traffic to iptables' chains. + - 0 : disable this. + + Default: 1 + +bridge-nf-call-ip6tables - BOOLEAN + - 1 : pass bridged IPv6 traffic to ip6tables' chains. + - 0 : disable this. 
+ + Default: 1 + +bridge-nf-filter-vlan-tagged - BOOLEAN + - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. + - 0 : disable this. + + Default: 0 + +bridge-nf-filter-pppoe-tagged - BOOLEAN + - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. + - 0 : disable this. + + Default: 0 + +bridge-nf-pass-vlan-input-dev - BOOLEAN + - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan + interface on the bridge and set the netfilter input device to the + vlan. This allows use of e.g. "iptables -i br0.1" and makes the + REDIRECT target work with vlan-on-top-of-bridge interfaces. When no + matching vlan interface is found, or this switch is off, the input + device is set to the bridge interface. + + - 0: disable bridge netfilter vlan interface lookup. + + Default: 0 + +``proc/sys/net/sctp/*`` Variables: +================================== + +addip_enable - BOOLEAN + Enable or disable extension of Dynamic Address Reconfiguration + (ADD-IP) functionality specified in RFC5061. This extension provides + the ability to dynamically add and remove new addresses for the SCTP + associations. + + 1: Enable extension. + + 0: Disable extension. + + Default: 0 + +pf_enable - INTEGER + Enable or disable pf (pf is short for potentially failed) state. A value + of pf_retrans > path_max_retrans also disables pf state. That is, one of + both pf_enable and pf_retrans > path_max_retrans can disable pf state. + Since pf_retrans and path_max_retrans can be changed by userspace + application, sometimes user expects to disable pf state by the value of + pf_retrans > path_max_retrans, but occasionally the value of pf_retrans + or path_max_retrans is changed by the user application, this pf state is + enabled. As such, it is necessary to add this to dynamically enable + and disable pf state. See: + https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for + details. + + 1: Enable pf. + + 0: Disable pf. 
+ + Default: 1 + +pf_expose - INTEGER + Unset or enable/disable pf (pf is short for potentially failed) state + exposure. Applications can control the exposure of the PF path state + in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO + sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with + SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info + can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, + a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming + SCTP_PF state and a SCTP_PF-state transport info can be got via + SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no + SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when + trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO + sockopt. + + 0: Unset pf state exposure, Compatible with old applications. + + 1: Disable pf state exposure. + + 2: Enable pf state exposure. + + Default: 0 + +addip_noauth_enable - BOOLEAN + Dynamic Address Reconfiguration (ADD-IP) requires the use of + authentication to protect the operations of adding or removing new + addresses. This requirement is mandated so that unauthorized hosts + would not be able to hijack associations. However, older + implementations may not have implemented this requirement while + allowing the ADD-IP extension. For reasons of interoperability, + we provide this variable to control the enforcement of the + authentication requirement. + + == =============================================================== + 1 Allow ADD-IP extension to be used without authentication. This + should only be set in a closed environment for interoperability + with older implementations. + + 0 Enforce the authentication requirement + == =============================================================== + + Default: 0 + +auth_enable - BOOLEAN + Enable or disable Authenticated Chunks extension.
This extension + provides the ability to send and receive authenticated chunks and is + required for secure operation of Dynamic Address Reconfiguration + (ADD-IP) extension. + + - 1: Enable this extension. + - 0: Disable this extension. + + Default: 0 + +prsctp_enable - BOOLEAN + Enable or disable the Partial Reliability extension (RFC3758) which + is used to notify peers that a given DATA should no longer be expected. + + - 1: Enable extension + - 0: Disable + + Default: 1 + +max_burst - INTEGER + The limit of the number of new packets that can be initially sent. It + controls how bursty the generated traffic can be. + + Default: 4 + +association_max_retrans - INTEGER + Set the maximum number for retransmissions that an association can + attempt before deciding that the remote end is unreachable. If this value + is exceeded, the association is terminated. + + Default: 10 + +max_init_retransmits - INTEGER + The maximum number of retransmissions of INIT and COOKIE-ECHO chunks + that an association will attempt before declaring the destination + unreachable and terminating. + + Default: 8 + +path_max_retrans - INTEGER + The maximum number of retransmissions that will be attempted on a given + path. Once this threshold is exceeded, the path is considered + unreachable, and new traffic will use a different path when the + association is multihomed. + + Default: 5 + +pf_retrans - INTEGER + The number of retransmissions that will be attempted on a given path + before traffic is redirected to an alternate transport (should one + exist). Note this is distinct from path_max_retrans, as a path that + passes the pf_retrans threshold can still be used. It is only + deprioritized when a transmission path is selected by the stack. This + setting is primarily used to enable fast failover mechanisms without + having to reduce path_max_retrans to a very low value. See: + http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt + for details.
Note also that a value of pf_retrans > path_max_retrans + disables this feature. Since both pf_retrans and path_max_retrans can + be changed by userspace application, a variable pf_enable is used to + disable pf state. + + Default: 0 + +ps_retrans - INTEGER + Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming + from section-5 "Primary Path Switchover" in rfc7829. The primary path + will be changed to another active path when the path error counter on + the old primary path exceeds PSMR, so that "the SCTP sender is allowed + to continue data transmission on a new working path even when the old + primary destination address becomes active again". Note this feature + is disabled by initializing 'ps_retrans' per netns as 0xffff by default, + and its value can't be less than 'pf_retrans' when changing by sysctl. + + Default: 0xffff + +rto_initial - INTEGER + The initial round trip timeout value in milliseconds that will be used + in calculating round trip times. This is the initial time interval + for retransmissions. + + Default: 3000 + +rto_max - INTEGER + The maximum value (in milliseconds) of the round trip timeout. This + is the largest time interval that can elapse between retransmissions. + + Default: 60000 + +rto_min - INTEGER + The minimum value (in milliseconds) of the round trip timeout. This + is the smallest time interval that can elapse between retransmissions. + + Default: 1000 + +hb_interval - INTEGER + The interval (in milliseconds) between HEARTBEAT chunks. These chunks + are sent at the specified interval on idle paths to probe the state of + a given path between 2 associations. + + Default: 30000 + +sack_timeout - INTEGER + The amount of time (in milliseconds) that the implementation will wait + to send a SACK. + + Default: 200 + +valid_cookie_life - INTEGER + The default lifetime of the SCTP cookie (in milliseconds). The cookie + is used during association establishment.
+ + Default: 60000 + +cookie_preserve_enable - BOOLEAN + Enable or disable the ability to extend the lifetime of the SCTP cookie + that is used during the establishment phase of SCTP association + + - 1: Enable cookie lifetime extension. + - 0: Disable + + Default: 1 + +cookie_hmac_alg - STRING + Select the hmac algorithm used when generating the cookie value sent by + a listening sctp socket to a connecting client in the INIT-ACK chunk. + Valid values are: + + * md5 + * sha1 + * none + + Ability to assign md5 or sha1 as the selected alg is predicated on the + configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and + CONFIG_CRYPTO_SHA1). + + Default: Dependent on configuration. MD5 if available, else SHA1 if + available, else none. + +rcvbuf_policy - INTEGER + Determines if the receive buffer is attributed to the socket or to + association. SCTP supports the capability to create multiple + associations on a single socket. When using this capability, it is + possible that a single stalled association that's buffering a lot + of data may block other associations from delivering their data by + consuming all of the receive buffer space. To work around this, + the rcvbuf_policy could be set to attribute the receiver buffer space + to each association instead of the socket. This prevents the described + blocking. + + - 1: rcvbuf space is per association + - 0: rcvbuf space is per socket + + Default: 0 + +sndbuf_policy - INTEGER + Similar to rcvbuf_policy above, this applies to send buffer space. + + - 1: Send buffer is tracked per association + - 0: Send buffer is tracked per socket. + + Default: 0 + +sctp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all SCTP sockets. + + min: Below this number of pages SCTP is not bothered about its + memory appetite. When amount of memory allocated by SCTP exceeds + this number, SCTP starts to moderate memory usage. 
+ + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all SCTP sockets. + + Default is calculated at boot time from amount of available memory. + +sctp_rmem - vector of 3 INTEGERs: min, default, max + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 4K + +sctp_wmem - vector of 3 INTEGERs: min, default, max + Currently this tunable has no effect. + +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + - 0 - Disable IPv4 address scoping + - 1 - Enable IPv4 address scoping + - 2 - Follow draft but allow IPv4 private addresses + - 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + + +``/proc/sys/net/core/*`` +======================== + + Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. + + +``/proc/sys/net/unix/*`` +======================== + +max_dgram_qlen - INTEGER + The maximum length of dgram socket receive queue + + Default: 10 + diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt deleted file mode 100644 index 5cdc37c34830..000000000000 --- a/Documentation/networking/ip-sysctl.txt +++ /dev/null @@ -1,2374 +0,0 @@ -/proc/sys/net/ipv4/* Variables: - -ip_forward - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Forward Packets between interfaces. - - This variable is special, its change resets all configuration - parameters to their default state (RFC1122 for hosts, RFC1812 - for routers) - -ip_default_ttl - INTEGER - Default value of TTL field (Time To Live) for outgoing (but not - forwarded) IP packets. Should be between 1 and 255 inclusive. - Default: 64 (as recommended by RFC1700) - -ip_no_pmtu_disc - INTEGER - Disable Path MTU Discovery. 
If enabled in mode 1 and a - fragmentation-required ICMP is received, the PMTU to this - destination will be set to min_pmtu (see below). You will need - to raise min_pmtu to the smallest interface MTU on your system - manually if you want to avoid locally generated fragments. - - In mode 2 incoming Path MTU Discovery messages will be - discarded. Outgoing frames are handled the same as in mode 1, - implicitly setting IP_PMTUDISC_DONT on every created socket. - - Mode 3 is a hardened pmtu discover mode. The kernel will only - accept fragmentation-needed errors if the underlying protocol - can verify them besides a plain socket lookup. Current - protocols for which pmtu events will be honored are TCP, SCTP - and DCCP as they verify e.g. the sequence number or the - association. This mode should not be enabled globally but is - only intended to secure e.g. name servers in namespaces where - TCP path mtu must still work but path MTU information of other - protocols should be discarded. If enabled globally this mode - could break other protocols. - - Possible values: 0-3 - Default: FALSE - -min_pmtu - INTEGER - default 552 - minimum discovered Path MTU - -ip_forward_use_pmtu - BOOLEAN - By default we don't trust protocol path MTUs while forwarding - because they could be easily forged and can lead to unwanted - fragmentation by the router. - You only need to enable this if you have user-space software - which tries to discover path mtus by itself and depends on the - kernel honoring this information. This is normally not the - case. - Default: 0 (disabled) - Possible values: - 0 - disabled - 1 - enabled - -fwmark_reflect - BOOLEAN - Controls the fwmark of kernel-generated IPv4 reply packets that are not - associated with a socket for example, TCP RSTs or ICMP echo replies). - If unset, these packets have a fwmark of zero. If set, they have the - fwmark of the packet they are replying to. 
- Default: 0 - -fib_multipath_use_neigh - BOOLEAN - Use status of existing neighbor entry when determining nexthop for - multipath routes. If disabled, neighbor information is not used and - packets could be directed to a failed nexthop. Only valid for kernels - built with CONFIG_IP_ROUTE_MULTIPATH enabled. - Default: 0 (disabled) - Possible values: - 0 - disabled - 1 - enabled - -fib_multipath_hash_policy - INTEGER - Controls which hash policy to use for multipath routes. Only valid - for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. - Default: 0 (Layer 3) - Possible values: - 0 - Layer 3 - 1 - Layer 4 - 2 - Layer 3 or inner Layer 3 if present - -fib_sync_mem - UNSIGNED INTEGER - Amount of dirty memory from fib entries that can be backlogged before - synchronize_rcu is forced. - Default: 512kB Minimum: 64kB Maximum: 64MB - -ip_forward_update_priority - INTEGER - Whether to update SKB priority from "TOS" field in IPv4 header after it - is forwarded. The new SKB priority is mapped from TOS field value - according to an rt_tos2priority table (see e.g. man tc-prio). - Default: 1 (Update priority.) - Possible values: - 0 - Do not update priority. - 1 - Update priority. - -route/max_size - INTEGER - Maximum number of routes allowed in the kernel. Increase - this when using large numbers of interfaces and/or routes. - From linux kernel 3.6 onwards, this is deprecated for ipv4 - as route cache is no longer used. - -neigh/default/gc_thresh1 - INTEGER - Minimum number of entries to keep. Garbage collector will not - purge entries if there are fewer than this number. - Default: 128 - -neigh/default/gc_thresh2 - INTEGER - Threshold when garbage collector becomes more aggressive about - purging entries. Entries older than 5 seconds will be cleared - when over this number. - Default: 512 - -neigh/default/gc_thresh3 - INTEGER - Maximum number of non-PERMANENT neighbor entries allowed. 
Increase - this when using large numbers of interfaces and when communicating - with large numbers of directly-connected peers. - Default: 1024 - -neigh/default/unres_qlen_bytes - INTEGER - The maximum number of bytes which may be used by packets - queued for each unresolved address by other network layers. - (added in linux 3.3) - Setting negative value is meaningless and will return error. - Default: SK_WMEM_MAX, (same as net.core.wmem_default). - Exact value depends on architecture and kernel options, - but should be enough to allow queuing 256 packets - of medium size. - -neigh/default/unres_qlen - INTEGER - The maximum number of packets which may be queued for each - unresolved address by other network layers. - (deprecated in linux 3.3) : use unres_qlen_bytes instead. - Prior to linux 3.3, the default value is 3 which may cause - unexpected packet loss. The current default value is calculated - according to default value of unres_qlen_bytes and true size of - packet. - Default: 101 - -mtu_expires - INTEGER - Time, in seconds, that cached PMTU information is kept. - -min_adv_mss - INTEGER - The advertised MSS depends on the first hop route MTU, but will - never be lower than this setting. - -IP Fragmentation: - -ipfrag_high_thresh - LONG INTEGER - Maximum memory used to reassemble IP fragments. - -ipfrag_low_thresh - LONG INTEGER - (Obsolete since linux-4.17) - Maximum memory used to reassemble IP fragments before the kernel - begins to remove incomplete fragment queues to free up resources. - The kernel still accepts new fragments for defragmentation. - -ipfrag_time - INTEGER - Time in seconds to keep an IP fragment in memory. - -ipfrag_max_dist - INTEGER - ipfrag_max_dist is a non-negative integer value which defines the - maximum "disorder" which is allowed among fragments which share a - common IP source address. 
Note that reordering of packets is - not unusual, but if a large number of fragments arrive from a source - IP address while a particular fragment queue remains incomplete, it - probably indicates that one or more fragments belonging to that queue - have been lost. When ipfrag_max_dist is positive, an additional check - is done on fragments before they are added to a reassembly queue - if - ipfrag_max_dist (or more) fragments have arrived from a particular IP - address between additions to any IP fragment queue using that source - address, it's presumed that one or more fragments in the queue are - lost. The existing fragment queue will be dropped, and a new one - started. An ipfrag_max_dist value of zero disables this check. - - Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can - result in unnecessarily dropping fragment queues when normal - reordering of packets occurs, which could lead to poor application - performance. Using a very large value, e.g. 50000, increases the - likelihood of incorrectly reassembling IP fragments that originate - from different IP datagrams, which could result in data corruption. - Default: 64 - -INET peer storage: - -inet_peer_threshold - INTEGER - The approximate size of the storage. Starting from this threshold - entries will be thrown aggressively. This threshold also determines - entries' time-to-live and time intervals between garbage collection - passes. More entries, less time-to-live, less GC interval. - -inet_peer_minttl - INTEGER - Minimum time-to-live of entries. Should be enough to cover fragment - time-to-live on the reassembling side. This minimum time-to-live is - guaranteed if the pool size is less than inet_peer_threshold. - Measured in seconds. - -inet_peer_maxttl - INTEGER - Maximum time-to-live of entries. Unused entries will expire after - this period of time if there is no memory pressure on the pool (i.e. - when the number of entries in the pool is very small). - Measured in seconds. 
- -TCP variables: - -somaxconn - INTEGER - Limit of socket listen() backlog, known in userspace as SOMAXCONN. - Defaults to 4096. (Was 128 before linux-5.4) - See also tcp_max_syn_backlog for additional tuning for TCP sockets. - -tcp_abort_on_overflow - BOOLEAN - If listening service is too slow to accept new connections, - reset them. Default state is FALSE. It means that if overflow - occurred due to a burst, connection will recover. Enable this - option _only_ if you are really sure that listening daemon - cannot be tuned to accept connections faster. Enabling this - option can harm clients of your server. - -tcp_adv_win_scale - INTEGER - Count buffering overhead as bytes/2^tcp_adv_win_scale - (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), - if it is <= 0. - Possible values are [-31, 31], inclusive. - Default: 1 - -tcp_allowed_congestion_control - STRING - Show/set the congestion control choices available to non-privileged - processes. The list is a subset of those listed in - tcp_available_congestion_control. - Default is "reno" and the default setting (tcp_congestion_control). - -tcp_app_win - INTEGER - Reserve max(window/2^tcp_app_win, mss) of window for application - buffer. Value 0 is special, it means that nothing is reserved. - Default: 31 - -tcp_autocorking - BOOLEAN - Enable TCP auto corking : - When applications do consecutive small write()/sendmsg() system calls, - we try to coalesce these small writes as much as possible, to lower - total amount of sent packets. This is done if at least one prior - packet for the flow is waiting in Qdisc queues or device transmit - queue. Applications can still use TCP_CORK for optimal behavior - when they know how/when to uncork their sockets. - Default : 1 - -tcp_available_congestion_control - STRING - Shows the available congestion control choices that are registered. - More congestion control algorithms may be available as modules, - but not loaded. 
- -tcp_base_mss - INTEGER - The initial value of search_low to be used by the packetization layer - Path MTU discovery (MTU probing). If MTU probing is enabled, - this is the initial MSS used by the connection. - -tcp_mtu_probe_floor - INTEGER - If MTU probing is enabled this caps the minimum MSS used for search_low - for the connection. - - Default : 48 - -tcp_min_snd_mss - INTEGER - TCP SYN and SYNACK messages usually advertise an ADVMSS option, - as described in RFC 1122 and RFC 6691. - If this ADVMSS option is smaller than tcp_min_snd_mss, - it is silently capped to tcp_min_snd_mss. - - Default : 48 (at least 8 bytes of payload per segment) - -tcp_congestion_control - STRING - Set the congestion control algorithm to be used for new - connections. The algorithm "reno" is always available, but - additional choices may be available based on kernel configuration. - Default is set as part of kernel configuration. - For passive connections, the listener congestion control choice - is inherited. - [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] - -tcp_dsack - BOOLEAN - Allows TCP to send "duplicate" SACKs. - -tcp_early_retrans - INTEGER - Tail loss probe (TLP) converts RTOs occurring due to tail - losses into fast recovery (draft-ietf-tcpm-rack). Note that - TLP requires RACK to function properly (see tcp_recovery below) - Possible values: - 0 disables TLP - 3 or 4 enables TLP - Default: 3 - -tcp_ecn - INTEGER - Control use of Explicit Congestion Notification (ECN) by TCP. - ECN is used only when both ends of the TCP connection indicate - support for it. This feature is useful in avoiding losses due - to congestion by allowing supporting routers to signal - congestion before having to drop packets. - Possible values are: - 0 Disable ECN. Neither initiate nor accept ECN. - 1 Enable ECN when requested by incoming connections and - also request ECN on outgoing connection attempts. 
- 2 Enable ECN when requested by incoming connections - but do not request ECN on outgoing connections. - Default: 2 - -tcp_ecn_fallback - BOOLEAN - If the kernel detects that ECN connection misbehaves, enable fall - back to non-ECN. Currently, this knob implements the fallback - from RFC3168, section 6.1.1.1., but we reserve that in future, - additional detection mechanisms could be implemented under this - knob. The value is not used, if tcp_ecn or per route (or congestion - control) ECN settings are disabled. - Default: 1 (fallback enabled) - -tcp_fack - BOOLEAN - This is a legacy option, it has no effect anymore. - -tcp_fin_timeout - INTEGER - The length of time an orphaned (no longer referenced by any - application) connection will remain in the FIN_WAIT_2 state - before it is aborted at the local end. While a perfectly - valid "receive only" state for an un-orphaned connection, an - orphaned connection in FIN_WAIT_2 state could otherwise wait - forever for the remote to close its end of the connection. - Cf. tcp_max_orphans - Default: 60 seconds - -tcp_frto - INTEGER - Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. - F-RTO is an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in networks where the - RTT fluctuates (e.g., wireless). F-RTO is sender-side only - modification. It does not require any support from the peer. - - By default it's enabled with a non-zero value. 0 disables F-RTO. - -tcp_fwmark_accept - BOOLEAN - If set, incoming connections to listening sockets that do not have a - socket mark will set the mark of the accepting socket to the fwmark of - the incoming SYN packet. This will cause all packets on that connection - (starting from the first SYNACK) to be sent with that fwmark. The - listening socket's mark is unchanged. Listening sockets that already - have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are - unaffected. 
- - Default: 0 - -tcp_invalid_ratelimit - INTEGER - Limit the maximal rate for sending duplicate acknowledgments - in response to incoming TCP packets that are for an existing - connection but that are invalid due to any of these reasons: - - (a) out-of-window sequence number, - (b) out-of-window acknowledgment number, or - (c) PAWS (Protection Against Wrapped Sequence numbers) check failure - - This can help mitigate simple "ack loop" DoS attacks, wherein - a buggy or malicious middlebox or man-in-the-middle can - rewrite TCP header fields in manner that causes each endpoint - to think that the other is sending invalid TCP segments, thus - causing each side to send an unterminating stream of duplicate - acknowledgments for invalid segments. - - Using 0 disables rate-limiting of dupacks in response to - invalid segments; otherwise this value specifies the minimal - space between sending such dupacks, in milliseconds. - - Default: 500 (milliseconds). - -tcp_keepalive_time - INTEGER - How often TCP sends out keepalive messages when keepalive is enabled. - Default: 2hours. - -tcp_keepalive_probes - INTEGER - How many keepalive probes TCP sends out, until it decides that the - connection is broken. Default value: 9. - -tcp_keepalive_intvl - INTEGER - How frequently the probes are send out. Multiplied by - tcp_keepalive_probes it is time to kill not responding connection, - after probes started. Default value: 75sec i.e. connection - will be aborted after ~11 minutes of retries. - -tcp_l3mdev_accept - BOOLEAN - Enables child sockets to inherit the L3 master device index. - Enabling this option allows a "global" listen socket to work - across L3 master domains (e.g., VRFs) with connected sockets - derived from the listen socket to be bound to the L3 domain in - which the packets originated. Only valid when the kernel was - compiled with CONFIG_NET_L3_MASTER_DEV. - Default: 0 (disabled) - -tcp_low_latency - BOOLEAN - This is a legacy option, it has no effect anymore. 
- -tcp_max_orphans - INTEGER - Maximal number of TCP sockets not attached to any user file handle, - held by system. If this number is exceeded orphaned connections are - reset immediately and warning is printed. This limit exists - only to prevent simple DoS attacks, you _must_ not rely on this - or lower the limit artificially, but rather increase it - (probably, after increasing installed memory), - if network conditions require more than default value, - and tune network services to linger and kill such states - more aggressively. Let me to remind again: each orphan eats - up to ~64K of unswappable memory. - -tcp_max_syn_backlog - INTEGER - Maximal number of remembered connection requests (SYN_RECV), - which have not received an acknowledgment from connecting client. - This is a per-listener limit. - The minimal value is 128 for low memory machines, and it will - increase in proportion to the memory of machine. - If server suffers from overload, try increasing this number. - Remember to also check /proc/sys/net/core/somaxconn - A SYN_RECV request socket consumes about 304 bytes of memory. - -tcp_max_tw_buckets - INTEGER - Maximal number of timewait sockets held by system simultaneously. - If this number is exceeded time-wait socket is immediately destroyed - and warning is printed. This limit exists only to prevent - simple DoS attacks, you _must_ not lower the limit artificially, - but rather increase it (probably, after increasing installed memory), - if network conditions require more than default value. - -tcp_mem - vector of 3 INTEGERs: min, pressure, max - min: below this number of pages TCP is not bothered about its - memory appetite. - - pressure: when amount of memory allocated by TCP exceeds this number - of pages, TCP moderates its memory consumption and enters memory - pressure mode, which is exited when memory consumption falls - under "min". - - max: number of pages allowed for queueing by all TCP sockets. 
- - Defaults are calculated at boot time from amount of available - memory. - -tcp_min_rtt_wlen - INTEGER - The window length of the windowed min filter to track the minimum RTT. - A shorter window lets a flow more quickly pick up new (higher) - minimum RTT when it is moved to a longer path (e.g., due to traffic - engineering). A longer window makes the filter more resistant to RTT - inflations such as transient congestion. The unit is seconds. - Possible values: 0 - 86400 (1 day) - Default: 300 - -tcp_moderate_rcvbuf - BOOLEAN - If set, TCP performs receive buffer auto-tuning, attempting to - automatically size the buffer (no greater than tcp_rmem[2]) to - match the size required by the path for full throughput. Enabled by - default. - -tcp_mtu_probing - INTEGER - Controls TCP Packetization-Layer Path MTU Discovery. Takes three - values: - 0 - Disabled - 1 - Disabled by default, enabled when an ICMP black hole detected - 2 - Always enabled, use initial MSS of tcp_base_mss. - -tcp_probe_interval - UNSIGNED INTEGER - Controls how often to start TCP Packetization-Layer Path MTU - Discovery reprobe. The default is reprobing every 10 minutes as - per RFC4821. - -tcp_probe_threshold - INTEGER - Controls when TCP Packetization-Layer Path MTU Discovery probing - will stop in respect to the width of search range in bytes. Default - is 8 bytes. - -tcp_no_metrics_save - BOOLEAN - By default, TCP saves various connection metrics in the route cache - when the connection closes, so that connections established in the - near future can use these to set initial conditions. Usually, this - increases overall performance, but may sometimes cause performance - degradation. If set, TCP will not cache metrics on closing - connections. - -tcp_no_ssthresh_metrics_save - BOOLEAN - Controls whether TCP saves ssthresh metrics in the route cache. - Default is 1, which disables ssthresh metrics. 
- -tcp_orphan_retries - INTEGER - This value influences the timeout of a locally closed TCP connection, - when RTO retransmissions remain unacknowledged. - See tcp_retries2 for more details. - - The default value is 8. - If your machine is a loaded WEB server, - you should think about lowering this value, such sockets - may consume significant resources. Cf. tcp_max_orphans. - -tcp_recovery - INTEGER - This value is a bitmap to enable various experimental loss recovery - features. - - RACK: 0x1 enables the RACK loss detection for fast detection of lost - retransmissions and tail drops. It also subsumes and disables - RFC6675 recovery for SACK connections. - RACK: 0x2 makes RACK's reordering window static (min_rtt/4). - RACK: 0x4 disables RACK's DUPACK threshold heuristic - - Default: 0x1 - -tcp_reordering - INTEGER - Initial reordering level of packets in a TCP stream. - TCP stack can then dynamically adjust flow reordering level - between this initial value and tcp_max_reordering - Default: 3 - -tcp_max_reordering - INTEGER - Maximal reordering level of packets in a TCP stream. - 300 is a fairly conservative value, but you might increase it - if paths are using per packet load balancing (like bonding rr mode) - Default: 300 - -tcp_retrans_collapse - BOOLEAN - Bug-to-bug compatibility with some broken printers. - On retransmit try to send bigger packets to work around bugs in - certain TCP stacks. - -tcp_retries1 - INTEGER - This value influences the time, after which TCP decides, that - something is wrong due to unacknowledged RTO retransmissions, - and reports this suspicion to the network layer. - See tcp_retries2 for more details. - - RFC 1122 recommends at least 3 retransmissions, which is the - default. - -tcp_retries2 - INTEGER - This value influences the timeout of an alive TCP connection, - when RTO retransmissions remain unacknowledged. 
- Given a value of N, a hypothetical TCP connection following - exponential backoff with an initial RTO of TCP_RTO_MIN would - retransmit N times before killing the connection at the (N+1)th RTO. - - The default value of 15 yields a hypothetical timeout of 924.6 - seconds and is a lower bound for the effective timeout. - TCP will effectively time out at the first RTO which exceeds the - hypothetical timeout. - - RFC 1122 recommends at least 100 seconds for the timeout, - which corresponds to a value of at least 8. - -tcp_rfc1337 - BOOLEAN - If set, the TCP stack behaves conforming to RFC1337. If unset, - we are not conforming to RFC, but prevent TCP TIME_WAIT - assassination. - Default: 0 - -tcp_rmem - vector of 3 INTEGERs: min, default, max - min: Minimal size of receive buffer used by TCP sockets. - It is guaranteed to each TCP socket, even under moderate memory - pressure. - Default: 4K - - default: initial size of receive buffer used by TCP sockets. - This value overrides net.core.rmem_default used by other protocols. - Default: 87380 bytes. This value results in window of 65535 with - default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit - less for default tcp_app_win. See below about these variables. - - max: maximal size of receive buffer allowed for automatically - selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables - automatic tuning of that socket's receive buffer size, in which - case this value is ignored. - Default: between 87380B and 6MB, depending on RAM size. - -tcp_sack - BOOLEAN - Enable select acknowledgments (SACKS). - -tcp_comp_sack_delay_ns - LONG INTEGER - TCP tries to reduce number of SACK sent, using a timer - based on 5% of SRTT, capped by this sysctl, in nano seconds. - The default is 1ms, based on TSO autosizing period. - - Default : 1,000,000 ns (1 ms) - -tcp_comp_sack_nr - INTEGER - Max number of SACK that can be compressed. 
- Using 0 disables SACK compression. - - Default : 44 - -tcp_slow_start_after_idle - BOOLEAN - If set, provide RFC2861 behavior and time out the congestion - window after an idle period. An idle period is defined as - the current RTO. If unset, the congestion window will not - be timed out after an idle period. - Default: 1 - -tcp_stdurg - BOOLEAN - Use the Host requirements interpretation of the TCP urgent pointer field. - Most hosts use the older BSD interpretation, so if you turn this on - Linux might not communicate correctly with them. - Default: FALSE - -tcp_synack_retries - INTEGER - Number of times SYNACKs for a passive TCP connection attempt will - be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to 31 seconds till the last retransmission - with the current initial RTO of 1 second. With this the final timeout - for a passive TCP connection will happen after 63 seconds. - -tcp_syncookies - INTEGER - Only valid when the kernel was compiled with CONFIG_SYN_COOKIES. - Send out syncookies when the syn backlog queue of a socket - overflows. This is to protect against the common 'SYN flood attack'. - Default: 1 - - Note that syncookies is a fallback facility. - It MUST NOT be used to help highly loaded servers withstand - a legitimate connection rate. If you see SYN flood warnings - in your logs, but investigation shows that they occur - because of overload with legitimate connections, you should tune - other parameters until this warning disappears. - See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. - - Syncookies seriously violate the TCP protocol, do not allow - the use of TCP extensions, and can result in serious degradation - of some services (e.g. SMTP relaying), visible not to you - but to your clients and relays contacting you. If you see - SYN flood warnings in your logs while not actually being flooded, your server - is seriously misconfigured.
- - If you want to test what effect syncookies have on your - network connections, you can set this knob to 2 to - unconditionally enable generation of syncookies. - -tcp_fastopen - INTEGER - Enable TCP Fast Open (RFC7413) to send and accept data in the opening - SYN packet. - - The client support is enabled by flag 0x1 (on by default). The client - then must use sendmsg() or sendto() with the MSG_FASTOPEN flag, - rather than connect() to send data in SYN. - - The server support is enabled by flag 0x2 (off by default). Then - either enable for all listeners with another flag (0x400) or - enable individual listeners via TCP_FASTOPEN socket option with - the option value being the length of the syn-data backlog. - - The values (bitmap) are - 0x1: (client) enables sending data in the opening SYN on the client. - 0x2: (server) enables the server support, i.e., allowing data in - a SYN packet to be accepted and passed to the - application before 3-way handshake finishes. - 0x4: (client) send data in the opening SYN regardless of cookie - availability and without a cookie option. - 0x200: (server) accept data-in-SYN w/o any cookie option present. - 0x400: (server) enable all listeners to support Fast Open by - default without explicit TCP_FASTOPEN socket option. - - Default: 0x1 - - Note that additional client or server features are only - effective if the basic support (0x1 and 0x2, respectively) is enabled. - -tcp_fastopen_blackhole_timeout_sec - INTEGER - Initial time period in seconds to disable Fastopen on active TCP sockets - when a TFO firewall blackhole issue happens. - This time period will grow exponentially when more blackhole issues - get detected right after Fastopen is re-enabled and will reset to - initial value when the blackhole issue goes away. - Set to 0 to disable the blackhole detection. - By default, it is set to 1 hour.
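The tcp_fastopen bitmap described above can be decoded mechanically. The following is a minimal sketch in Python, not kernel code: the helper names decode_tcp_fastopen() and ineffective_flags() are hypothetical, and the flag descriptions are paraphrased from the entry above. The second helper applies the rule that the extra client flag (0x4) needs 0x1 and the extra server flags (0x200, 0x400) need 0x2.

```python
# Flag meanings paraphrased from the tcp_fastopen entry; helper names are
# hypothetical and exist only for this illustration.
TCP_FASTOPEN_FLAGS = {
    0x1: "client: send data in the opening SYN",
    0x2: "server: accept data in SYN before the 3-way handshake finishes",
    0x4: "client: send data in SYN regardless of cookie availability",
    0x200: "server: accept data-in-SYN without any cookie option",
    0x400: "server: enable Fast Open on all listeners by default",
}


def decode_tcp_fastopen(value):
    """Return the descriptions of all flags set in a tcp_fastopen value."""
    return [desc for bit, desc in sorted(TCP_FASTOPEN_FLAGS.items())
            if value & bit]


def ineffective_flags(value):
    """Return extra flags that are set without their basic support bit."""
    ineffective = []
    # 0x4 is a client feature and needs basic client support (0x1).
    if value & 0x4 and not value & 0x1:
        ineffective.append(0x4)
    # 0x200 and 0x400 are server features and need basic server support (0x2).
    for bit in (0x200, 0x400):
        if value & bit and not value & 0x2:
            ineffective.append(bit)
    return ineffective


if __name__ == "__main__":
    for value in (0x1, 0x403, 0x400):
        print(hex(value), decode_tcp_fastopen(value),
              [hex(bit) for bit in ineffective_flags(value)])
```

A value such as 0x400 on its own is reported as ineffective by this sketch, because the basic server bit 0x2 is missing.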
- -tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs - The list consists of a primary key and an optional backup key. The - primary key is used for both creating and validating cookies, while the - optional backup key is only used for validating cookies. The purpose of - the backup key is to maximize TFO validation when keys are rotated. - - A randomly chosen primary key may be configured by the kernel if - the tcp_fastopen sysctl is set to 0x400 (see above), or if the - TCP_FASTOPEN setsockopt() optname is set and a key has not been - previously configured via sysctl. If keys are configured via - setsockopt() by using the TCP_FASTOPEN_KEY optname, then those - per-socket keys will be used instead of any keys that are specified via - sysctl. - - A key is specified as 4 8-digit hexadecimal integers which are separated - by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be - omitted. A primary and a backup key may be specified by separating them - by a comma. If only one key is specified, it becomes the primary key and - any previously configured backup keys are removed. - -tcp_syn_retries - INTEGER - Number of times initial SYNs for an active TCP connection attempt - will be retransmitted. Should not be higher than 127. Default value - is 6, which corresponds to 63seconds till the last retransmission - with the current initial RTO of 1second. With this the final timeout - for an active TCP connection attempt will happen after 127seconds. - -tcp_timestamps - INTEGER -Enable timestamps as defined in RFC1323. - 0: Disabled. - 1: Enable timestamps as defined in RFC1323 and use random offset for - each connection rather than only using the current time. - 2: Like 1, but without random offsets. - Default: 1 - -tcp_min_tso_segs - INTEGER - Minimal number of segments per TSO frame. - Since linux-3.12, TCP does an automatic sizing of TSO frames, - depending on flow rate, instead of filling 64Kbytes packets. 
- For specific usages, it's possible to force TCP to build big - TSO frames. Note that TCP stack might split too big TSO packets - if available window is too small. - Default: 2 - -tcp_pacing_ss_ratio - INTEGER - sk->sk_pacing_rate is set by TCP stack using a ratio applied - to current rate. (current_rate = cwnd * mss / srtt) - If TCP is in slow start, tcp_pacing_ss_ratio is applied - to let TCP probe for bigger speeds, assuming cwnd can be - doubled every other RTT. - Default: 200 - -tcp_pacing_ca_ratio - INTEGER - sk->sk_pacing_rate is set by TCP stack using a ratio applied - to current rate. (current_rate = cwnd * mss / srtt) - If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio - is applied to conservatively probe for bigger throughput. - Default: 120 - -tcp_tso_win_divisor - INTEGER - This allows control over what percentage of the congestion window - can be consumed by a single TSO frame. - The setting of this parameter is a choice between burstiness and - building larger TSO frames. - Default: 3 - -tcp_tw_reuse - INTEGER - Enable reuse of TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. - 0 - disable - 1 - global enable - 2 - enable for loopback traffic only - It should not be changed without advice/request of technical - experts. - Default: 2 - -tcp_window_scaling - BOOLEAN - Enable window scaling as defined in RFC1323. - -tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP sockets. - Each TCP socket has rights to use it due to fact of its birth. - Default: 4K - - default: initial size of send buffer used by TCP sockets. This - value overrides net.core.wmem_default used by other protocols. - It is usually lower than net.core.wmem_default. - Default: 16K - - max: Maximal amount of memory allowed for automatically tuned - send buffers for TCP sockets. This value does not override - net.core.wmem_max. 
Calling setsockopt() with SO_SNDBUF disables - automatic tuning of that socket's send buffer size, in which case - this value is ignored. - Default: between 64K and 4MB, depending on RAM size. - -tcp_notsent_lowat - UNSIGNED INTEGER - A TCP socket can control the amount of unsent bytes in its write queue, - thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() - reports POLLOUT events if the amount of unsent bytes is below a per - socket value, and if the write queue is not full. sendmsg() will - also not add new buffers if the limit is hit. - - This global variable controls the amount of unsent data for - sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change - to the global variable has immediate effect. - - Default: UINT_MAX (0xFFFFFFFF) - -tcp_workaround_signed_windows - BOOLEAN - If set, assume no receipt of a window scaling option means the - remote TCP is broken and treats the window as a signed quantity. - If unset, assume the remote TCP is not broken even if we do - not receive a window scaling option from them. - Default: 0 - -tcp_thin_linear_timeouts - BOOLEAN - Enable dynamic triggering of linear timeouts for thin streams. - If set, a check is performed upon retransmission by timeout to - determine if the stream is thin (less than 4 packets in flight). - As long as the stream is found to be thin, up to 6 linear - timeouts may be performed before exponential backoff mode is - initiated. This improves retransmission latency for - non-aggressive thin streams, often found to be time-dependent. - For more information on thin streams, see - Documentation/networking/tcp-thin.txt - Default: 0 - -tcp_limit_output_bytes - INTEGER - Controls TCP Small Queue limit per tcp socket. - TCP bulk sender tends to increase packets in flight until it - gets losses notifications. 
With SNDBUF autotuning, this can - result in a large amount of packets queued on the local machine - (e.g.: qdiscs, CPU backlog, or device) hurting latency of other - flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes - limits the number of bytes on qdisc or device to reduce artificial - RTT/cwnd and reduce bufferbloat. - Default: 1048576 (16 * 65536) - -tcp_challenge_ack_limit - INTEGER - Limits number of Challenge ACK sent per second, as recommended - in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) - Default: 1000 - -tcp_rx_skb_cache - BOOLEAN - Controls a per TCP socket cache of one skb, that might help - performance of some workloads. This might be dangerous - on systems with a lot of TCP sockets, since it increases - memory usage. - - Default: 0 (disabled) - -UDP variables: - -udp_l3mdev_accept - BOOLEAN - Enabling this option allows a "global" bound socket to work - across L3 master domains (e.g., VRFs) with packets capable of - being received regardless of the L3 domain in which they - originated. Only valid when the kernel was compiled with - CONFIG_NET_L3_MASTER_DEV. - Default: 0 (disabled) - -udp_mem - vector of 3 INTEGERs: min, pressure, max - Number of pages allowed for queueing by all UDP sockets. - - min: Below this number of pages UDP is not bothered about its - memory appetite. When amount of memory allocated by UDP exceeds - this number, UDP starts to moderate memory usage. - - pressure: This value was introduced to follow format of tcp_mem. - - max: Number of pages allowed for queueing by all UDP sockets. - - Default is calculated at boot time from amount of available memory. - -udp_rmem_min - INTEGER - Minimal size of receive buffer used by UDP sockets in moderation. - Each UDP socket is able to use the size for receiving data, even if - total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4K - -udp_wmem_min - INTEGER - Minimal size of send buffer used by UDP sockets in moderation. 
- Each UDP socket is able to use the size for sending data, even if - total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4K - -RAW variables: - -raw_l3mdev_accept - BOOLEAN - Enabling this option allows a "global" bound socket to work - across L3 master domains (e.g., VRFs) with packets capable of - being received regardless of the L3 domain in which they - originated. Only valid when the kernel was compiled with - CONFIG_NET_L3_MASTER_DEV. - Default: 1 (enabled) - -CIPSOv4 Variables: - -cipso_cache_enable - BOOLEAN - If set, enable additions to and lookups from the CIPSO label mapping - cache. If unset, additions are ignored and lookups always result in a - miss. However, regardless of the setting the cache is still - invalidated when required, which means you can safely toggle this on and - off and the cache will always be "safe". - Default: 1 - -cipso_cache_bucket_size - INTEGER - The CIPSO label cache consists of a fixed size hash table with each - hash bucket containing a number of cache entries. This variable limits - the number of entries in each hash bucket; the larger the value, the - more CIPSO label mappings can be cached. When the number of - entries in a given hash bucket reaches this limit, adding new entries - causes the oldest entry in the bucket to be removed to make room. - Default: 10 - -cipso_rbm_optfmt - BOOLEAN - Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of - the CIPSO draft specification (see Documentation/netlabel for details). - This means that, when set, the CIPSO tag will be padded with empty - categories in order to make the packet data 32-bit aligned. - Default: 0 - -cipso_rbm_structvalid - BOOLEAN - If set, do a very strict check of the CIPSO option when - ip_options_compile() is called. If unset, relax the checks done during - ip_options_compile().
Either way is "safe" as errors are caught else - where in the CIPSO processing code but setting this to 0 (False) should - result in less work (i.e. it should be faster) but could cause problems - with other implementations that require strict checking. - Default: 0 - -IP Variables: - -ip_local_port_range - 2 INTEGERS - Defines the local port range that is used by TCP and UDP to - choose the local port. The first number is the first, the - second the last local port number. - If possible, it is better these numbers have different parity - (one even and one odd value). - Must be greater than or equal to ip_unprivileged_port_start. - The default values are 32768 and 60999 respectively. - -ip_local_reserved_ports - list of comma separated ranges - Specify the ports which are reserved for known third-party - applications. These ports will not be used by automatic port - assignments (e.g. when calling connect() or bind() with port - number 0). Explicit port allocation behavior is unchanged. - - The format used for both input and output is a comma separated - list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and - 10). Writing to the file will clear all previously reserved - ports and update the current list with the one given in the - input. - - Note that ip_local_port_range and ip_local_reserved_ports - settings are independent and both are considered by the kernel - when determining which ports are available for automatic port - assignments. - - You can reserve ports which are not in the current - ip_local_port_range, e.g.: - - $ cat /proc/sys/net/ipv4/ip_local_port_range - 32000 60999 - $ cat /proc/sys/net/ipv4/ip_local_reserved_ports - 8080,9148 - - although this is redundant. However such a setting is useful - if later the port range is changed to a value that will - include the reserved ports. - - Default: Empty - -ip_unprivileged_port_start - INTEGER - This is a per-namespace sysctl. It defines the first - unprivileged port in the network namespace. 
Privileged ports - require root or CAP_NET_BIND_SERVICE in order to bind to them. - To disable all privileged ports, set this to 0. They must not - overlap with the ip_local_port_range. - - Default: 1024 - -ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IP addresses, - which can be quite useful - but may break some applications. - Default: 0 - -ip_autobind_reuse - BOOLEAN - By default, bind() does not select the ports automatically even if - the new socket and all sockets bound to the port have SO_REUSEADDR. - ip_autobind_reuse allows bind() to reuse the port and this is useful - when you use bind()+connect(), but may break some applications. - The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this - option should only be set by experts. - Default: 0 - -ip_dynaddr - BOOLEAN - If set non-zero, enables support for dynamic addresses. - If set to a non-zero value larger than 1, a kernel log - message will be printed when dynamic address rewriting - occurs. - Default: 0 - -ip_early_demux - BOOLEAN - Optimize input packet processing down to one demux for - certain kinds of local sockets. Currently we only do this - for established TCP and connected UDP sockets. - - It may add an additional cost for pure routing workloads that - reduces overall throughput, in such case you should disable it. - Default: 1 - -ping_group_range - 2 INTEGERS - Restrict ICMP_PROTO datagram sockets to users in the group range. - The default is "1 0", meaning, that nobody (not even root) may - create ping sockets. Setting it to "100 100" would grant permissions - to the single group. "0 4294967295" would enable it for the world, "100 - 4294967295" would enable it for the users, but not daemons. - -tcp_early_demux - BOOLEAN - Enable early demux for established TCP sockets. - Default: 1 - -udp_early_demux - BOOLEAN - Enable early demux for connected UDP sockets. Disable this if - your system could experience more unconnected load. 
- Default: 1 - -icmp_echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it. - Default: 0 - -icmp_echo_ignore_broadcasts - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO and - TIMESTAMP requests sent to it via broadcast/multicast. - Default: 1 - -icmp_ratelimit - INTEGER - Limit the maximal rates for sending ICMP packets whose type matches - icmp_ratemask (see below) to specific targets. - 0 to disable any limiting, - otherwise the minimal space between responses in milliseconds. - Note that another sysctl, icmp_msgs_per_sec limits the number - of ICMP packets sent on all targets. - Default: 1000 - -icmp_msgs_per_sec - INTEGER - Limit maximal number of ICMP packets sent per second from this host. - Only messages whose type matches icmp_ratemask (see below) are - controlled by this limit. - Default: 1000 - -icmp_msgs_burst - INTEGER - icmp_msgs_per_sec controls number of ICMP packets sent per second, - while icmp_msgs_burst controls the burst size of these packets. - Default: 50 - -icmp_ratemask - INTEGER - Mask made of ICMP types for which rates are being limited. - Significant bits: IHGFEDCBA9876543210 - Default mask: 0000001100000011000 (6168) - - Bit definitions (see include/linux/icmp.h): - 0 Echo Reply - 3 Destination Unreachable * - 4 Source Quench * - 5 Redirect - 8 Echo Request - B Time Exceeded * - C Parameter Problem * - D Timestamp Request - E Timestamp Reply - F Info Request - G Info Reply - H Address Mask Request - I Address Mask Reply - - * These are rate limited by default (see default mask above) - -icmp_ignore_bogus_error_responses - BOOLEAN - Some routers violate RFC1122 by sending bogus responses to broadcast - frames. Such violations are normally logged via a kernel warning. - If this is set to TRUE, the kernel will not give such warnings, which - will avoid log file clutter. 
- Default: 1 - -icmp_errors_use_inbound_ifaddr - BOOLEAN - - If zero, icmp error messages are sent with the primary address of - the exiting interface. - - If non-zero, the message will be sent with the primary address of - the interface that received the packet that caused the icmp error. - This is the behaviour many network administrators will expect from - a router. And it can make debugging complicated network layouts - much easier. - - Note that if no primary address exists for the interface selected, - then the primary address of the first non-loopback interface that - has one will be used regardless of this setting. - - Default: 0 - -igmp_max_memberships - INTEGER - Change the maximum number of multicast groups we can subscribe to. - Default: 20 - - Theoretical maximum value is bounded by having to send a membership - report in a single datagram (i.e. the report can't span multiple - datagrams, or risk confusing the switch and leaving groups you don't - intend to). - - The number of supported groups 'M' is bounded by the number of group - report entries you can fit into a single datagram of 65535 bytes. - - M = (65536 - sizeof(ip header)) / sizeof(Group record) - - Group records are variable length, with a minimum of 12 bytes. - So net.ipv4.igmp_max_memberships should not be set higher than: - - (65536-24) / 12 = 5459 - - The value 5459 assumes no IP header options, so in practice - this number may be lower. - -igmp_max_msf - INTEGER - Maximum number of addresses allowed in the source filter list for a - multicast group. - Default: 10 - -igmp_qrv - INTEGER - Controls the IGMP query robustness variable (see RFC2236 8.1). - Default: 2 (as specified by RFC2236 8.1) - Minimum: 1 (as specified by RFC6636 4.5) - -force_igmp_version - INTEGER - 0 - (default) No enforcement of an IGMP version, IGMPv1/v2 fallback - allowed. Will go back to IGMPv3 mode once all IGMPv1/v2 Querier - Present timers expire. - 1 - Enforce to use IGMP version 1.
Will also reply with an IGMPv1 report if - an IGMPv2/v3 query is received. - 2 - Enforce to use IGMP version 2. Will fall back to IGMPv1 if an - IGMPv1 query message is received. Will reply with a report if an IGMPv3 query is received. - 3 - Enforce to use IGMP version 3. Behaves the same as the default 0. - - Note: this is not the same as force_mld_version, because the IGMPv3 RFC3376 - Security Considerations do not clearly state that other version - messages may be ignored completely, as MLDv2 RFC3810 does. Leaving - this value at the default 0 is therefore recommended. - -conf/interface/* changes special settings per interface (where -"interface" is the name of your network interface) - -conf/all/* is special, changes the settings for all interfaces - -log_martians - BOOLEAN - Log packets with impossible addresses to kernel log. - log_martians for the interface will be enabled if at least one of - conf/{all,interface}/log_martians is set to TRUE, - it will be disabled otherwise - -accept_redirects - BOOLEAN - Accept ICMP redirect messages. - accept_redirects for the interface will be enabled if: - - both conf/{all,interface}/accept_redirects are TRUE in the case - forwarding for the interface is enabled - or - - at least one of conf/{all,interface}/accept_redirects is TRUE in the - case forwarding for the interface is disabled - accept_redirects for the interface will be disabled otherwise - default TRUE (host) - FALSE (router) - -forwarding - BOOLEAN - Enable IP forwarding on this interface. This controls whether packets - received _on_ this interface can be forwarded. - -mc_forwarding - BOOLEAN - Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE - and a multicast routing daemon is required. - conf/all/mc_forwarding must also be set to TRUE to enable multicast - routing for the interface - -medium_id - INTEGER - Integer value used to differentiate the devices by the medium they - are attached to.
Two devices can have different id values when - the broadcast packets are received only on one of them. - The default value 0 means that the device is the only interface - to its medium, value of -1 means that medium is not known. - - Currently, it is used to change the proxy_arp behavior: - the proxy_arp feature is enabled for packets forwarded between - two devices attached to different media. - -proxy_arp - BOOLEAN - Do proxy arp. - proxy_arp for the interface will be enabled if at least one of - conf/{all,interface}/proxy_arp is set to TRUE, - it will be disabled otherwise - -proxy_arp_pvlan - BOOLEAN - Private VLAN proxy arp. - Basically allow proxy arp replies back to the same interface - (from which the ARP request/solicitation was received). - - This is done to support (ethernet) switch features, like RFC - 3069, where the individual ports are NOT allowed to - communicate with each other, but they are allowed to talk to - the upstream router. As described in RFC 3069, it is possible - to allow these hosts to communicate through the upstream - router by proxy_arp'ing. Don't need to be used together with - proxy_arp. - - This technology is known by different names: - In RFC 3069 it is called VLAN Aggregation. - Cisco and Allied Telesyn call it Private VLAN. - Hewlett-Packard call it Source-Port filtering or port-isolation. - Ericsson call it MAC-Forced Forwarding (RFC Draft). - -shared_media - BOOLEAN - Send(router) or accept(host) RFC1620 shared media redirects. - Overrides secure_redirects. - shared_media for the interface will be enabled if at least one of - conf/{all,interface}/shared_media is set to TRUE, - it will be disabled otherwise - default TRUE - -secure_redirects - BOOLEAN - Accept ICMP redirect messages only to gateways listed in the - interface's current gateway list. Even if disabled, RFC1122 redirect - rules still apply. - Overridden by shared_media. 
- secure_redirects for the interface will be enabled if at least one of - conf/{all,interface}/secure_redirects is set to TRUE, - it will be disabled otherwise - default TRUE - -send_redirects - BOOLEAN - Send redirects, if router. - send_redirects for the interface will be enabled if at least one of - conf/{all,interface}/send_redirects is set to TRUE, - it will be disabled otherwise - Default: TRUE - -bootp_relay - BOOLEAN - Accept packets with source address 0.b.c.d destined - not to this host as local ones. It is assumed that a - BOOTP relay daemon will catch and forward such packets. - conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay - for the interface - default FALSE - Not Implemented Yet. - -accept_source_route - BOOLEAN - Accept packets with SRR option. - conf/all/accept_source_route must also be set to TRUE to accept packets - with SRR option on the interface - default TRUE (router) - FALSE (host) - -accept_local - BOOLEAN - Accept packets with local source addresses. In combination with - suitable routing, this can be used to direct packets between two - local interfaces over the wire and have them accepted properly. - default FALSE - -route_localnet - BOOLEAN - Do not consider loopback addresses as martian source or destination - while routing. This enables the use of 127/8 for local routing purposes. - default FALSE - -rp_filter - INTEGER - 0 - No source validation. - 1 - Strict mode as defined in RFC3704 Strict Reverse Path - Each incoming packet is tested against the FIB and if the interface - is not the best reverse path the packet check will fail. - By default failed packets are discarded. - 2 - Loose mode as defined in RFC3704 Loose Reverse Path - Each incoming packet's source address is also tested against the FIB - and if the source address is not reachable via any interface - the packet check will fail. - - Current recommended practice in RFC3704 is to enable strict mode - to prevent IP spoofing from DDoS attacks.
If using asymmetric routing - or other complicated routing, then loose mode is recommended. - - The max value from conf/{all,interface}/rp_filter is used - when doing source validation on the {interface}. - - Default value is 0. Note that some distributions enable it - in startup scripts. - -arp_filter - BOOLEAN - 1 - Allows you to have multiple network interfaces on the same - subnet, and have the ARPs for each interface be answered - based on whether or not the kernel would route a packet from - the ARP'd IP out that interface (therefore you must use source - based routing for this to work). In other words it allows control - of which cards (usually 1) will respond to an arp request. - - 0 - (default) The kernel can respond to arp requests with addresses - from other interfaces. This may seem wrong but it usually makes - sense, because it increases the chance of successful communication. - IP addresses are owned by the complete host on Linux, not by - particular interfaces. Only for more complex setups like load- - balancing, does this behaviour cause problems. - - arp_filter for the interface will be enabled if at least one of - conf/{all,interface}/arp_filter is set to TRUE, - it will be disabled otherwise - -arp_announce - INTEGER - Define different restriction levels for announcing the local - source IP address from IP packets in ARP requests sent on - interface: - 0 - (default) Use any local address, configured on any interface - 1 - Try to avoid local addresses that are not in the target's - subnet for this interface. This mode is useful when target - hosts reachable via this interface require the source IP - address in ARP requests to be part of their logical network - configured on the receiving interface. When we generate the - request we will check all our subnets that include the - target IP and will preserve the source address if it is from - such subnet. If there is no such subnet we select source - address according to the rules for level 2. 
- 2 - Always use the best local address for this target. - In this mode we ignore the source address in the IP packet - and try to select local address that we prefer for talks with - the target host. Such local address is selected by looking - for primary IP addresses on all our subnets on the outgoing - interface that include the target IP address. If no suitable - local address is found we select the first local address - we have on the outgoing interface or on all other interfaces, - with the hope we will receive reply for our request and - even sometimes no matter the source IP address we announce. - - The max value from conf/{all,interface}/arp_announce is used. - - Increasing the restriction level gives more chance for - receiving answer from the resolved target while decreasing - the level announces more valid sender's information. - -arp_ignore - INTEGER - Define different modes for sending replies in response to - received ARP requests that resolve local target IP addresses: - 0 - (default): reply for any local target IP address, configured - on any interface - 1 - reply only if the target IP address is local address - configured on the incoming interface - 2 - reply only if the target IP address is local address - configured on the incoming interface and both with the - sender's IP address are part from same subnet on this interface - 3 - do not reply for local addresses configured with scope host, - only resolutions for global and link addresses are replied - 4-7 - reserved - 8 - do not reply for all local addresses - - The max value from conf/{all,interface}/arp_ignore is used - when ARP request is received on the {interface} - -arp_notify - BOOLEAN - Define mode for notification of address and device changes. - 0 - (default): do nothing - 1 - Generate gratuitous arp requests when device is brought up - or hardware address changes. 
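The two combination rules stated above for per-interface settings ("enabled if at least one of conf/{all,interface}/... is set to TRUE" for arp_filter, and "the max value from conf/{all,interface}/... is used" for arp_announce and arp_ignore) can be sketched as follows. This is only an illustrative model of the documented rules, not kernel code; the function names are hypothetical.

```python
def effective_bool(conf_all: int, conf_iface: int) -> bool:
    """Model of the 'enabled if at least one is TRUE' rule (e.g. arp_filter)."""
    return bool(conf_all) or bool(conf_iface)


def effective_max(conf_all: int, conf_iface: int) -> int:
    """Model of the 'max value is used' rule (e.g. arp_announce, arp_ignore)."""
    return max(conf_all, conf_iface)


if __name__ == "__main__":
    # conf/all/arp_ignore = 1, conf/eth0/arp_ignore = 2 -> mode 2 applies on eth0
    print(effective_max(1, 2))
    # conf/all/arp_filter = 0, conf/eth0/arp_filter = 1 -> filtering enabled on eth0
    print(effective_bool(0, 1))
```

Under this model, lowering a per-interface value below the conf/all value has no effect, because the larger of the two always wins.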
- -arp_accept - BOOLEAN - Define behavior for gratuitous ARP frames whose IP is not - already present in the ARP table: - 0 - don't create new entries in the ARP table - 1 - create new entries in the ARP table - - Both reply-type and request-type gratuitous ARP frames will trigger the - ARP table to be updated, if this setting is on. - - If the ARP table already contains the IP address of the - gratuitous arp frame, the arp table will be updated regardless - of whether this setting is on or off. - -mcast_solicit - INTEGER - The maximum number of multicast probes in INCOMPLETE state, - when the associated hardware address is unknown. Defaults - to 3. - -ucast_solicit - INTEGER - The maximum number of unicast probes in PROBE state, when - the hardware address is being reconfirmed. Defaults to 3. - -app_solicit - INTEGER - The maximum number of probes to send to the user space ARP daemon - via netlink before dropping back to multicast probes (see - mcast_resolicit). Defaults to 0. - -mcast_resolicit - INTEGER - The maximum number of multicast probes after unicast and - app probes in PROBE state. Defaults to 0. - -disable_policy - BOOLEAN - Disable IPSEC policy (SPD) for this interface - -disable_xfrm - BOOLEAN - Disable IPSEC encryption on this interface, whatever the policy - -igmpv2_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - IGMPv1 or IGMPv2 report retransmit will take place. - Default: 10000 (10 seconds) - -igmpv3_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - IGMPv3 report retransmit will take place. - Default: 1000 (1 second) - -promote_secondaries - BOOLEAN - When a primary IP address is removed from this interface - promote a corresponding secondary IP address instead of - removing all the corresponding secondary IP addresses. - -drop_unicast_in_l2_multicast - BOOLEAN - Drop any unicast IP packets that are received in link-layer - multicast (or broadcast) frames.
- This behavior (for multicast) is actually a SHOULD in RFC - 1122, but is disabled by default for compatibility reasons. - Default: off (0) - -drop_gratuitous_arp - BOOLEAN - Drop all gratuitous ARP frames, for example if there's a known - good ARP proxy on the network and such frames need not be used - (or in the case of 802.11, must not be used to prevent attacks.) - Default: off (0) - - -tag - INTEGER - Allows you to write a number, which can be used as required. - Default value is 0. - -xfrm4_gc_thresh - INTEGER - (Obsolete since linux-4.14) - The threshold at which we will start garbage collecting for IPv4 - destination cache entries. At twice this value the system will - refuse new allocations. - -igmp_link_local_mcast_reports - BOOLEAN - Enable IGMP reports for link local multicast groups in the - 224.0.0.X range. - Default TRUE - -Alexey Kuznetsov. -kuznet@ms2.inr.ac.ru - -Updated by: -Andi Kleen -ak@muc.de -Nicolas Delon -delon.nicolas@wanadoo.fr - - - - -/proc/sys/net/ipv6/* Variables: - -IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also -apply to IPv6 [XXX?]. - -bindv6only - BOOLEAN - Default value for IPV6_V6ONLY socket option, - which restricts use of the IPv6 socket to IPv6 communication - only. - TRUE: disable IPv4-mapped address feature - FALSE: enable IPv4-mapped address feature - - Default: FALSE (as specified in RFC3493) - -flowlabel_consistency - BOOLEAN - Protect the consistency (and unicity) of flow label. - You have to disable it to use IPV6_FL_F_REFLECT flag on the - flow label manager. - TRUE: enabled - FALSE: disabled - Default: TRUE - -auto_flowlabels - INTEGER - Automatically generate flow labels based on a flow hash of the - packet. This allows intermediate devices, such as routers, to - identify packet flows for mechanisms like Equal Cost Multipath - Routing (see RFC 6438). 
- 0: automatic flow labels are completely disabled - 1: automatic flow labels are enabled by default, they can be - disabled on a per socket basis using the IPV6_AUTOFLOWLABEL - socket option - 2: automatic flow labels are allowed, they may be enabled on a - per socket basis using the IPV6_AUTOFLOWLABEL socket option - 3: automatic flow labels are enabled and enforced, they cannot - be disabled by the socket option - Default: 1 - -flowlabel_state_ranges - BOOLEAN - Split the flow label number space into two ranges. 0-0x7FFFF is - reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF - is reserved for stateless flow labels as described in RFC6437. - TRUE: enabled - FALSE: disabled - Default: true - -flowlabel_reflect - INTEGER - Control flow label reflection. Needed for Path MTU - Discovery to work with Equal Cost Multipath Routing in anycast - environments. See RFC 7690 and: - https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 - - This is a bitmask. - 1: enabled for established flows - - Note that this prevents automatic flowlabel changes, as done - in "tcp: change IPv6 flow-label upon receiving spurious retransmission" - and "tcp: Change txhash on every SYN and RTO retransmit" - - 2: enabled for TCP RESET packets (no active listener) - If set, a RST packet sent in response to a SYN packet on a closed - port will reflect the incoming flow label. - - 4: enabled for ICMPv6 echo reply messages. - - Default: 0 - -fib_multipath_hash_policy - INTEGER - Controls which hash policy to use for multipath routes. 
- Default: 0 (Layer 3) - Possible values: - 0 - Layer 3 (source and destination addresses plus flow label) - 1 - Layer 4 (standard 5-tuple) - 2 - Layer 3 or inner Layer 3 if present - -anycast_src_echo_reply - BOOLEAN - Controls the use of anycast addresses as source addresses for ICMPv6 - echo reply - TRUE: enabled - FALSE: disabled - Default: FALSE - -idgen_delay - INTEGER - Controls the delay in seconds after which time to retry - privacy stable address generation if a DAD conflict is - detected. - Default: 1 (as specified in RFC7217) - -idgen_retries - INTEGER - Controls the number of retries to generate a stable privacy - address if a DAD conflict is detected. - Default: 3 (as specified in RFC7217) - -mld_qrv - INTEGER - Controls the MLD query robustness variable (see RFC3810 9.1). - Default: 2 (as specified by RFC3810 9.1) - Minimum: 1 (as specified by RFC6636 4.5) - -max_dst_opts_number - INTEGER - Maximum number of non-padding TLVs allowed in a Destination - options extension header. If this value is less than zero - then unknown options are disallowed and the number of known - TLVs allowed is the absolute value of this number. - Default: 8 - -max_hbh_opts_number - INTEGER - Maximum number of non-padding TLVs allowed in a Hop-by-Hop - options extension header. If this value is less than zero - then unknown options are disallowed and the number of known - TLVs allowed is the absolute value of this number. - Default: 8 - -max_dst_opts_length - INTEGER - Maximum length allowed for a Destination options extension - header. - Default: INT_MAX (unlimited) - -max_hbh_length - INTEGER - Maximum length allowed for a Hop-by-Hop options extension - header. - Default: INT_MAX (unlimited) - -skip_notify_on_dev_down - BOOLEAN - Controls whether an RTM_DELROUTE message is generated for routes - removed when a device is taken down or deleted. IPv4 does not - generate this message; IPv6 does by default. 
Setting this sysctl - to true skips the message, making IPv4 and IPv6 on par in relying - on userspace caches to track link events and evict routes. - Default: false (generate message) - -nexthop_compat_mode - BOOLEAN - New nexthop API provides a means for managing nexthops independent of - prefixes. Backwards compatibility with old route format is enabled by - default which means route dumps and notifications contain the new - nexthop attribute but also the full, expanded nexthop definition. - Further, updates or deletes of a nexthop configuration generate route - notifications for each fib entry using the nexthop. Once a system - understands the new API, this sysctl can be disabled to achieve full - performance benefits of the new API by disabling the nexthop expansion - and extraneous notifications. - Default: true (backward compat mode) - -IPv6 Fragmentation: - -ip6frag_high_thresh - INTEGER - Maximum memory used to reassemble IPv6 fragments. When - ip6frag_high_thresh bytes of memory are allocated for this purpose, - the fragment handler will toss packets until ip6frag_low_thresh - is reached. - -ip6frag_low_thresh - INTEGER - See ip6frag_high_thresh - -ip6frag_time - INTEGER - Time in seconds to keep an IPv6 fragment in memory. - -IPv6 Segment Routing: - -seg6_flowlabel - INTEGER - Controls the behaviour of computing the flowlabel of outer - IPv6 header in case of SR T.encaps - - -1 set flowlabel to zero. - 0 copy flowlabel from Inner packet in case of Inner IPv6 - (Set flowlabel to 0 in case IPv4/L2) - 1 Compute the flowlabel using seg6_make_flowlabel() - - Default is 0. - -conf/default/*: - Change the interface-specific default settings. - - -conf/all/*: - Change all the interface-specific settings. - - [XXX: Other special features than forwarding?] - -conf/all/forwarding - BOOLEAN - Enable global IPv6 forwarding between all interfaces. - - IPv4 and IPv6 work differently here; e.g.
netfilter must be used - to control which interfaces may forward packets and which not. - - This also sets all interfaces' Host/Router setting - 'forwarding' to the specified value. See below for details. - - This is referred to as global forwarding. - -proxy_ndp - BOOLEAN - Do proxy ndp. - -fwmark_reflect - BOOLEAN - Controls the fwmark of kernel-generated IPv6 reply packets that are not - associated with a socket (for example, TCP RSTs or ICMPv6 echo replies). - If unset, these packets have a fwmark of zero. If set, they have the - fwmark of the packet they are replying to. - Default: 0 - -conf/interface/*: - Change special settings per interface. - - The functional behaviour for certain settings is different - depending on whether local forwarding is enabled or not. - -accept_ra - INTEGER - Accept Router Advertisements; autoconfigure using them. - - It also determines whether or not to transmit Router - Solicitations. If and only if the functional setting is to - accept Router Advertisements, Router Solicitations will be - transmitted. - - Possible values are: - 0 Do not accept Router Advertisements. - 1 Accept Router Advertisements if forwarding is disabled. - 2 Overrule forwarding behaviour. Accept Router Advertisements - even if forwarding is enabled. - - Functional default: enabled if local forwarding is disabled. - disabled if local forwarding is enabled. - -accept_ra_defrtr - BOOLEAN - Learn default router in Router Advertisement. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_from_local - BOOLEAN - Accept RA with source-address that is found on local machine - if the RA is otherwise proper and able to be accepted. - Default is to NOT accept these as it may be an un-intended - network loop. - - Functional default: - enabled if accept_ra_from_local is enabled - on a specific interface. - disabled if accept_ra_from_local is disabled - on a specific interface.
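The interaction between the accept_ra values listed above (0, 1, 2) and the interface's forwarding setting can be sketched as a small decision function. This is a hedged model of the documented behaviour only; the function name is hypothetical, and values outside the documented range are rejected rather than guessed at.

```python
def accepts_router_advertisements(accept_ra: int, forwarding: int) -> bool:
    """Return True if, per the documented accept_ra values, RAs are accepted.

    accept_ra:  0, 1 or 2 as described in the text above.
    forwarding: the interface's local forwarding setting (0 = host, non-zero = router).
    """
    if accept_ra == 0:
        return False              # never accept Router Advertisements
    if accept_ra == 1:
        return forwarding == 0    # accept only when forwarding is disabled
    if accept_ra == 2:
        return True               # overrule forwarding behaviour
    raise ValueError(f"undocumented accept_ra value: {accept_ra}")


if __name__ == "__main__":
    for ra in (0, 1, 2):
        for fwd in (0, 1):
            print(ra, fwd, accepts_router_advertisements(ra, fwd))
```

Since the text states that Router Solicitations are transmitted if and only if Router Advertisements are accepted, the same result also models whether solicitations are sent.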
- -accept_ra_min_hop_limit - INTEGER - Minimum hop limit Information in Router Advertisement. - - Hop limit Information in Router Advertisement less than this - variable shall be ignored. - - Default: 1 - -accept_ra_pinfo - BOOLEAN - Learn Prefix Information in Router Advertisement. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_rt_info_min_plen - INTEGER - Minimum prefix length of Route Information in RA. - - Route Information w/ prefix smaller than this variable shall - be ignored. - - Functional default: 0 if accept_ra_rtr_pref is enabled. - -1 if accept_ra_rtr_pref is disabled. - -accept_ra_rt_info_max_plen - INTEGER - Maximum prefix length of Route Information in RA. - - Route Information w/ prefix larger than this variable shall - be ignored. - - Functional default: 0 if accept_ra_rtr_pref is enabled. - -1 if accept_ra_rtr_pref is disabled. - -accept_ra_rtr_pref - BOOLEAN - Accept Router Preference in RA. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_mtu - BOOLEAN - Apply the MTU value specified in RA option 5 (RFC4861). If - disabled, the MTU specified in the RA will be ignored. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_redirects - BOOLEAN - Accept Redirects. - - Functional default: enabled if local forwarding is disabled. - disabled if local forwarding is enabled. - -accept_source_route - INTEGER - Accept source routing (routing extension header). - - >= 0: Accept only routing header type 2. - < 0: Do not accept routing header. - - Default: 0 - -autoconf - BOOLEAN - Autoconfigure addresses using Prefix Information in Router - Advertisements. - - Functional default: enabled if accept_ra_pinfo is enabled. - disabled if accept_ra_pinfo is disabled. - -dad_transmits - INTEGER - The amount of Duplicate Address Detection probes to send. 
- Default: 1 - -forwarding - INTEGER - Configure interface-specific Host/Router behaviour. - - Note: It is recommended to have the same setting on all - interfaces; mixed router/host scenarios are rather uncommon. - - Possible values are: - 0 Forwarding disabled - 1 Forwarding enabled - - FALSE (0): - - By default, Host behaviour is assumed. This means: - - 1. IsRouter flag is not set in Neighbour Advertisements. - 2. If accept_ra is TRUE (default), transmit Router - Solicitations. - 3. If accept_ra is TRUE (default), accept Router - Advertisements (and do autoconfiguration). - 4. If accept_redirects is TRUE (default), accept Redirects. - - TRUE (1): - - If local forwarding is enabled, Router behaviour is assumed. - This means exactly the reverse from the above: - - 1. IsRouter flag is set in Neighbour Advertisements. - 2. Router Solicitations are not sent unless accept_ra is 2. - 3. Router Advertisements are ignored unless accept_ra is 2. - 4. Redirects are ignored. - - Default: 0 (disabled) if global forwarding is disabled (default), - otherwise 1 (enabled). - -hop_limit - INTEGER - Default Hop Limit to set. - Default: 64 - -mtu - INTEGER - Default Maximum Transfer Unit - Default: 1280 (IPv6 required minimum) - -ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IPv6 addresses, - which can be quite useful - but may break some applications. - Default: 0 - -router_probe_interval - INTEGER - Minimum interval (in seconds) between Router Probing described - in RFC4191. - - Default: 60 - -router_solicitation_delay - INTEGER - Number of seconds to wait after interface is brought up - before sending Router Solicitations. - Default: 1 - -router_solicitation_interval - INTEGER - Number of seconds to wait between Router Solicitations. - Default: 4 - -router_solicitations - INTEGER - Number of Router Solicitations to send until assuming no - routers are present. 
- Default: 3 - -use_oif_addrs_only - BOOLEAN - When enabled, the candidate source addresses for destinations - routed via this interface are restricted to the set of addresses - configured on this interface (vis. RFC 6724, section 4). - - Default: false - -use_tempaddr - INTEGER - Preference for Privacy Extensions (RFC3041). - <= 0 : disable Privacy Extensions - == 1 : enable Privacy Extensions, but prefer public - addresses over temporary addresses. - > 1 : enable Privacy Extensions and prefer temporary - addresses over public addresses. - Default: 0 (for most devices) - -1 (for point-to-point devices and loopback devices) - -temp_valid_lft - INTEGER - valid lifetime (in seconds) for temporary addresses. - Default: 604800 (7 days) - -temp_prefered_lft - INTEGER - Preferred lifetime (in seconds) for temporary addresses. - Default: 86400 (1 day) - -keep_addr_on_down - INTEGER - Keep all IPv6 addresses on an interface down event. If set static - global addresses with no expiration time are not flushed. - >0 : enabled - 0 : system default - <0 : disabled - - Default: 0 (addresses are removed) - -max_desync_factor - INTEGER - Maximum value for DESYNC_FACTOR, which is a random value - that ensures that clients don't synchronize with each - other and generate new addresses at exactly the same time. - value is in seconds. - Default: 600 - -regen_max_retry - INTEGER - Number of attempts before give up attempting to generate - valid temporary addresses. - Default: 5 - -max_addresses - INTEGER - Maximum number of autoconfigured addresses per interface. Setting - to zero disables the limitation. It is not recommended to set this - value too large (or to zero) because it would be an easy way to - crash the kernel by allowing too many addresses to be created. - Default: 16 - -disable_ipv6 - BOOLEAN - Disable IPv6 operation. If accept_dad is set to 2, this value - will be dynamically set to TRUE if DAD fails for the link-local - address. 
- Default: FALSE (enable IPv6 operation) - - When this value is changed from 1 to 0 (IPv6 is being enabled), - it will dynamically create a link-local address on the given - interface and start Duplicate Address Detection, if necessary. - - When this value is changed from 0 to 1 (IPv6 is being disabled), - it will dynamically delete all addresses and routes on the given - interface. From now on it will not be possible to add addresses/routes - to the selected interface. - -accept_dad - INTEGER - Whether to accept DAD (Duplicate Address Detection). - 0: Disable DAD - 1: Enable DAD (default) - 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate - link-local address has been found. - - DAD operation and mode on a given interface will be selected according - to the maximum value of conf/{all,interface}/accept_dad. - -force_tllao - BOOLEAN - Enable sending the target link-layer address option even when - responding to a unicast neighbor solicitation. - Default: FALSE - - Quoting from RFC 2461, section 4.4, Target link-layer address: - - "The option MUST be included for multicast solicitations in order to - avoid infinite Neighbor Solicitation "recursion" when the peer node - does not have a cache entry to return a Neighbor Advertisements - message. When responding to unicast solicitations, the option can be - omitted since the sender of the solicitation has the correct link- - layer address; otherwise it would not have be able to send the unicast - solicitation in the first place. However, including the link-layer - address in this case adds little overhead and eliminates a potential - race condition where the sender deletes the cached link-layer address - prior to receiving a response to a previous solicitation." - -ndisc_notify - BOOLEAN - Define mode for notification of address and device changes. - 0 - (default): do nothing - 1 - Generate unsolicited neighbour advertisements when device is brought - up or hardware address changes.
- -ndisc_tclass - INTEGER - The IPv6 Traffic Class to use by default when sending IPv6 Neighbor - Discovery (Router Solicitation, Router Advertisement, Neighbor - Solicitation, Neighbor Advertisement, Redirect) messages. - These 8 bits can be interpreted as 6 high order bits holding the DSCP - value and 2 low order bits representing ECN (which you probably want - to leave cleared). - 0 - (default) - -mldv1_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - MLDv1 report retransmit will take place. - Default: 10000 (10 seconds) - -mldv2_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - MLDv2 report retransmit will take place. - Default: 1000 (1 second) - -force_mld_version - INTEGER - 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed - 1 - Enforce to use MLD version 1 - 2 - Enforce to use MLD version 2 - -suppress_frag_ndisc - INTEGER - Control RFC 6980 (Security Implications of IPv6 Fragmentation - with IPv6 Neighbor Discovery) behavior: - 1 - (default) discard fragmented neighbor discovery packets - 0 - allow fragmented neighbor discovery packets - -optimistic_dad - BOOLEAN - Whether to perform Optimistic Duplicate Address Detection (RFC 4429). - 0: disabled (default) - 1: enabled - - Optimistic Duplicate Address Detection for the interface will be enabled - if at least one of conf/{all,interface}/optimistic_dad is set to 1, - it will be disabled otherwise. - -use_optimistic - BOOLEAN - If enabled, do not classify optimistic addresses as deprecated during - source address selection. Preferred addresses will still be chosen - before optimistic addresses, subject to other ranking in the source - address selection algorithm. - 0: disabled (default) - 1: enabled - - This will be enabled if at least one of - conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. 
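The bit layout described for ndisc_tclass above (6 high-order bits of DSCP, 2 low-order bits of ECN) can be illustrated with a short sketch. The helper names are hypothetical; only the bit layout comes from the text.

```python
def make_tclass(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN value into an 8-bit Traffic Class."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0..63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits (0..3)")
    return (dscp << 2) | ecn


def split_tclass(tclass: int) -> tuple:
    """Split an 8-bit Traffic Class into (dscp, ecn)."""
    if not 0 <= tclass <= 255:
        raise ValueError("Traffic Class must fit in 8 bits (0..255)")
    return tclass >> 2, tclass & 0x3


if __name__ == "__main__":
    # DSCP 48 with the ECN bits left cleared, as the text suggests
    print(make_tclass(48))    # 192
    print(split_tclass(192))  # (48, 0)
```

For example, to have Neighbor Discovery messages carry DSCP 48 with ECN cleared, the value written to ndisc_tclass would be 192 under this layout.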
- -stable_secret - IPv6 address - This IPv6 address will be used as a secret to generate IPv6 - addresses for link-local addresses and autoconfigured - ones. All addresses generated after setting this secret will - be stable privacy ones by default. This can be changed via the - addrgenmode ip-link. conf/default/stable_secret is used as the - secret for the namespace, the interface specific ones can - overwrite that. Writes to conf/all/stable_secret are refused. - - It is recommended to generate this secret during installation - of a system and keep it stable after that. - - By default the stable secret is unset. - -addr_gen_mode - INTEGER - Defines how link-local and autoconf addresses are generated. - - 0: generate address based on EUI64 (default) - 1: do not generate a link-local address, use EUI64 for addresses generated - from autoconf - 2: generate stable privacy addresses, using the secret from - stable_secret (RFC7217) - 3: generate stable privacy addresses, using a random secret if unset - -drop_unicast_in_l2_multicast - BOOLEAN - Drop any unicast IPv6 packets that are received in link-layer - multicast (or broadcast) frames. - - By default this is turned off. - -drop_unsolicited_na - BOOLEAN - Drop all unsolicited neighbor advertisements, for example if there's - a known good NA proxy on the network and such frames need not be used - (or in the case of 802.11, must not be used to prevent attacks.) - - By default this is turned off. - -enhanced_dad - BOOLEAN - Include a nonce option in the IPv6 neighbor solicitation messages used for - duplicate address detection per RFC7527. A received DAD NS will only signal - a duplicate address if the nonce is different. This avoids any false - detection of duplicates due to loopback of the NS messages that we send. - The nonce option will be sent on an interface unless both of - conf/{all,interface}/enhanced_dad are set to FALSE.
- Default: TRUE - -icmp/*: -ratelimit - INTEGER - Limit the maximal rates for sending ICMPv6 messages. - 0 to disable any limiting, - otherwise the minimal space between responses in milliseconds. - Default: 1000 - -ratemask - list of comma separated ranges - For ICMPv6 message types matching the ranges in the ratemask, limit - the sending of the message according to ratelimit parameter. - - The format used for both input and output is a comma separated - list of ranges (e.g. "0-127,129" for ICMPv6 message type 0 to 127 and - 129). Writing to the file will clear all previous ranges of ICMPv6 - message types and update the current list with the input. - - Refer to: https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml - for numerical values of ICMPv6 message types, e.g. echo request is 128 - and echo reply is 129. - - Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big) - -echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol. - Default: 0 - -echo_ignore_multicast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol via multicast. - Default: 0 - -echo_ignore_anycast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol destined to anycast address. - Default: 0 - -xfrm6_gc_thresh - INTEGER - (Obsolete since linux-4.14) - The threshold at which we will start garbage collecting for IPv6 - destination cache entries. At twice this value the system will - refuse new allocations. - - -IPv6 Update by: -Pekka Savola -YOSHIFUJI Hideaki / USAGI Project - - -/proc/sys/net/bridge/* Variables: - -bridge-nf-call-arptables - BOOLEAN - 1 : pass bridged ARP traffic to arptables' FORWARD chain. - 0 : disable this. - Default: 1 - -bridge-nf-call-iptables - BOOLEAN - 1 : pass bridged IPv4 traffic to iptables' chains. - 0 : disable this. 
- Default: 1 - -bridge-nf-call-ip6tables - BOOLEAN - 1 : pass bridged IPv6 traffic to ip6tables' chains. - 0 : disable this. - Default: 1 - -bridge-nf-filter-vlan-tagged - BOOLEAN - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. - 0 : disable this. - Default: 0 - -bridge-nf-filter-pppoe-tagged - BOOLEAN - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. - 0 : disable this. - Default: 0 - -bridge-nf-pass-vlan-input-dev - BOOLEAN - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan - interface on the bridge and set the netfilter input device to the vlan. - This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT - target work with vlan-on-top-of-bridge interfaces. When no matching - vlan interface is found, or this switch is off, the input device is - set to the bridge interface. - 0: disable bridge netfilter vlan interface lookup. - Default: 0 - -/proc/sys/net/sctp/* Variables: - -addip_enable - BOOLEAN - Enable or disable extension of Dynamic Address Reconfiguration - (ADD-IP) functionality specified in RFC5061. This extension provides - the ability to dynamically add and remove new addresses for the SCTP - associations. - - 1: Enable extension. - - 0: Disable extension. - - Default: 0 - -pf_enable - INTEGER - Enable or disable pf (pf is short for potentially failed) state. A value - of pf_retrans > path_max_retrans also disables pf state. That is, either - pf_enable or pf_retrans > path_max_retrans can disable pf state. - Since pf_retrans and path_max_retrans can be changed by userspace - applications, a user may expect pf state to be disabled by the value of - pf_retrans > path_max_retrans, but if the value of pf_retrans - or path_max_retrans is later changed by the user application, pf state is - enabled again. As such, it is necessary to add this to dynamically enable - and disable pf state. See: - https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for - details.
- - 1: Enable pf. - - 0: Disable pf. - - Default: 1 - -pf_expose - INTEGER - Unset or enable/disable pf (pf is short for potentially failed) state - exposure. Applications can control the exposure of the PF path state - in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO - sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with - SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info - can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, - a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming - SCTP_PF state and a SCTP_PF-state transport info can be got via - SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no - SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when - trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO - sockopt. - - 0: Unset pf state exposure, Compatible with old applications. - - 1: Disable pf state exposure. - - 2: Enable pf state exposure. - - Default: 0 - -addip_noauth_enable - BOOLEAN - Dynamic Address Reconfiguration (ADD-IP) requires the use of - authentication to protect the operations of adding or removing new - addresses. This requirement is mandated so that unauthorized hosts - would not be able to hijack associations. However, older - implementations may not have implemented this requirement while - allowing the ADD-IP extension. For reasons of interoperability, - we provide this variable to control the enforcement of the - authentication requirement. - - 1: Allow ADD-IP extension to be used without authentication. This - should only be set in a closed environment for interoperability - with older implementations. - - 0: Enforce the authentication requirement - - Default: 0 - -auth_enable - BOOLEAN - Enable or disable Authenticated Chunks extension. This extension - provides the ability to send and receive authenticated chunks and is - required for secure operation of Dynamic Address Reconfiguration - (ADD-IP) extension.
- - 1: Enable this extension. - 0: Disable this extension. - - Default: 0 - -prsctp_enable - BOOLEAN - Enable or disable the Partial Reliability extension (RFC3758) which - is used to notify peers that a given DATA should no longer be expected. - - 1: Enable extension - 0: Disable - - Default: 1 - -max_burst - INTEGER - The limit of the number of new packets that can be initially sent. It - controls how bursty the generated traffic can be. - - Default: 4 - -association_max_retrans - INTEGER - Set the maximum number of retransmissions that an association can - attempt before deciding that the remote end is unreachable. If this value - is exceeded, the association is terminated. - - Default: 10 - -max_init_retransmits - INTEGER - The maximum number of retransmissions of INIT and COOKIE-ECHO chunks - that an association will attempt before declaring the destination - unreachable and terminating. - - Default: 8 - -path_max_retrans - INTEGER - The maximum number of retransmissions that will be attempted on a given - path. Once this threshold is exceeded, the path is considered - unreachable, and new traffic will use a different path when the - association is multihomed. - - Default: 5 - -pf_retrans - INTEGER - The number of retransmissions that will be attempted on a given path - before traffic is redirected to an alternate transport (should one - exist). Note this is distinct from path_max_retrans, as a path that - passes the pf_retrans threshold can still be used. It is only - deprioritized when a transmission path is selected by the stack. This - setting is primarily used to enable fast failover mechanisms without - having to reduce path_max_retrans to a very low value. See: - http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt - for details. Note also that a value of pf_retrans > path_max_retrans - disables this feature. Since both pf_retrans and path_max_retrans can - be changed by userspace application, a variable pf_enable is used to - disable pf state.
- - Default: 0 - -ps_retrans - INTEGER - Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming - from section-5 "Primary Path Switchover" in rfc7829. The primary path - will be changed to another active path when the path error counter on - the old primary path exceeds PSMR, so that "the SCTP sender is allowed - to continue data transmission on a new working path even when the old - primary destination address becomes active again". Note this feature - is disabled by initializing 'ps_retrans' per netns as 0xffff by default, - and its value can't be less than 'pf_retrans' when changing by sysctl. - - Default: 0xffff - -rto_initial - INTEGER - The initial round trip timeout value in milliseconds that will be used - in calculating round trip times. This is the initial time interval - for retransmissions. - - Default: 3000 - -rto_max - INTEGER - The maximum value (in milliseconds) of the round trip timeout. This - is the largest time interval that can elapse between retransmissions. - - Default: 60000 - -rto_min - INTEGER - The minimum value (in milliseconds) of the round trip timeout. This - is the smallest time interval that can elapse between retransmissions. - - Default: 1000 - -hb_interval - INTEGER - The interval (in milliseconds) between HEARTBEAT chunks. These chunks - are sent at the specified interval on idle paths to probe the state of - a given path between 2 associations. - - Default: 30000 - -sack_timeout - INTEGER - The amount of time (in milliseconds) that the implementation will wait - to send a SACK. - - Default: 200 - -valid_cookie_life - INTEGER - The default lifetime of the SCTP cookie (in milliseconds). The cookie - is used during association establishment. - - Default: 60000 - -cookie_preserve_enable - BOOLEAN - Enable or disable the ability to extend the lifetime of the SCTP cookie - that is used during the establishment phase of SCTP association - - 1: Enable cookie lifetime extension.
- 0: Disable - - Default: 1 - -cookie_hmac_alg - STRING - Select the hmac algorithm used when generating the cookie value sent by - a listening sctp socket to a connecting client in the INIT-ACK chunk. - Valid values are: - * md5 - * sha1 - * none - Ability to assign md5 or sha1 as the selected alg is predicated on the - configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and - CONFIG_CRYPTO_SHA1). - - Default: Dependent on configuration. MD5 if available, else SHA1 if - available, else none. - -rcvbuf_policy - INTEGER - Determines if the receive buffer is attributed to the socket or to - association. SCTP supports the capability to create multiple - associations on a single socket. When using this capability, it is - possible that a single stalled association that's buffering a lot - of data may block other associations from delivering their data by - consuming all of the receive buffer space. To work around this, - the rcvbuf_policy could be set to attribute the receiver buffer space - to each association instead of the socket. This prevents the described - blocking. - - 1: rcvbuf space is per association - 0: rcvbuf space is per socket - - Default: 0 - -sndbuf_policy - INTEGER - Similar to rcvbuf_policy above, this applies to send buffer space. - - 1: Send buffer is tracked per association - 0: Send buffer is tracked per socket. - - Default: 0 - -sctp_mem - vector of 3 INTEGERs: min, pressure, max - Number of pages allowed for queueing by all SCTP sockets. - - min: Below this number of pages SCTP is not bothered about its - memory appetite. When amount of memory allocated by SCTP exceeds - this number, SCTP starts to moderate memory usage. - - pressure: This value was introduced to follow format of tcp_mem. - - max: Number of pages allowed for queueing by all SCTP sockets. - - Default is calculated at boot time from amount of available memory. 
- -sctp_rmem - vector of 3 INTEGERs: min, default, max - Only the first value ("min") is used, "default" and "max" are - ignored. - - min: Minimal size of receive buffer used by SCTP socket. - It is guaranteed to each SCTP socket (but not association) even - under moderate memory pressure. - - Default: 4K - -sctp_wmem - vector of 3 INTEGERs: min, default, max - Currently this tunable has no effect. - -addr_scope_policy - INTEGER - Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 - - 0 - Disable IPv4 address scoping - 1 - Enable IPv4 address scoping - 2 - Follow draft but allow IPv4 private addresses - 3 - Follow draft but allow IPv4 link local addresses - - Default: 1 - - -/proc/sys/net/core/* - Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. - - -/proc/sys/net/unix/* -max_dgram_qlen - INTEGER - The maximum length of dgram socket receive queue - - Default: 10 - diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst index 10e11099e74a..4edd0d38779e 100644 --- a/Documentation/networking/snmp_counter.rst +++ b/Documentation/networking/snmp_counter.rst @@ -792,7 +792,7 @@ counters to indicate the ACK is skipped in which scenario. The ACK would only be skipped if the received packet is either a SYN packet or it has no data. -.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt +.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst * TcpExtTCPACKSkippedSynRecv diff --git a/net/Kconfig b/net/Kconfig index df8d8c9bd021..8b1f85820a6b 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -86,7 +86,7 @@ config INET "Sysctl support" below, you can change various aspects of the behavior of the TCP/IP code by writing to the (virtual) files in /proc/sys/net/ipv4/*; the options are explained in the file - . + . Short answer: say Y. 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 25a8888826b8..5da4733067fb 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -49,7 +49,7 @@ config IP_ADVANCED_ROUTER Note that some distributions enable it in startup scripts. For details about rp_filter strict and loose mode read - . + . If unsure, say N here. diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index fc61f51d87a3..956a806649f7 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -853,7 +853,7 @@ static bool icmp_unreach(struct sk_buff *skb) case ICMP_FRAG_NEEDED: /* for documentation of the ip_no_pmtu_disc * values please see - * Documentation/networking/ip-sysctl.txt + * Documentation/networking/ip-sysctl.rst */ switch (net->ipv4.sysctl_ip_no_pmtu_disc) { default: -- cgit From 19093313cb0486d568232934bb80dd422d891623 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:50 +0200 Subject: docs: networking: convert ipv6.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - mark a literal as such, in order to avoid a warning; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/admin-guide/kernel-parameters.txt | 6 +- Documentation/networking/index.rst | 1 + Documentation/networking/ipv6.rst | 78 +++++++++++++++++++++++++ Documentation/networking/ipv6.txt | 72 ----------------------- net/ipv6/Kconfig | 2 +- 5 files changed, 83 insertions(+), 76 deletions(-) create mode 100644 Documentation/networking/ipv6.rst delete mode 100644 Documentation/networking/ipv6.txt diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e37db6f1be64..e43f2e1f2958 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -356,7 +356,7 @@ shot down by NMI autoconf= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. 
show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller Limit apic dumping. The parameter defines the maximal @@ -872,7 +872,7 @@ miss to occur. disable= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. hardened_usercopy= [KNL] Under CONFIG_HARDENED_USERCOPY, whether @@ -912,7 +912,7 @@ to workaround buggy firmware. disable_ipv6= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. disable_mtrr_cleanup [X86] The kernel tries to adjust MTRR layout from continuous diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7d133d8dbe2a..709675464e51 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -70,6 +70,7 @@ Contents: iphase ipsec ip-sysctl + ipv6 .. only:: subproject and html diff --git a/Documentation/networking/ipv6.rst b/Documentation/networking/ipv6.rst new file mode 100644 index 000000000000..ba09c2f2dcc7 --- /dev/null +++ b/Documentation/networking/ipv6.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +IPv6 +==== + + +Options for the ipv6 module are supplied as parameters at load time. + +Module options may be given as command line arguments to the insmod +or modprobe command, but are usually specified in either +``/etc/modules.d/*.conf`` configuration files, or in a distro-specific +configuration file. + +The available ipv6 module parameters are listed below. If a parameter +is not specified the default value is used. + +The parameters are as follows: + +disable + + Specifies whether to load the IPv6 module, but disable all + its functionality. This might be used when another module + has a dependency on the IPv6 module being loaded, but no + IPv6 addresses or operations are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled. + + This is the default value. + + 1 + IPv6 is disabled. 
+ + No IPv6 addresses will be added to interfaces, and + it will not be possible to open an IPv6 socket. + + A reboot is required to enable IPv6. + +autoconf + + Specifies whether to enable IPv6 address autoconfiguration + on all interfaces. This might be used when one does not wish + for addresses to be automatically generated from prefixes + received in Router Advertisements. + + The possible values and their effects are: + + 0 + IPv6 address autoconfiguration is disabled on all interfaces. + + Only the IPv6 loopback address (::1) and link-local addresses + will be added to interfaces. + + 1 + IPv6 address autoconfiguration is enabled on all interfaces. + + This is the default value. + +disable_ipv6 + + Specifies whether to disable IPv6 on all interfaces. + This might be used when no IPv6 addresses are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled on all interfaces. + + This is the default value. + + 1 + IPv6 is disabled on all interfaces. + + No IPv6 addresses will be added to interfaces. + diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt deleted file mode 100644 index 6cd74fa55358..000000000000 --- a/Documentation/networking/ipv6.txt +++ /dev/null @@ -1,72 +0,0 @@ - -Options for the ipv6 module are supplied as parameters at load time. - -Module options may be given as command line arguments to the insmod -or modprobe command, but are usually specified in either -/etc/modules.d/*.conf configuration files, or in a distro-specific -configuration file. - -The available ipv6 module parameters are listed below. If a parameter -is not specified the default value is used. - -The parameters are as follows: - -disable - - Specifies whether to load the IPv6 module, but disable all - its functionality. This might be used when another module - has a dependency on the IPv6 module being loaded, but no - IPv6 addresses or operations are desired. 
- - The possible values and their effects are: - - 0 - IPv6 is enabled. - - This is the default value. - - 1 - IPv6 is disabled. - - No IPv6 addresses will be added to interfaces, and - it will not be possible to open an IPv6 socket. - - A reboot is required to enable IPv6. - -autoconf - - Specifies whether to enable IPv6 address autoconfiguration - on all interfaces. This might be used when one does not wish - for addresses to be automatically generated from prefixes - received in Router Advertisements. - - The possible values and their effects are: - - 0 - IPv6 address autoconfiguration is disabled on all interfaces. - - Only the IPv6 loopback address (::1) and link-local addresses - will be added to interfaces. - - 1 - IPv6 address autoconfiguration is enabled on all interfaces. - - This is the default value. - -disable_ipv6 - - Specifies whether to disable IPv6 on all interfaces. - This might be used when no IPv6 addresses are desired. - - The possible values and their effects are: - - 0 - IPv6 is enabled on all interfaces. - - This is the default value. - - 1 - IPv6 is disabled on all interfaces. - - No IPv6 addresses will be added to interfaces. - diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index 2ccaee98fddb..5a6111da26c4 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -13,7 +13,7 @@ menuconfig IPV6 For general information about IPv6, see . 
For specific information about IPv6 under Linux, see - Documentation/networking/ipv6.txt and read the HOWTO at + Documentation/networking/ipv6.rst and read the HOWTO at To compile this protocol support as a module, choose M here: the -- cgit From 1dc2a785954bf4e562d0c85bea435ee56f705db5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:51 +0200 Subject: docs: networking: convert ipvlan.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipvlan.rst | 189 ++++++++++++++++++++++++++++++++++++ Documentation/networking/ipvlan.txt | 146 ---------------------------- 3 files changed, 190 insertions(+), 146 deletions(-) create mode 100644 Documentation/networking/ipvlan.rst delete mode 100644 Documentation/networking/ipvlan.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 709675464e51..54dee1575b54 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -71,6 +71,7 @@ Contents: ipsec ip-sysctl ipv6 + ipvlan .. only:: subproject and html diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst new file mode 100644 index 000000000000..694adcba36b0 --- /dev/null +++ b/Documentation/networking/ipvlan.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +IPVLAN Driver HOWTO +=================== + +Initial Release: + Mahesh Bandewar + +1. Introduction: +================ +This is conceptually very similar to the macvlan driver with one major +exception of using L3 for mux-ing /demux-ing among slaves. This property makes +the master device share the L2 with it's slave devices. 
I have developed this +driver in conjunction with network namespaces and am not sure if there is a use case +outside of it. + + +2. Building and Installation: +============================= + +In order to build the driver, please select the config item CONFIG_IPVLAN. +The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module +(CONFIG_IPVLAN=m). + + +3. Configuration: +================= + +There are no module parameters for this driver and it can be configured +using IProute2/ip utility. +:: + + ip link add link name type ipvlan [ mode MODE ] [ FLAGS ] + where + MODE: l3 (default) | l3s | l2 + FLAGS: bridge (default) | private | vepa + +e.g. + + (a) The following will create an IPvlan link with eth0 as master in + L3 bridge mode:: + + bash# ip link add link eth0 name ipvl0 type ipvlan + (b) This command will create an IPvlan link in L2 bridge mode:: + + bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge + + (c) This command will create an IPvlan device in L2 private mode:: + + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private + + (d) This command will create an IPvlan device in L2 vepa mode:: + + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa + + +4. Operating modes: +=================== + +IPvlan has two modes of operation - L2 and L3. For a given master device, +you can select one of these two modes and all slaves on that master will +operate in the same (selected) mode. The RX mode is almost identical except +that in L3 mode the slaves won't receive any multicast / broadcast traffic. +L3 mode is more restrictive since routing is controlled from the other (mostly) +default namespace. + +4.1 L2 mode: +------------ + +In this mode TX processing happens on the stack instance attached to the +slave device and packets are switched and queued to the master device to send +out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) +as well.
+ +4.2 L3 mode: +------------ + +In this mode TX processing up to L3 happens on the stack instance attached +to the slave device and packets are switched to the stack instance of the +master device for the L2 processing and routing from that instance will be +used before packets are queued on the outbound device. In this mode the slaves +will not receive nor can send multicast / broadcast traffic. + +4.3 L3S mode: +------------- + +This is very similar to the L3 mode except that iptables (conn-tracking) +works in this mode and hence it is L3-symmetric (L3s). This will have slightly less +performance but that shouldn't matter since you are choosing this mode over plain-L3 +mode to make conn-tracking work. + +5. Mode flags: +============== + +At this time following mode flags are available + +5.1 bridge: +----------- +This is the default option. To configure the IPvlan port in this mode, +user can choose to either add this option on the command-line or don't specify +anything. This is the traditional mode where slaves can cross-talk among +themselves apart from talking through the master device. + +5.2 private: +------------ +If this option is added to the command-line, the port is set in private +mode. i.e. port won't allow cross communication between slaves. + +5.3 vepa: +--------- +If this is added to the command-line, the port is set in VEPA mode. +i.e. port will offload switching functionality to the external entity as +described in 802.1Qbg +Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the +master-device, so the packets which are emitted in this mode for the adjacent +neighbor will have source and destination mac same. This will make the switch / +router send the redirect message. + +6. What to choose (macvlan vs. ipvlan)? +======================================= + +These two devices are very similar in many regards and the specific use +case could very well define which device to choose. 
If one of the following +situations defines your use case then you can choose to use ipvlan: + + +(a) The Linux host that is connected to the external switch / router has + policy configured that allows only one mac per port. +(b) The number of virtual devices created on a master exceeds the mac capacity and + puts the NIC in promiscuous mode and degraded performance is a concern. +(c) If the slave device is to be put into the hostile / untrusted network + namespace where L2 on the slave could be changed / misused. + + +7. Example configuration: +========================= + +:: + + +=============================================================+ + | Host: host1 | + | | + | +----------------------+ +----------------------+ | + | | NS:ns0 | | NS:ns1 | | + | | | | | | + | | | | | | + | | ipvl0 | | ipvl1 | | + | +----------#-----------+ +-----------#----------+ | + | # # | + | ################################ | + | # eth0 | + +==============================#==============================+ + + +(a) Create two network namespaces - ns0, ns1:: + + ip netns add ns0 + ip netns add ns1 + +(b) Create two ipvlan slaves on eth0 (master device):: + + ip link add link eth0 ipvl0 type ipvlan mode l2 + ip link add link eth0 ipvl1 type ipvlan mode l2 + +(c) Assign slaves to the respective network namespaces:: + + ip link set dev ipvl0 netns ns0 + ip link set dev ipvl1 netns ns1 + +(d) Now switch to the namespace (ns0 or ns1) to configure the slave devices + + - For ns0:: + + (1) ip netns exec ns0 bash + (2) ip link set dev ipvl0 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl0 + (6) ip -4 route add default via $ROUTER dev ipvl0 + + - For ns1:: + + (1) ip netns exec ns1 bash + (2) ip link set dev ipvl1 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl1 + (6) ip -4 route add default via $ROUTER dev ipvl1 diff --git a/Documentation/networking/ipvlan.txt
b/Documentation/networking/ipvlan.txt deleted file mode 100644 index 27a38e50c287..000000000000 --- a/Documentation/networking/ipvlan.txt +++ /dev/null @@ -1,146 +0,0 @@ - - IPVLAN Driver HOWTO - -Initial Release: - Mahesh Bandewar - -1. Introduction: - This is conceptually very similar to the macvlan driver with one major -exception of using L3 for mux-ing /demux-ing among slaves. This property makes -the master device share the L2 with it's slave devices. I have developed this -driver in conjunction with network namespaces and not sure if there is use case -outside of it. - - -2. Building and Installation: - In order to build the driver, please select the config item CONFIG_IPVLAN. -The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module -(CONFIG_IPVLAN=m). - - -3. Configuration: - There are no module parameters for this driver and it can be configured -using IProute2/ip utility. - - ip link add link name type ipvlan [ mode MODE ] [ FLAGS ] - where - MODE: l3 (default) | l3s | l2 - FLAGS: bridge (default) | private | vepa - - e.g. - (a) Following will create IPvlan link with eth0 as master in - L3 bridge mode - bash# ip link add link eth0 name ipvl0 type ipvlan - (b) This command will create IPvlan link in L2 bridge mode. - bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge - (c) This command will create an IPvlan device in L2 private mode. - bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private - (d) This command will create an IPvlan device in L2 vepa mode. - bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa - - -4. Operating modes: - IPvlan has two modes of operation - L2 and L3. For a given master device, -you can select one of these two modes and all slaves on that master will -operate in the same (selected) mode. The RX mode is almost identical except -that in L3 mode the slaves wont receive any multicast / broadcast traffic. 
-L3 mode is more restrictive since routing is controlled from the other (mostly) -default namespace. - -4.1 L2 mode: - In this mode TX processing happens on the stack instance attached to the -slave device and packets are switched and queued to the master device to send -out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) -as well. - -4.2 L3 mode: - In this mode TX processing up to L3 happens on the stack instance attached -to the slave device and packets are switched to the stack instance of the -master device for the L2 processing and routing from that instance will be -used before packets are queued on the outbound device. In this mode the slaves -will not receive nor can send multicast / broadcast traffic. - -4.3 L3S mode: - This is very similar to the L3 mode except that iptables (conn-tracking) -works in this mode and hence it is L3-symmetric (L3s). This will have slightly less -performance but that shouldn't matter since you are choosing this mode over plain-L3 -mode to make conn-tracking work. - -5. Mode flags: - At this time following mode flags are available - -5.1 bridge: - This is the default option. To configure the IPvlan port in this mode, -user can choose to either add this option on the command-line or don't specify -anything. This is the traditional mode where slaves can cross-talk among -themselves apart from talking through the master device. - -5.2 private: - If this option is added to the command-line, the port is set in private -mode. i.e. port won't allow cross communication between slaves. - -5.3 vepa: - If this is added to the command-line, the port is set in VEPA mode. -i.e. port will offload switching functionality to the external entity as -described in 802.1Qbg -Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the -master-device, so the packets which are emitted in this mode for the adjacent -neighbor will have source and destination mac same. 
This will make the switch / -router send the redirect message. - -6. What to choose (macvlan vs. ipvlan)? - These two devices are very similar in many regards and the specific use -case could very well define which device to choose. if one of the following -situations defines your use case then you can choose to use ipvlan - - (a) The Linux host that is connected to the external switch / router has -policy configured that allows only one mac per port. - (b) No of virtual devices created on a master exceed the mac capacity and -puts the NIC in promiscuous mode and degraded performance is a concern. - (c) If the slave device is to be put into the hostile / untrusted network -namespace where L2 on the slave could be changed / misused. - - -6. Example configuration: - - +=============================================================+ - | Host: host1 | - | | - | +----------------------+ +----------------------+ | - | | NS:ns0 | | NS:ns1 | | - | | | | | | - | | | | | | - | | ipvl0 | | ipvl1 | | - | +----------#-----------+ +-----------#----------+ | - | # # | - | ################################ | - | # eth0 | - +==============================#==============================+ - - - (a) Create two network namespaces - ns0, ns1 - ip netns add ns0 - ip netns add ns1 - - (b) Create two ipvlan slaves on eth0 (master device) - ip link add link eth0 ipvl0 type ipvlan mode l2 - ip link add link eth0 ipvl1 type ipvlan mode l2 - - (c) Assign slaves to the respective network namespaces - ip link set dev ipvl0 netns ns0 - ip link set dev ipvl1 netns ns1 - - (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices - - For ns0 - (1) ip netns exec ns0 bash - (2) ip link set dev ipvl0 up - (3) ip link set dev lo up - (4) ip -4 addr add 127.0.0.1 dev lo - (5) ip -4 addr add $IPADDR dev ipvl0 - (6) ip -4 route add default via $ROUTER dev ipvl0 - - For ns1 - (1) ip netns exec ns1 bash - (2) ip link set dev ipvl1 up - (3) ip link set dev lo up - (4) ip -4 addr add 127.0.0.1 
dev lo - (5) ip -4 addr add $IPADDR dev ipvl1 - (6) ip -4 route add default via $ROUTER dev ipvl1 -- cgit From 82a07bf33d7d0c3a194f62178e0fea2d68227b89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:52 +0200 Subject: docs: networking: convert ipvs-sysctl.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Simon Horman Signed-off-by: David S. Miller --- Documentation/admin-guide/sysctl/net.rst | 4 +- Documentation/networking/index.rst | 1 + Documentation/networking/ipvs-sysctl.rst | 302 +++++++++++++++++++++++++++++++ Documentation/networking/ipvs-sysctl.txt | 294 ------------------------------ MAINTAINERS | 2 +- 5 files changed, 306 insertions(+), 297 deletions(-) create mode 100644 Documentation/networking/ipvs-sysctl.rst delete mode 100644 Documentation/networking/ipvs-sysctl.txt diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 84e3348a9543..2ad1b77a7182 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -353,8 +353,8 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings ------------------------------------- -Please see: Documentation/networking/ip-sysctl.rst and ipvs-sysctl.txt for -descriptions of these entries. +Please see: Documentation/networking/ip-sysctl.rst and +Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. 4. Appletalk diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 54dee1575b54..bbd4e0041457 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -72,6 +72,7 @@ Contents: ip-sysctl ipv6 ipvlan + ipvs-sysctl .. 
only:: subproject and html diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst new file mode 100644 index 000000000000..be36c4600e8f --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.rst @@ -0,0 +1,302 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +IPvs-sysctl +=========== + +/proc/sys/net/ipv4/vs/* Variables: +================================== + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. + +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is not + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +backup_only - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conn_reuse_mode - INTEGER + 1 - default + + Controls how ipvs will deal with connections that are detected + as port reuse. It is a bitmap, with the values being: + + 0: disable any special handling on port reuse. The new + connection will be delivered to the same real server that was + servicing the previous connection. This will effectively + disable expire_nodest_conn. + + bit 1: enable rescheduling of new connections when it is safe. + That is, whenever expire_nodest_conn and for TCP sockets, when + the connection is in TIME_WAIT state (which is only possible if + you use NAT mode). + + bit 2: it is bit 1 plus, for TCP connections, when connections + are in FIN_WAIT state, as this is the last state seen by load + balancer in Direct Routing mode. This bit helps on adding new + real servers to a very busy cluster.
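The conn_reuse_mode bitmap described above can be illustrated with a small sketch. It assumes that "bit 1" and "bit 2" in the text correspond to the integer values 1 and 2, so that the default of 1 selects only the first behaviour. The function name and the returned labels are hypothetical and are not part of IPVS.

```python
def decode_conn_reuse_mode(value):
    """Return the behaviours selected by a conn_reuse_mode value.

    Assumption (not confirmed by the text): "bit 1" maps to integer
    value 1 and "bit 2" maps to integer value 2.
    """
    if value == 0:
        # 0 disables any special handling on port reuse.
        return ["no special handling on port reuse"]
    behaviours = []
    if value & 1:
        behaviours.append("reschedule when safe")
    if value & 2:
        behaviours.append("reschedule when safe, also in FIN_WAIT")
    return behaviours
```

Under this reading, a value of 3 would enable both behaviours, while the default of 1 enables only the first.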
+ +conntrack - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. + +cache_bypass - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + - 0 - transmission error messages (default) + - 1 - non-fatal error messages + - 2 - configuration + - 3 - destination trash + - 4 - drop entry + - 5 - service lookup + - 6 - scheduling + - 7 - connection new/expire, lookup and synchronization + - 8 - state transition + - 9 - binding destination, template checks and applications + - 10 - IPVS packet transmission + - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + - 12 or more - packet traversal + + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + - 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. 
In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is not enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that the strategy is always enabled. + +drop_packet - INTEGER + - 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the following + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and adds back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people require to flush + connections when its destination is not available.
+ +expire_quiescent_template - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. + + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +ignore_tunneled - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, ipvs will set the ipvs_property on all packets which are of + unrecognized protocols. This prevents us from routing tunneled + protocols like ipip, which is useful to prevent rescheduling + packets that have been tunneled to the ipvs host (i.e. to prevent + ipvs routing loops when ipvs is also acting as a real server). + +nat_icmp_send - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +pmtu_disc - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + By default, reject with FRAG_NEEDED all DF packets that exceed + the PMTU, irrespective of the forwarding method. For TUN method + the flag can be disabled to fragment such packets. + +secure_tcp - INTEGER + - 0 - disabled (default) + + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. + + The value definition is the same as that of drop_entry and + drop_packet. 
+ +sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period + default 3 50 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus sync_period equals the threshold. The range of the + threshold is from 0 to sync_period. + + When sync_period and sync_refresh_period are 0, send sync only + for state changes or only once when pkts matches sync_threshold + +sync_refresh_period - UNSIGNED INTEGER + default 0 + + In seconds, difference in reported connection timer that triggers + new sync message. It can be used to avoid sync messages for the + specified period (or half of the connection timeout if it is lower) + if connection state is not changed since last sync. + + This is useful for normal connections with high traffic to reduce + sync rate. Additionally, retry sync_retries times with period of + sync_refresh_period/8. + +sync_retries - INTEGER + default 0 + + Defines sync retries with period of sync_refresh_period/8. Useful + to protect against loss of sync messages. The range of the + sync_retries is from 0 to 3. + +sync_qlen_max - UNSIGNED LONG + + Hard limit for queued sync messages that are not sent yet. It + defaults to 1/32 of the memory pages but actually represents + number of messages. It will protect us from allocating large + parts of memory when the sending rate is lower than the queuing + rate. + +sync_sock_size - INTEGER + default 0 + + Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. + Default value is 0 (preserve system defaults). + +sync_ports - INTEGER + default 1 + + The number of threads that master and backup servers can use for + sync traffic. Every thread will use single UDP port, thread 0 will + use the default port 8848 while last thread will use port + 8848+sync_ports-1. 
+ +snat_reroute - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. + + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 0 and version 1 of the synchronisation protocol.
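Editorial note (not part of the patch): as a worked illustration of two pieces of arithmetic in the sysctl text above, the sketch below evaluates the automatic-mode drop_packet formula and lists the UDP ports implied by sync_ports. The function names `drop_rate` and `sync_udp_ports` are hypothetical, and the use of integer division is an assumption of this sketch; nothing here is taken from the IPVS sources.

```python
from typing import List


def drop_rate(amemthresh: int, available_memory: int) -> int:
    """Evaluate rate = amemthresh / (amemthresh - available_memory).

    The text defines the formula only while available memory is below
    the threshold. Integer division is an assumption of this sketch.
    """
    if not 0 <= available_memory < amemthresh:
        raise ValueError("need 0 <= available_memory < amemthresh")
    return amemthresh // (amemthresh - available_memory)


def sync_udp_ports(sync_ports: int, base_port: int = 8848) -> List[int]:
    """List the ports for threads 0 .. sync_ports-1 (8848 is the documented default)."""
    if sync_ports < 1:
        raise ValueError("sync_ports must be at least 1")
    return [base_port + thread for thread in range(sync_ports)]


if __name__ == "__main__":
    # Default threshold of 1024 pages with 768 pages available:
    # 1024 // (1024 - 768) = 4, so one packet in four would be dropped.
    print(drop_rate(1024, 768))
    # Three sync threads use the default port and the two that follow it.
    print(sync_udp_ports(3))
```

With no memory available the rate evaluates to 1, which matches the statement that a rate of 1 drops all incoming packets.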
diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt deleted file mode 100644 index 056898685d40..000000000000 --- a/Documentation/networking/ipvs-sysctl.txt +++ /dev/null @@ -1,294 +0,0 @@ -/proc/sys/net/ipv4/vs/* Variables: - -am_droprate - INTEGER - default 10 - - It sets the always mode drop rate, which is used in the mode 3 - of the drop_rate defense. - -amemthresh - INTEGER - default 1024 - - It sets the available memory threshold (in pages), which is - used in the automatic modes of defense. When there is no - enough available memory, the respective strategy will be - enabled and the variable is automatically set to 2, otherwise - the strategy is disabled and the variable is set to 1. - -backup_only - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, disable the director function while the server is - in backup mode to avoid packet loops for DR/TUN methods. - -conn_reuse_mode - INTEGER - 1 - default - - Controls how ipvs will deal with connections that are detected - port reuse. It is a bitmap, with the values being: - - 0: disable any special handling on port reuse. The new - connection will be delivered to the same real server that was - servicing the previous connection. This will effectively - disable expire_nodest_conn. - - bit 1: enable rescheduling of new connections when it is safe. - That is, whenever expire_nodest_conn and for TCP sockets, when - the connection is in TIME_WAIT state (which is only possible if - you use NAT mode). - - bit 2: it is bit 1 plus, for TCP connections, when connections - are in FIN_WAIT state, as this is the last state seen by load - balancer in Direct Routing mode. This bit helps on adding new - real servers to a very busy cluster. - -conntrack - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, maintain connection tracking entries for - connections handled by IPVS. 
- - This should be enabled if connections handled by IPVS are to be - also handled by stateful firewall rules. That is, iptables rules - that make use of connection tracking. It is a performance - optimisation to disable this setting otherwise. - - Connections handled by the IPVS FTP application module - will have connection tracking entries regardless of this setting. - - Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. - -cache_bypass - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If it is enabled, forward packets to the original destination - directly when no cache server is available and destination - address is not local (iph->daddr is RTN_UNICAST). It is mostly - used in transparent web cache cluster. - -debug_level - INTEGER - 0 - transmission error messages (default) - 1 - non-fatal error messages - 2 - configuration - 3 - destination trash - 4 - drop entry - 5 - service lookup - 6 - scheduling - 7 - connection new/expire, lookup and synchronization - 8 - state transition - 9 - binding destination, template checks and applications - 10 - IPVS packet transmission - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) - 12 or more - packet traversal - - Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. - - Higher debugging levels include the messages for lower debugging - levels, so setting debug level 2, includes level 0, 1 and 2 - messages. Thus, logging becomes more and more verbose the higher - the level. - -drop_entry - INTEGER - 0 - disabled (default) - - The drop_entry defense is to randomly drop entries in the - connection hash table, just in order to collect back some - memory for new connections. In the current code, the - drop_entry procedure can be activated every second, then it - randomly scans 1/32 of the whole and drops entries that are in - the SYN-RECV/SYNACK state, which should be effective against - syn-flooding attack. 
- - The valid values of drop_entry are from 0 to 3, where 0 means - that this strategy is always disabled, 1 and 2 mean automatic - modes (when there is no enough available memory, the strategy - is enabled and the variable is automatically set to 2, - otherwise the strategy is disabled and the variable is set to - 1), and 3 means that that the strategy is always enabled. - -drop_packet - INTEGER - 0 - disabled (default) - - The drop_packet defense is designed to drop 1/rate packets - before forwarding them to real servers. If the rate is 1, then - drop all the incoming packets. - - The value definition is the same as that of the drop_entry. In - the automatic mode, the rate is determined by the follow - formula: rate = amemthresh / (amemthresh - available_memory) - when available memory is less than the available memory - threshold. When the mode 3 is set, the always mode drop rate - is controlled by the /proc/sys/net/ipv4/vs/am_droprate. - -expire_nodest_conn - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - The default value is 0, the load balancer will silently drop - packets when its destination server is not available. It may - be useful, when user-space monitoring program deletes the - destination server (because of server overload or wrong - detection) and add back the server later, and the connections - to the server can continue. - - If this feature is enabled, the load balancer will expire the - connection immediately when a packet arrives and its - destination server is not available, then the client program - will be notified that the connection is closed. This is - equivalent to the feature some people requires to flush - connections when its destination is not available. - -expire_quiescent_template - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - When set to a non-zero value, the load balancer will expire - persistent templates when the destination server is quiescent. 
- This may be useful, when a user makes a destination server - quiescent by setting its weight to 0 and it is desired that - subsequent otherwise persistent connections are sent to a - different destination server. By default new persistent - connections are allowed to quiescent destination servers. - - If this feature is enabled, the load balancer will expire the - persistence template if it is to be used to schedule a new - connection and the destination server is quiescent. - -ignore_tunneled - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, ipvs will set the ipvs_property on all packets which are of - unrecognized protocols. This prevents us from routing tunneled - protocols like ipip, which is useful to prevent rescheduling - packets that have been tunneled to the ipvs host (i.e. to prevent - ipvs routing loops when ipvs is also acting as a real server). - -nat_icmp_send - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - It controls sending icmp error messages (ICMP_DEST_UNREACH) - for VS/NAT when the load balancer receives packets from real - servers but the connection entries don't exist. - -pmtu_disc - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - By default, reject with FRAG_NEEDED all DF packets that exceed - the PMTU, irrespective of the forwarding method. For TUN method - the flag can be disabled to fragment such packets. - -secure_tcp - INTEGER - 0 - disabled (default) - - The secure_tcp defense is to use a more complicated TCP state - transition table. For VS/NAT, it also delays entering the - TCP ESTABLISHED state until the three way handshake is completed. - - The value definition is the same as that of drop_entry and - drop_packet. - -sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period - default 3 50 - - It sets synchronization threshold, which is the minimum number - of incoming packets that a connection needs to receive before - the connection will be synchronized. 
A connection will be - synchronized, every time the number of its incoming packets - modulus sync_period equals the threshold. The range of the - threshold is from 0 to sync_period. - - When sync_period and sync_refresh_period are 0, send sync only - for state changes or only once when pkts matches sync_threshold - -sync_refresh_period - UNSIGNED INTEGER - default 0 - - In seconds, difference in reported connection timer that triggers - new sync message. It can be used to avoid sync messages for the - specified period (or half of the connection timeout if it is lower) - if connection state is not changed since last sync. - - This is useful for normal connections with high traffic to reduce - sync rate. Additionally, retry sync_retries times with period of - sync_refresh_period/8. - -sync_retries - INTEGER - default 0 - - Defines sync retries with period of sync_refresh_period/8. Useful - to protect against loss of sync messages. The range of the - sync_retries is from 0 to 3. - -sync_qlen_max - UNSIGNED LONG - - Hard limit for queued sync messages that are not sent yet. It - defaults to 1/32 of the memory pages but actually represents - number of messages. It will protect us from allocating large - parts of memory when the sending rate is lower than the queuing - rate. - -sync_sock_size - INTEGER - default 0 - - Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. - Default value is 0 (preserve system defaults). - -sync_ports - INTEGER - default 1 - - The number of threads that master and backup servers can use for - sync traffic. Every thread will use single UDP port, thread 0 will - use the default port 8848 while last thread will use port - 8848+sync_ports-1. - -snat_reroute - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - If enabled, recalculate the route of SNATed packets from - realservers so that they are routed as if they originate from the - director. Otherwise they are routed as if they are forwarded by the - director. 
- - If policy routing is in effect then it is possible that the route - of a packet originating from a director is routed differently to a - packet being forwarded by the director. - - If policy routing is not in effect then the recalculated route will - always be the same as the original route so it is an optimisation - to disable snat_reroute and avoid the recalculation. - -sync_persist_mode - INTEGER - default 0 - - Controls the synchronisation of connections when using persistence - - 0: All types of connections are synchronised - 1: Attempt to reduce the synchronisation traffic depending on - the connection type. For persistent services avoid synchronisation - for normal connections, do it only for persistence templates. - In such case, for TCP and SCTP it may need enabling sloppy_tcp and - sloppy_sctp flags on backup servers. For non-persistent services - such optimization is not applied, mode 0 is assumed. - -sync_version - INTEGER - default 1 - - The version of the synchronisation protocol used when sending - synchronisation messages. - - 0 selects the original synchronisation protocol (version 0). This - should be used when sending synchronisation messages to a legacy - system that only understands the original synchronisation protocol. - - 1 selects the current synchronisation protocol (version 1). This - should be used where possible. - - Kernels with this sync_version entry are able to receive messages - of both version 1 and version 2 of the synchronisation protocol. 
diff --git a/MAINTAINERS b/MAINTAINERS index df5e4ccc1ccb..3a5f52a3c055 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8934,7 +8934,7 @@ L: lvs-devel@vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git -F: Documentation/networking/ipvs-sysctl.txt +F: Documentation/networking/ipvs-sysctl.rst F: include/net/ip_vs.h F: include/uapi/linux/ip_vs.h F: net/netfilter/ipvs/ -- cgit From b9dd2bea2245dd8ba4f68e801af93e4b38bfe6b0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:53 +0200 Subject: docs: networking: convert kcm.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/kcm.rst | 290 +++++++++++++++++++++++++++++++++++++ Documentation/networking/kcm.txt | 285 ------------------------------------ 3 files changed, 291 insertions(+), 285 deletions(-) create mode 100644 Documentation/networking/kcm.rst delete mode 100644 Documentation/networking/kcm.txt diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index bbd4e0041457..e1ff08b94d90 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -73,6 +73,7 @@ Contents: ipv6 ipvlan ipvs-sysctl + kcm .. only:: subproject and html diff --git a/Documentation/networking/kcm.rst b/Documentation/networking/kcm.rst new file mode 100644 index 000000000000..db0f5560ac1c --- /dev/null +++ b/Documentation/networking/kcm.rst @@ -0,0 +1,290 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============================= +Kernel Connection Multiplexor +============================= + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below:: + + +------------+ +------------+ +------------+ +------------+ + | KCM socket | | KCM socket | | KCM socket | | KCM socket | + +------------+ +------------+ +------------+ +------------+ + | | | | + +-----------+ | | +----------+ + | | | | + +----------------------------------+ + | Multiplexor | + +----------------------------------+ + | | | | | + +---------+ | | | ------------+ + | | | | | + +----------+ +----------+ +----------+ +----------+ +----------+ + | Psock | | Psock | | Psock | | Psock | | Psock | + +----------+ +----------+ +----------+ +----------+ +----------+ + | | | | | + +----------+ +----------+ +----------+ +----------+ +----------+ + | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | + +----------+ +----------+ +----------+ +----------+ +----------+ + +KCM sockets +=========== + +The KCM sockets provide the user interface to the multiplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +=========== + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. +Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks +==================== + +TCP sockets may be bound to a KCM multiplexor. 
A Psock structure is allocated +for each bound TCP socket; this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics +======================== + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (including sendmmsg and recvmmsg) +can be used to send and receive messages on the KCM socket. + +Socket types +============ + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +------------------- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +--------------------- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket.
When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). + +KCM limits the maximum receive message size to be the size of the receive +socket buffer on the attached TCP socket (the socket buffer size can be set by +SO_RCVBUF). If the length of a new message reported by the BPF program is +greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP +socket. The BPF program may also enforce a maximum message size and report an +error when it is exceeded. + +A timeout may be set for assembling messages on a receive socket. The timeout +value is taken from the receive timeout of the attached TCP socket (this is set +by SO_RCVTIMEO). If the timer expires before assembly is complete an error +(ETIMEDOUT) is posted on the socket. + +User interface +============== + +Creating a multiplexor +---------------------- + +A new multiplexor and initial KCM socket are created by a socket call:: + + socket(AF_KCM, type, protocol) + +- type is either SOCK_DGRAM or SOCK_SEQPACKET +- protocol is KCMPROTO_CONNECTED + +Cloning KCM sockets +------------------- + +After the first KCM socket is created using the socket call as described +above, additional sockets for the multiplexor can be created by cloning +a KCM socket. This is accomplished by an ioctl on a KCM socket:: + + /* From linux/kcm.h */ + struct kcm_clone { + int fd; + }; + + struct kcm_clone info; + + memset(&info, 0, sizeof(info)); + + err = ioctl(kcmfd, SIOCKCMCLONE, &info); + + if (!err) + newkcmfd = info.fd; + +Attach transport sockets +------------------------ + +Attaching of transport sockets to a multiplexor is performed by calling an +ioctl on a KCM socket for the multiplexor.
e.g.:: + + /* From linux/kcm.h */ + struct kcm_attach { + int fd; + int bpf_fd; + }; + + struct kcm_attach info; + + memset(&info, 0, sizeof(info)); + + info.fd = tcpfd; + info.bpf_fd = bpf_prog_fd; + + ioctl(kcmfd, SIOCKCMATTACH, &info); + +The kcm_attach structure contains: + + - fd: file descriptor for TCP socket being attached + - bpf_fd: file descriptor for compiled BPF program downloaded + +Unattach transport sockets +-------------------------- + +Unattaching a transport socket from a multiplexor is straightforward. An +"unattach" ioctl is done with the kcm_unattach structure as the argument:: + + /* From linux/kcm.h */ + struct kcm_unattach { + int fd; + }; + + struct kcm_unattach info; + + memset(&info, 0, sizeof(info)); + + info.fd = cfd; + + ioctl(fd, SIOCKCMUNATTACH, &info); + +Disabling receive on KCM socket +------------------------------- + +A setsockopt is used to disable or enable receiving on a KCM socket. +When receive is disabled, any pending messages in the socket's +receive buffer are moved to other sockets. This feature is useful +if an application thread knows that it will be doing a lot of +work on a request and won't be able to service new messages for a +while. Example use:: + + int val = 1; + + setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)); + +BPF programs for message delineation +------------------------------------ + +BPF programs can be compiled using the BPF LLVM backend. For example, +the BPF program for parsing Thrift is:: + + #include "bpf.h" /* for __sk_buff */ + #include "bpf_helpers.h" /* for load_word intrinsic */ + + SEC("socket_kcm") + int bpf_prog1(struct __sk_buff *skb) + { + return load_word(skb, 0) + 4; + } + + char _license[] SEC("license") = "GPL"; + +Use in applications +=================== + +KCM accelerates application layer protocols. Specifically, it allows +applications to use a message based interface for sending and receiving +messages.
The kernel provides necessary assurances that messages are sent +and received atomically. This relieves much of the burden applications have +in mapping a message based protocol onto the TCP stream. KCM also makes +application layer messages a unit of work in the kernel for the purposes of +steering and scheduling, which in turn allows a simpler networking model in +multithreaded applications. + +Configurations +-------------- + +In an Nx1 configuration, KCM logically provides multiple socket handles +to the same TCP connection. This allows parallelism between I/O +operations on the TCP socket (for instance copyin and copyout of data is +parallelized). In an application, a KCM socket can be opened for each +processing thread and inserted into the epoll (similar to how SO_REUSEPORT +is used to allow multiple listener sockets on the same port). + +In an MxN configuration, multiple connections are established to the +same destination. These are used for simple load balancing. + +Message batching +---------------- + +The primary purpose of KCM is load balancing between KCM sockets and hence +threads in a nominal use case. Perfect load balancing, that is steering +each received message to a different KCM socket or steering each sent +message to a different TCP socket, can negatively impact performance +since this doesn't allow for affinities to be established. Balancing +based on groups, or batches of messages, can be beneficial for performance. + +On transmit, there are three ways an application can batch (pipeline) +messages on a KCM socket. + + 1) Send multiple messages in a single sendmmsg. + 2) Send a group of messages each with a sendmsg call, where all messages + except the last have MSG_BATCH in the flags of sendmsg call. + 3) Create "super message" composed of multiple messages and send this + with a single sendmsg. + +On receive, the KCM module attempts to queue messages received on the +same KCM socket during each TCP ready callback.
The targeted KCM socket +changes at each receive ready callback on the KCM socket. The application +does not need to configure this. + +Error handling +-------------- + +An application should include a thread to monitor errors raised on +the TCP connection. Normally, this will be done by placing each +TCP socket attached to a KCM multiplexor in epoll set for POLLERR +event. If an error occurs on an attached TCP socket, KCM sets an EPIPE +on the socket thus waking up the application thread. When the application +sees the error (which may just be a disconnect) it should unattach the +socket from KCM and then close it. It is assumed that once an error is +posted on the TCP socket the data stream is unrecoverable (i.e. an error +may have occurred in the middle of receiving a message). + +TCP connection monitoring +------------------------- + +In KCM there is no means to correlate a message to the TCP socket that +was used to send or receive the message (except in the case there is +only one attached TCP socket). However, the application does retain +an open file descriptor to the socket so it will be able to get statistics +from the socket which can be used in detecting issues (such as high +retransmissions on the socket). diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt deleted file mode 100644 index b773a5278ac4..000000000000 --- a/Documentation/networking/kcm.txt +++ /dev/null @@ -1,285 +0,0 @@ -Kernel Connection Multiplexor ------------------------------ - -Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based -interface over TCP for generic application protocols. With KCM an application -can efficiently send and receive application protocol messages over TCP using -datagram sockets. 
-
-KCM implements an NxM multiplexor in the kernel as diagrammed below:
-
-+------------+   +------------+   +------------+   +------------+
-| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
-+------------+   +------------+   +------------+   +------------+
-      |                 |               |                |
-      +-----------+     |               |     +----------+
-                  |     |               |     |
-               +----------------------------------+
-               |           Multiplexor            |
-               +----------------------------------+
-                 |   |           |           |  |
-       +---------+   |           |           |  ------------+
-       |             |           |           |              |
-+----------+  +----------+  +----------+  +----------+ +----------+
-|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
-+----------+  +----------+  +----------+  +----------+ +----------+
-      |              |           |            |             |
-+----------+  +----------+  +----------+  +----------+ +----------+
-| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
-+----------+  +----------+  +----------+  +----------+ +----------+
-
-KCM sockets
------------
-
-The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
-bound to a multiplexor are considered to have equivalent function, and I/O
-operations in different sockets may be done in parallel without the need for
-synchronization between threads in userspace.
-
-Multiplexor
------------
-
-The multiplexor provides the message steering. In the transmit path, messages
-written on a KCM socket are sent atomically on an appropriate TCP socket.
-Similarly, in the receive path, messages are constructed on each TCP socket
-(Psock) and complete messages are steered to a KCM socket.
-
-TCP sockets & Psocks
---------------------
-
-TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
-for each bound TCP socket, this structure holds the state for constructing
-messages on receive as well as other connection specific information for KCM.
-
-Connected mode semantics
-------------------------
-
-Each multiplexor assumes that all attached TCP connections are to the same
-destination and can use the different connections for load balancing when
-transmitting.
The normal send and recv calls (include sendmmsg and recvmmsg)
-can be used to send and receive messages from the KCM socket.
-
-Socket types
-------------
-
-KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
-
-Message delineation
--------------------
-
-Messages are sent over a TCP stream with some application protocol message
-format that typically includes a header which frames the messages. The length
-of a received message can be deduced from the application protocol header
-(often just a simple length field).
-
-A TCP stream must be parsed to determine message boundaries. Berkeley Packet
-Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
-BPF program must be specified. The program is called at the start of receiving
-a new message and is given an skbuff that contains the bytes received so far.
-It parses the message header and returns the length of the message. Given this
-information, KCM will construct the message of the stated length and deliver it
-to a KCM socket.
-
-TCP socket management
----------------------
-
-When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and
-write space available (POLLOUT) events are handled by the multiplexor. If there
-is a state change (disconnection) or other error on a TCP socket, an error is
-posted on the TCP socket so that a POLLERR event happens and KCM discontinues
-using the socket. When the application gets the error notification for a
-TCP socket, it should unattach the socket from KCM and then handle the error
-condition (the typical response is to close the socket and create a new
-connection if necessary).
-
-KCM limits the maximum receive message size to be the size of the receive
-socket buffer on the attached TCP socket (the socket buffer size can be set by
-SO_RCVBUF). If the length of a new message reported by the BPF program is
-greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
-socket.
The BPF program may also enforce a maximum messages size and report an
-error when it is exceeded.
-
-A timeout may be set for assembling messages on a receive socket. The timeout
-value is taken from the receive timeout of the attached TCP socket (this is set
-by SO_RCVTIMEO). If the timer expires before assembly is complete an error
-(ETIMEDOUT) is posted on the socket.
-
-User interface
-==============
-
-Creating a multiplexor
-----------------------
-
-A new multiplexor and initial KCM socket is created by a socket call:
-
-  socket(AF_KCM, type, protocol)
-
-  - type is either SOCK_DGRAM or SOCK_SEQPACKET
-  - protocol is KCMPROTO_CONNECTED
-
-Cloning KCM sockets
--------------------
-
-After the first KCM socket is created using the socket call as described
-above, additional sockets for the multiplexor can be created by cloning
-a KCM socket. This is accomplished by an ioctl on a KCM socket:
-
-  /* From linux/kcm.h */
-  struct kcm_clone {
-        int fd;
-  };
-
-  struct kcm_clone info;
-
-  memset(&info, 0, sizeof(info));
-
-  err = ioctl(kcmfd, SIOCKCMCLONE, &info);
-
-  if (!err)
-    newkcmfd = info.fd;
-
-Attach transport sockets
-------------------------
-
-Attaching of transport sockets to a multiplexor is performed by calling an
-ioctl on a KCM socket for the multiplexor. e.g.:
-
-  /* From linux/kcm.h */
-  struct kcm_attach {
-        int fd;
-        int bpf_fd;
-  };
-
-  struct kcm_attach info;
-
-  memset(&info, 0, sizeof(info));
-
-  info.fd = tcpfd;
-  info.bpf_fd = bpf_prog_fd;
-
-  ioctl(kcmfd, SIOCKCMATTACH, &info);
-
-The kcm_attach structure contains:
-  fd: file descriptor for TCP socket being attached
-  bpf_prog_fd: file descriptor for compiled BPF program downloaded
-
-Unattach transport sockets
---------------------------
-
-Unattaching a transport socket from a multiplexor is straightforward.
An
-"unattach" ioctl is done with the kcm_unattach structure as the argument:
-
-  /* From linux/kcm.h */
-  struct kcm_unattach {
-        int fd;
-  };
-
-  struct kcm_unattach info;
-
-  memset(&info, 0, sizeof(info));
-
-  info.fd = cfd;
-
-  ioctl(fd, SIOCKCMUNATTACH, &info);
-
-Disabling receive on KCM socket
--------------------------------
-
-A setsockopt is used to disable or enable receiving on a KCM socket.
-When receive is disabled, any pending messages in the socket's
-receive buffer are moved to other sockets. This feature is useful
-if an application thread knows that it will be doing a lot of
-work on a request and won't be able to service new messages for a
-while. Example use:
-
-  int val = 1;
-
-  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
-
-BFP programs for message delineation
-------------------------------------
-
-BPF programs can be compiled using the BPF LLVM backend. For example,
-the BPF program for parsing Thrift is:
-
-  #include "bpf.h" /* for __sk_buff */
-  #include "bpf_helpers.h" /* for load_word intrinsic */
-
-  SEC("socket_kcm")
-  int bpf_prog1(struct __sk_buff *skb)
-  {
-       return load_word(skb, 0) + 4;
-  }
-
-  char _license[] SEC("license") = "GPL";
-
-Use in applications
-===================
-
-KCM accelerates application layer protocols. Specifically, it allows
-applications to use a message based interface for sending and receiving
-messages. The kernel provides necessary assurances that messages are sent
-and received atomically. This relieves much of the burden applications have
-in mapping a message based protocol onto the TCP stream. KCM also make
-application layer messages a unit of work in the kernel for the purposes of
-steering and scheduling, which in turn allows a simpler networking model in
-multithreaded applications.
-
-Configurations
---------------
-
-In an Nx1 configuration, KCM logically provides multiple socket handles
-to the same TCP connection.
This allows parallelism between in I/O
-operations on the TCP socket (for instance copyin and copyout of data is
-parallelized). In an application, a KCM socket can be opened for each
-processing thread and inserted into the epoll (similar to how SO_REUSEPORT
-is used to allow multiple listener sockets on the same port).
-
-In a MxN configuration, multiple connections are established to the
-same destination. These are used for simple load balancing.
-
-Message batching
-----------------
-
-The primary purpose of KCM is load balancing between KCM sockets and hence
-threads in a nominal use case. Perfect load balancing, that is steering
-each received message to a different KCM socket or steering each sent
-message to a different TCP socket, can negatively impact performance
-since this doesn't allow for affinities to be established. Balancing
-based on groups, or batches of messages, can be beneficial for performance.
-
-On transmit, there are three ways an application can batch (pipeline)
-messages on a KCM socket.
-  1) Send multiple messages in a single sendmmsg.
-  2) Send a group of messages each with a sendmsg call, where all messages
-     except the last have MSG_BATCH in the flags of sendmsg call.
-  3) Create "super message" composed of multiple messages and send this
-     with a single sendmsg.
-
-On receive, the KCM module attempts to queue messages received on the
-same KCM socket during each TCP ready callback. The targeted KCM socket
-changes at each receive ready callback on the KCM socket. The application
-does not need to configure this.
-
-Error handling
---------------
-
-An application should include a thread to monitor errors raised on
-the TCP connection. Normally, this will be done by placing each
-TCP socket attached to a KCM multiplexor in epoll set for POLLERR
-event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
-on the socket thus waking up the application thread.
When the application
-sees the error (which may just be a disconnect) it should unattach the
-socket from KCM and then close it. It is assumed that once an error is
-posted on the TCP socket the data stream is unrecoverable (i.e. an error
-may have occurred in the middle of receiving a message).
-
-TCP connection monitoring
--------------------------
-
-In KCM there is no means to correlate a message to the TCP socket that
-was used to send or receive the message (except in the case there is
-only one attached TCP socket). However, the application does retain
-an open file descriptor to the socket so it will be able to get statistics
-from the socket which can be used in detecting issues (such as high
-retransmissions on the socket).
--
cgit
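
The message delineation rule in the KCM text of this patch (a BPF program
that returns load_word(skb, 0) + 4 for Thrift) can be illustrated in plain
user-space C. The following is only a sketch, assuming a 4-byte length
prefix read in network byte order; the helper names framed_msg_len and
count_complete_msgs are hypothetical and are not part of KCM or linux/kcm.h,
and returning -1 for a short header is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a KCM API: computes the same value as the
 * Thrift delineation program, i.e. the 32-bit big-endian word at
 * offset 0 plus the 4 header bytes. Returns -1 when fewer than 4
 * bytes are available (a convention of this sketch only).
 */
int64_t framed_msg_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	assert(buf != NULL);
	if (avail < 4)
		return -1;

	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	return (int64_t)payload + 4;
}

/*
 * Hypothetical helper: walks a byte stream and counts how many
 * complete length-prefixed messages it contains. A trailing partial
 * message is not counted.
 */
size_t count_complete_msgs(const uint8_t *buf, size_t len)
{
	size_t off = 0, n = 0;

	while (off < len) {
		int64_t mlen = framed_msg_len(buf + off, len - off);

		/* Stop on a short header or an incomplete message body. */
		if (mlen < 0 || (uint64_t)mlen > len - off)
			break;

		off += (size_t)mlen;
		n++;
	}

	return n;
}
```

In KCM itself this parsing is done by the attached BPF program and the
kernel assembles the message; the sketch only shows how a length field in
the header determines the message boundaries within a TCP byte stream.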