blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message	Author	Files	Lines
2014-12-08	Merge branch 'tipc-next'	David S. Miller	5	-158/+157
	Ying Xue says: ==================== tipc: convert name table read-write lock to RCU Currently the TIPC name table is statically allocated and protected with a read-write lock. To improve the performance of TIPC name table lookups, we are going to use RCU to protect the name table, so that concurrent lookups on the read side become lockless. However, before the conversion can be made, two things must be done first: - change the allocation of the name table from static to dynamic - fix several incorrect locking policy issues ==================== Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: convert name table read-write lock to RCU	Ying Xue	4	-59/+69
	Convert the TIPC name table read-write lock to RCU. After this change, a new spinlock protects the name table on the write side, while RCU is used on the read side. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: remove unnecessary INIT_LIST_HEAD	Ying Xue	2	-3/+0
	When a list_head variable is a new entry about to be added to a list, there is no need to initialize it with INIT_LIST_HEAD(). Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
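The reasoning can be checked with a small user-space sketch. It re-implements a simplified `list_head`, `INIT_LIST_HEAD()` and `list_add()` modeled on the kernel's `include/linux/list.h`; it is not the kernel header itself. Inserting a node assigns both of its `next` and `prev` pointers, so anything `INIT_LIST_HEAD()` stored there beforehand is simply overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified copy of the kernel's doubly linked list node (illustrative). */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list points to itself. Only list heads need this. */
static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Insert @new between two known consecutive entries. Both pointers of
 * @new are written here, so their previous contents do not matter. */
static void __list_add(struct list_head *new, struct list_head *prev,
		       struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* Add @new right after @head, as the kernel's list_add() does. */
static void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}
```

For example, a node whose pointers are still NULL is correctly linked after `list_add(&entry, &head)`. Only the list head itself needs `INIT_LIST_HEAD()`.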
2014-12-08	tipc: simplify relationship between name table lock and node lock	Ying Xue	1	-9/+10
	When a TIPC name sequence is published, the name table lock is released before the name sequence buffer is delivered to remote nodes through the underlying unicast links. However, when a name sequence is withdrawn, the name table lock is held until the transmission of the removal message has finished, so the node lock ends up nested inside the name table lock. To avoid this nesting, withdrawal should adopt the same locking policy as publication: the name table lock is released before the message is sent. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: any name table member must be protected under name table lock	Ying Xue	1	-1/+2
	As tipc_nametbl_lock protects the name_table structure, the lock must be held whenever any member of name_table is accessed. However, the lock is not taken when the local_publ_count member is read in tipc_nametbl_publish(); as a consequence, an inconsistent value of local_publ_count might be read. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
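Here is a minimal user-space sketch of the rule, using a POSIX mutex in place of `tipc_nametbl_lock`. The limit check on the publication counter and its increment sit in the same critical section as every other access to the table. The struct, the function and `MAX_PUBLICATIONS` are hypothetical stand-ins loosely modeled on `tipc_nametbl_publish()`. They are not TIPC's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_PUBLICATIONS 2 /* hypothetical limit for illustration */

struct name_table {
	pthread_mutex_t lock;          /* protects every member below */
	unsigned int local_publ_count;
};

/* Reserve a publication slot; returns false when the table is full.
 * Reading local_publ_count outside the lock could observe a value that
 * another thread is changing, so the check and the increment form one
 * critical section. */
static bool nametbl_publish(struct name_table *tbl)
{
	bool ok = false;

	pthread_mutex_lock(&tbl->lock);
	if (tbl->local_publ_count < MAX_PUBLICATIONS) {
		tbl->local_publ_count++;
		ok = true;
	}
	pthread_mutex_unlock(&tbl->lock);
	return ok;
}
```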
2014-12-08	tipc: ensure all name sequences are properly protected with its lock	Ying Xue	1	-21/+27
	TIPC internally maintains a name table that stores name sequences. The table is protected by a read-write lock, tipc_nametbl_lock, and each name sequence saved in the table is protected by its own private lock. When a name sequence is inserted into or removed from the table, its members might need to change, so normally both locks must be held while TIPC operates on the table. However, there are still several places where only tipc_nametbl_lock is held without properly obtaining the name sequence lock, which might corrupt the name sequence. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: ensure all name sequences are released when name table is stopped	Ying Xue	1	-4/+3
	As the TIPC subscriber server is terminated before the name table, no user depends on the subscription list of a name sequence when the name table is stopped. Therefore, all name sequences stored in the name table should be released whether or not their subscription lists are empty; otherwise, memory might leak. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: make name table allocated dynamically	Ying Xue	3	-65/+55
	The name table locking policy is going to be changed from read-write lock protection to RCU protection in upcoming commits. An essential precondition for that is converting the name table allocation from static to dynamic. Signed-off-by: Ying Xue <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	tipc: remove size variable from publ_list struct	Ying Xue	1	-12/+7
	The size variable was introduced in the publ_list struct to calculate exactly the SKB buffer sizes needed by publications when all publications in the name table are delivered in bulk in named_distribute(). But if the publication SKB buffer size is assumed to be the MTU, the size variable in the publ_list struct can be eliminated entirely, at the cost of wasting a little memory in the last SKB. Signed-off-by: Ying Xue <[email protected]> Signed-off-by: Tero Aho <[email protected]> Reviewed-by: Erik Hugne <[email protected]> Reviewed-by: Jon Maloy <[email protected]> Tested-by: Erik Hugne <[email protected]> Signed-off-by: David S. Miller <[email protected]>
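A little arithmetic makes the trade-off concrete. In this sketch the item size and MTU are hypothetical parameters. Once every buffer is assumed to be MTU-sized, the number of buffers follows from the publication count alone, and only the final buffer may carry unused space.

```c
#include <assert.h>
#include <stddef.h>

/* Number of MTU-sized buffers needed to carry @count fixed-size items. */
static size_t buffers_needed(size_t count, size_t item_size, size_t mtu)
{
	size_t per_buf = mtu / item_size;  /* whole items per buffer */

	assert(per_buf > 0);
	return (count + per_buf - 1) / per_buf;  /* round up */
}

/* Bytes of the final MTU-sized buffer not occupied by items. */
static size_t last_buffer_slack(size_t count, size_t item_size, size_t mtu)
{
	size_t per_buf = mtu / item_size;
	size_t in_last;

	if (count == 0)
		return 0;
	in_last = count % per_buf;
	if (in_last == 0)
		in_last = per_buf;  /* last buffer is completely filled */
	return mtu - in_last * item_size;
}
```

With 20-byte items and a 1500-byte MTU, 100 items need two buffers, and the second one leaves 1000 bytes unused.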
2014-12-08	udp: Neaten and reduce size of compute_score functions	Joe Perches	2	-99/+125
	The compute_score functions are a bit difficult to read. Neaten them a bit to reduce object sizes and make them a bit more intelligible. Return early to avoid indentation and avoid unnecessary initializations. (allyesconfig, but w/ -O2 and no profiling)
	$ size net/ipv[46]/udp.o.*
	   text    data     bss     dec     hex filename
	  28680    1184      25   29889    74c1 net/ipv4/udp.o.new
	  28756    1184      25   29965    750d net/ipv4/udp.o.old
	  17600    1010       2   18612    48b4 net/ipv6/udp.o.new
	  17632    1010       2   18644    48d4 net/ipv6/udp.o.old
	Signed-off-by: Joe Perches <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
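The early-return style can be sketched in user space. The structure, its fields and the score weights below are hypothetical and only loosely modeled on UDP socket lookup. A socket that cannot match returns -1 immediately, and each optional binding that does match adds to the score. As a result, every check reads top to bottom without nested blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical socket binding; 0 means "wildcard / not bound". */
struct sock_sketch {
	uint16_t port;       /* local port, must match */
	uint32_t rcv_saddr;  /* bound local address */
	uint32_t daddr;      /* connected remote address */
	int bound_dev_if;    /* bound interface index */
};

/* Higher score = more specific match; -1 = no match.
 * @saddr/@daddr are the packet's source/destination addresses. */
static int compute_score(const struct sock_sketch *sk, uint16_t hnum,
			 uint32_t saddr, uint32_t daddr, int dif)
{
	int score = 1;

	if (sk->port != hnum)
		return -1;

	if (sk->rcv_saddr) {
		if (sk->rcv_saddr != daddr)
			return -1;
		score += 4;
	}
	if (sk->daddr) {
		if (sk->daddr != saddr)
			return -1;
		score += 4;
	}
	if (sk->bound_dev_if) {
		if (sk->bound_dev_if != dif)
			return -1;
		score += 4;
	}
	return score;
}
```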
2014-12-08	net: bcmgenet: enable driver to work without a device tree	Petri Gynther	4	-33/+143
	Modify bcmgenet driver so that it can be used on Broadcom 7xxx MIPS-based STB platforms without a device tree. Signed-off-by: Petri Gynther <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	copy_from_iter_nocache()	Al Viro	2	-0/+22
	BTW, do we want memcpy_nocache()? Signed-off-by: Al Viro <[email protected]>
2014-12-08	new helper: iov_iter_kvec()	Al Viro	2	-0/+15
	initialization of kvec-backed iov_iter Signed-off-by: Al Viro <[email protected]>
2014-12-08	csum_and_copy_..._iter()	Al Viro	2	-0/+91
	Signed-off-by: Al Viro <[email protected]>
2014-12-08	hyperv: Add support for vNIC hot removal	Haiyang Zhang	4	-0/+10
	This patch adds proper handling of the vNIC hot removal event, which includes a rescind-channel-offer message from the host that triggers vNIC close and removal. In this case, notifying the host during close and removal is unnecessary because the channel has been rescinded. This patch suppresses these unnecessary messages and lets the vNIC removal process complete normally. Signed-off-by: Haiyang Zhang <[email protected]> Reviewed-by: K. Y. Srinivasan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	test: bpf: expand DIV_KX to DIV_MOD_KX	Denis Kirjanov	1	-2/+8
	Expand DIV_KX to use BPF_MOD operation in the DIV_KX bpf 'classic' test. CC: Alexei Starovoitov <[email protected]> Signed-off-by: Denis Kirjanov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	Merge branch 'tstamp-next'	David S. Miller	5	-25/+146
	Willem de Bruijn says: ==================== timestamping updates The main goal for this patchset is to allow correlating timestamps with the egress interface. Also introduce a warning, as discussed previously, and update the tests to verify the new feature. ==================== Signed-off-by: David S. Miller <[email protected]>
2014-12-08	net-timestamp: expand documentation and test	Willem de Bruijn	2	-20/+93
	Documentation: expand explanation of timestamp counter
	Test:
	new: flag -I requests and prints PKTINFO
	new: flag -x prints payload (possibly truncated)
	fix: remove pretty print that breaks common flag '-l 1'
	Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	net-timestamp: allow reading recv cmsg on errqueue with origin tstamp	Willem de Bruijn	4	-6/+52
	Allow reading of timestamps and cmsg at the same time on all relevant socket families. One use is to correlate timestamps with the egress device, by asking for cmsg IP_PKTINFO. On AF_INET sockets, call the relevant function (ip_cmsg_recv). To avoid changing legacy expectations, only do so if the caller sets a new timestamping flag, SOF_TIMESTAMPING_OPT_CMSG. On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already returned for all origins; the only change is to set ifindex, which is not initialized for all error origins. In both cases, only generate the pktinfo message if an ifindex is known. This is not the case for ACK timestamps. The difference between the protocol families is probably a historical accident resulting from the different conditions for generating cmsg in the relevant ip(v6)_recv_error function: ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) { ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) { At one time, this was the same test, except for the ICMP/ICMP6 distinction. This is no longer true. Signed-off-by: Willem de Bruijn <[email protected]> ---- Changes v1 -> v2: large rewrite - integrate with existing pktinfo cmsg generation code - on ipv4: only send with the new flag, to maintain legacy behavior - on ipv6: send at most a single pktinfo cmsg - on ipv6: initialize fields if not yet initialized The recv cmsg interfaces are also relevant to the discussion of whether looping packet headers is problematic. For v6, cmsgs that identify many headers are already returned. This patch expands that to v4. If this sounds reasonable, I will follow up with patches to 1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY (http://patchwork.ozlabs.org/patch/366967/) 2. add a sysctl to conditionally drop all timestamps that have payload or cmsg from users without CAP_NET_RAW. Signed-off-by: David S. Miller <[email protected]>
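Here is a user-space sketch of how a caller might use the new flag. It assumes system headers that define `SOF_TIMESTAMPING_OPT_CMSG` in `<linux/net_tstamp.h>`. The caller enables `IP_PKTINFO` together with `SO_TIMESTAMPING`, then scans the control messages returned by `recvmsg(..., MSG_ERRQUEUE)` for the interface index. The helper names are hypothetical. Reading the error queue and extracting the timestamp itself are left out for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Request software TX timestamps plus PKTINFO on the error queue.
 * Without SOF_TIMESTAMPING_OPT_CMSG, AF_INET sockets keep the legacy
 * behavior and do not attach IP_PKTINFO to looped timestamps. */
static int enable_tstamp_pktinfo(int fd)
{
	int on = 1;
	int flags = SOF_TIMESTAMPING_TX_SOFTWARE |
		    SOF_TIMESTAMPING_SOFTWARE |
		    SOF_TIMESTAMPING_OPT_CMSG;

	if (setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}

/* Walk the control messages of an error-queue recvmsg() result and
 * return the ifindex from IP_PKTINFO, or -1 if none was attached
 * (e.g. for ACK timestamps, where no ifindex is known). */
static int pktinfo_ifindex(struct msghdr *msg)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
			struct in_pktinfo info;

			memcpy(&info, CMSG_DATA(cm), sizeof(info));
			return info.ipi_ifindex;
		}
	}
	return -1;
}
```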
2014-12-08	ipv4: warn once on passing AF_INET6 socket to ip_recv_error	Willem de Bruijn	1	-0/+2
	One line change, in response to catching an occurrence of this bug. See also fix f4713a3dfad0 ("net-timestamp: make tcp_recvmsg call ...") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-08	iov_iter.c: handle ITER_KVEC directly	Al Viro	2	-13/+70
	... without bothering with copy_..._user() Signed-off-by: Al Viro <[email protected]>
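The idea can be sketched in user space with a hypothetical `kvec_iter` type rather than the kernel's `struct iov_iter`. Every segment of a kvec-backed iterator points at ordinary kernel memory. Copying out of it can therefore use plain `memcpy()` segment by segment instead of the `copy_from_user()` family. The `struct kvec` layout matches the kernel's. The iterator fields and helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kvec {
	void *iov_base;
	size_t iov_len;
};

/* Hypothetical, simplified stand-in for a kvec-backed iov_iter. */
struct kvec_iter {
	const struct kvec *kvec;  /* current segment */
	unsigned long nr_segs;    /* segments left, including current */
	size_t iov_offset;        /* offset into current segment */
	size_t count;             /* bytes left in total */
};

/* Initialize: the total count is the sum of all segment lengths. */
static void kvec_iter_init(struct kvec_iter *i, const struct kvec *kvec,
			   unsigned long nr_segs)
{
	unsigned long s;

	i->kvec = kvec;
	i->nr_segs = nr_segs;
	i->iov_offset = 0;
	i->count = 0;
	for (s = 0; s < nr_segs; s++)
		i->count += kvec[s].iov_len;
}

/* Copy up to @bytes into @to and advance the iterator. The segments are
 * ordinary memory, so memcpy() is all that is needed. */
static size_t kvec_copy_from_iter(void *to, size_t bytes, struct kvec_iter *i)
{
	size_t done = 0;

	if (bytes > i->count)
		bytes = i->count;
	while (done < bytes) {
		size_t avail = i->kvec->iov_len - i->iov_offset;
		size_t n = bytes - done < avail ? bytes - done : avail;

		memcpy((char *)to + done,
		       (char *)i->kvec->iov_base + i->iov_offset, n);
		done += n;
		i->iov_offset += n;
		if (i->iov_offset == i->kvec->iov_len) {  /* segment used up */
			i->kvec++;
			i->nr_segs--;
			i->iov_offset = 0;
		}
	}
	i->count -= done;
	return done;
}
```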
2014-12-08	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless	John W. Linville	11	-34/+59
2014-12-07	can: flexcan: Consolidate and unify state change handling	Andri Yngvason	1	-83/+18
	Replace the error state change handling with the new mechanism. Signed-off-by: Andri Yngvason <[email protected]> Acked-by: Wolfgang Grandegger <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: mscan: Consolidate and unify state change handling	Andri Yngvason	1	-34/+14
	Replace the error state change handling with the new mechanism. Signed-off-by: Andri Yngvason <[email protected]> Acked-by: Wolfgang Grandegger <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: sja1000: Consolidate and unify state change handling	Andri Yngvason	1	-28/+23
	Replace the error state change handling with the new mechanism. Signed-off-by: Andri Yngvason <[email protected]> Acked-by: Wolfgang Grandegger <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: dev: Consolidate and unify state change handling	Andri Yngvason	3	-0/+82
	The handling of CAN error states differs between platforms. This is an attempt to correct that problem. I've moved this handling into a generic function for changing the error state, which ensures that error state changes are handled the same way everywhere (where this function is used). This new mechanism also adds reverse state transitioning in error frames, i.e. the user will be notified through the socket interface when the state goes down. Signed-off-by: Andri Yngvason <[email protected]> Acked-by: Wolfgang Grandegger <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
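The idea of a single shared transition function can be sketched in user space. The enum mirrors the ordering of the kernel's `enum can_state`, but the names and helper functions are assumptions for illustration. So are the counter thresholds: 96 for warning (a common controller default), 128 for error-passive, and 256 on the TX counter for bus-off. None of this is the code in `drivers/net/can/dev.c`. The worse of the TX and RX states wins, and a decrease is reported just like an increase.

```c
#include <assert.h>

/* Ordered like the kernel's enum can_state (least severe first). */
enum can_state_sketch {
	STATE_ERROR_ACTIVE,
	STATE_ERROR_WARNING,
	STATE_ERROR_PASSIVE,
	STATE_BUS_OFF,
};

/* Assumed thresholds; only the TX counter can drive the bus off. */
static enum can_state_sketch state_from_counter(unsigned int cnt, int is_tx)
{
	if (is_tx && cnt >= 256)
		return STATE_BUS_OFF;
	if (cnt >= 128)
		return STATE_ERROR_PASSIVE;
	if (cnt >= 96)
		return STATE_ERROR_WARNING;
	return STATE_ERROR_ACTIVE;
}

/* Shared by every driver: the new state is the worse of the TX and RX
 * states. Returns +1 if the state got worse, -1 if it improved (the
 * reverse transition that is now reported too), 0 if unchanged. */
static int change_state(enum can_state_sketch *cur, unsigned int txerr,
			unsigned int rxerr)
{
	enum can_state_sketch tx = state_from_counter(txerr, 1);
	enum can_state_sketch rx = state_from_counter(rxerr, 0);
	enum can_state_sketch new_state = tx > rx ? tx : rx;
	int dir = (new_state > *cur) - (new_state < *cur);

	*cur = new_state;
	return dir;
}
```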
2014-12-07	can: Enable -D__CHECK_ENDIAN__ for sparse by default	Marc Kleine-Budde	1	-1/+2
	This patch enables endian checking by default when running sparse via "make C=2" for example. Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: fix spelling errors	Jeremiah Mahler	4	-9/+9
	Fix various spelling errors in the comments of the CAN modules. Signed-off-by: Jeremiah Mahler <[email protected]> Acked-by: Oliver Hartkopp <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: slcan/vcan: eliminate banner[] variable, switch to pr_info()	Jeremiah Mahler	2	-9/+3
	Several CAN modules in drivers/net/can use a banner[] variable at the top, which defines a string that is used once during init. This string is also embedded with KERN_INFO, which makes it printk() specific. Improve the code by eliminating the banner[] variable and moving the string to where it is printed. Then switch from printk(KERN_INFO ...) to pr_info() for the lines that were changed. This patch is similar to [1], which was applied to net/can. [1]: https://lkml.org/lkml/2014/11/22/10 Signed-off-by: Jeremiah Mahler <[email protected]> Acked-by: Oliver Hartkopp <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-07	can: eliminate banner[] variable and switch to pr_info()	Jeremiah Mahler	3	-10/+3
	Several CAN modules use a design pattern with a banner[] variable at the top, which defines a string that is used once during init to print the banner. The string is also embedded with KERN_INFO, which makes it printk() specific. Improve the code by eliminating the banner[] variable and moving the string to where it is printed. Then switch from printk(KERN_INFO ...) to pr_info() for the lines that were changed. Signed-off-by: Jeremiah Mahler <[email protected]> Acked-by: Oliver Hartkopp <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]>
2014-12-06	i40e: Reduce stack in i40e_dbg_dump_desc	Joe Perches	1	-13/+17
	Reduce stack use by using kmemdup and not using a very large struct on stack. In function ‘i40e_dbg_dump_desc’: warning: the frame size of 8192 bytes is larger than 2048 bytes [-Wframe-larger-than=] Signed-off-by: Joe Perches <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Bump i40e version to 1.2.2 and i40evf version to 1.0.6	Catherine Sullivan	2	-3/+3
	Bump version. Change-ID: I4264e81dcfb57ec46a3ede54b0a6cb25b497d3cb Signed-off-by: Catherine Sullivan <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: get pf_id from HW rather than PCI function	Shannon Nelson	1	-12/+11
	Getting the pf_id from the function number was a good place to start, but when the PF was set up in passthru mode, the PCI bus/device/function was virtualized and the number in the VM differed from the number on bare metal. This caused HW configuration issues when the wrong pf_id was used to set up the HMC and other structures. The PF_FUNC_RID register has the real bus/device/function information as configured by the BIOS, so use that for a better number. This works in NPAR mode as well. Change-ID: I65e3dd6c97594890c2bad566b83cc670b1dae534 Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Greg Rose <[email protected]> Acked-by: Kevin Scott <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: increase ARQ size	Mitch Williams	1	-1/+1
	The ARQ needs to have at least as many entries as VFs, or the VFs will get errors from the FW when they send messages to the PF. Since we don't know how many VFs we'll end up with, just set up 128 descriptors. Change-ID: I04ae3d1c7faf09110eb782214e9c05aeb62a6c59 Signed-off-by: Mitch Williams <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Re-enable Main VSI loopback setting in the reset path	Anjali Singhai Jain	1	-0/+3
	There is an order in which this should happen. It turns out that FW will not let you change the Loopback setting of the VSI with update VSI prior to the VEB creation. Change-ID: I7614ddff8b4c37702930c02f16f8c346aaa64bd1 Signed-off-by: Anjali Singhai Jain <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Add new update VSI flow to accommodate FW fix with VSI Loopback mode	Anjali Singhai Jain	3	-4/+15
	All VSIs on a VEB should either have loopback enabled or disabled; a mixed mode is not supported for a VEB. Since our driver supports multiple VSIs per PF that need to talk to each other, make sure to enable loopback for the PF and FDIR VSI as well. Also, we now have to explicitly enable loopback mode; otherwise VSI creation fails for VMDq and VF VSIs. Change-ID: Ib68c3ea4aeb730ac9468f930610de456efbe5b20 Signed-off-by: Anjali Singhai Jain <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Increase reset delay	Kevin Scott	1	-1/+1
	Increase reset delay to ensure all internal caches are properly flushed in worst case scenario. Change-ID: I6f059a9e024fbf9ef1debd32497eed21369957fc Signed-off-by: Kevin Scott <[email protected]> Acked-by: Shannon Nelson <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40evf: make early init sequence even more robust	Mitch Williams	2	-13/+12
	When multiple VFs attempt to initialize simultaneously, the firmware may delay or drop messages. Make the init code more adept at handling these situations by a) reinitializing the admin queue if the firmware fails to process a request, and b) resending a request if the PF doesn't answer. Once the request has been sent again, the PF might end up getting both requests and send the configuration information to the driver twice. This will cause the VF to complain about receiving an unexpected message from the PF. Since this is not fatal, reduce the warning level of the log messages that are generated in response to this event. Change-ID: I9370a1a2fde2ad3934fa25ccfd0545edfbbb4805 Signed-off-by: Mitch Williams <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: fix netdev_stat macro definition	Shannon Nelson	1	-1/+2
	The old xxx_NETDEV_STAT() macro was defined long before the newer rtnl_link_stats64 came into being, and just never got updated. Since we're using rtnl_link_stats64 in other parts of the driver, we should use it here as well. We've just been lucky that the field definitions are the same sizes. Change-ID: I19fc71619905700235dcdf0d3c8153aec81d36de Signed-off-by: Shannon Nelson <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Define and use i40e_is_vf macro	Anjali Singhai Jain	4	-2/+6
	This patch is useful for future expansion when new VF MAC types get added. It helps with cleaning up VF driver flow. Change-ID: Ibe1eeb71262a3a40f24a1c5409436bdc3411da7f Signed-off-by: Anjali Singhai Jain <[email protected]> Acked-by: Shannon Nelson <[email protected]> Acked-by: Greg Rose <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Add a virtual channel op to config RSS	Anjali Singhai Jain	2	-0/+2
	Add the Virtual Channel OP event opcode for CONFIG_RSS, so that the Virtual Channel state machine can properly decipher status change events. Change-ID: I09939c7aa380147f60c49fd01ef2e27d0dc1c299 Signed-off-by: Anjali Singhai Jain <[email protected]> Acked-by: Mitch Williams <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: don't enable PTP support on more than one PF per port	Jacob Keller	2	-11/+25
	Resolve an issue related to images with multiple PFs per physical port. We cannot fully support 1588 PTP features, since only one port should control (i.e. write) the registers at a time; letting several PFs write them can cause their functionality to interfere. It may be possible to partially implement the API for only those features without side effects. However, at minimum this means non-controlling PFs lose Tx timestamps, frequency adjustment, and possibly SYSTIME adjustment. There may be further impact I did not discover. Since the API in the kernel expects these features to work, it is simpler and less dangerous to just disable PTP features on all PFs not identified as the controlling PF in PRTTSYN_CTL0.PF_ID. This change also removes the warning printed when the hwtstamp IOCTL is called on the wrong PF. This is actually meaningless now, since only one PF per port will support it. In addition, the ethtool get_ts_info IOCTL was updated so that only the controlling port will even indicate support (so as not to confuse users). The overall downside is complete loss of functionality on non-controlling PFs, versus the possible gain of partial support. The biggest factors in choosing this approach are simplicity and ensuring that the main PF will work. There could easily be other portions of the 1588 logic with side effects I am not aware of, and the reduced functionality that might be made available is significantly less useful. In addition, the API does not allow for proper indication of why particular features are not supported. These reasons are enough to decide on the simpler approach to resolving this issue. Change-ID: If4696bae686fc18aef6552b67dd417213d987c16 Signed-off-by: Jacob Keller <[email protected]> Tested-by: Jim Young <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Add description to misc and fd interrupts	Carolyn Wyborny	1	-4/+7
	This patch adds additional text description for base pf0 and flow director generated interrupts. Without this patch, these interrupts are difficult to distinguish per port on a multi-function device. Change-ID: I4662e1b38840757765a3fe63d90219d28e76bfab Signed-off-by: Carolyn Wyborny <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: allow various base numbers in debugfs aq commands	Shannon Nelson	1	-2/+2
	Use the 'i' rather than the more restrictive 'x' or 'd' in the aq_cmd arguments. This makes the user interface much more forgiving and user friendly. Change-ID: I5dcd57b9befc047e06b74cf1152a25a3fa9e1309 Signed-off-by: Shannon Nelson <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
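Assume the aq_cmd arguments are parsed with `sscanf()`-style conversions, as the 'i'/'x'/'d' wording suggests. Standard C then shows the difference. `%i` accepts decimal, `0x`-prefixed hexadecimal and `0`-prefixed octal input. By contrast, `%x` always reads hex, and `%d` stops at the `x` of `0x10`. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one integer argument with "%i", which picks the base from the
 * prefix: "16" is decimal, "0x10" is hex, "020" is octal. Returns 0 on
 * success, -1 if no number could be read. */
static int parse_arg(const char *s, int *val)
{
	return sscanf(s, "%i", val) == 1 ? 0 : -1;
}
```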
2014-12-06	i40e: remove useless debug noise	Shannon Nelson	2	-6/+0
	This message really doesn't give any useful information and ends up getting printed every service_task loop in the Linux driver, filling the logfile with noise when AQ tracing is enabled. This patch simply removes the noise. Change-ID: I30ad51e6b03c7ad12a7d9c102def0087db622df3 Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Mitch Williams <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-06	i40e: Remove unneeded break statement	Shannon Nelson	1	-2/+0
	This case is empty and the code it falls through to only breaks out, so remove the redundant break and let it fall through. Change-ID: I1b5ba9870d5245ca80bfca6e7f5f089e2eb8ccb0 Signed-off-by: Shannon Nelson <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]>
2014-12-05	Merge branch 'ebpf-next'	David S. Miller	31	-5/+987
	Alexei Starovoitov says: ==================== allow eBPF programs to be attached to sockets V1->V2: fixed comments in sample code to state clearly that packet data is accessed with LD_ABS instructions and not internal skb fields. Also replaced constants in: BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */), with: BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */), V1 cover: Introduce the BPF_PROG_TYPE_SOCKET_FILTER type of eBPF programs that can be attached to sockets with setsockopt(). Allow such programs to access maps via lookup/update/delete helpers. This feature was previewed by the bpf manpage in commit b4fc1a460f30 ("Merge branch 'bpf-next'"). Now it can actually run. The 1st patch adds LD_ABS/LD_IND instruction verification and the 2nd patch adds a new setsockopt() flag. Patches 3-6 are examples in assembler and in C. Though native eBPF programs are way more powerful than classic filters (attachable through a similar setsockopt() call), they don't have skb field accessors yet: fields like skb->pkt_type and skb->dev->ifindex are not accessible. There are several ways to achieve that, which will come in the next set of patches. So in this set, native eBPF programs can only read data from the packet and access maps. The most powerful example is sockex2_kern.c from patch 6, where ~200 lines of C are compiled into ~300 eBPF instructions. It shows how quite complex packet parsing can be done. The LLVM used to build the examples is at https://github.com/iovisor/llvm, which is a fork of LLVM trunk that I'm cleaning up for upstreaming. ==================== Signed-off-by: David S. Miller <[email protected]>
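The V2 change from the magic constant to a symbolic offset can be checked at compile time. This sketch uses the user-space headers `<linux/if_ether.h>` for `ETH_HLEN` and `<netinet/ip.h>` for `struct iphdr`. The macro name `IP_PROTO_OFF` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_ether.h>  /* ETH_HLEN */
#include <netinet/ip.h>      /* struct iphdr */

/* Offset of the IPv4 protocol byte in an Ethernet frame, spelled the
 * way the V2 sample does instead of the bare constant 14 + 9. */
#define IP_PROTO_OFF (ETH_HLEN + offsetof(struct iphdr, protocol))

_Static_assert(IP_PROTO_OFF == 14 + 9, "protocol byte is at frame offset 23");
```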
2014-12-05	samples: bpf: large eBPF program in C	Alexei Starovoitov	3	-0/+263
	sockex2_kern.c is a purposefully large eBPF program in C: llvm compiles ~200 lines of C code into ~300 eBPF instructions. It is similar to __skb_flow_dissect(), to demonstrate that complex packet parsing can be done by eBPF. It then uses the (struct flow_keys)->dst IP address (or a hash of the IPv6 dst) to keep stats of the number of packets per IP. User space loads the eBPF program, attaches it to the loopback interface and prints dest_ip->#packets stats every second. Usage:
	$ sudo samples/bpf/sockex2
	ip 127.0.0.1 count 19
	ip 127.0.0.1 count 178115
	ip 127.0.0.1 count 369437
	ip 127.0.0.1 count 559841
	ip 127.0.0.1 count 750539
	Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2014-12-05	samples: bpf: trivial eBPF program in C	Alexei Starovoitov	4	-1/+89
	This example does the same task as the previous socket example in assembler, but this one does it in C. The eBPF program in the kernel does:
	/* assume that packet is IPv4, load one byte of IP->proto */
	int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
	long *value;

	value = bpf_map_lookup_elem(&my_map, &index);
	if (value)
		__sync_fetch_and_add(value, 1);
	The corresponding user space program reads map[tcp], map[udp], map[icmp] and prints protocol stats every second. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
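The same counting logic can be modeled in plain user-space C to show what the program computes. This is not eBPF. The `proto_count` array, `map_lookup()` and `count_packet()` are hypothetical stand-ins for the array map, `bpf_map_lookup_elem()` and the socket filter. `__sync_fetch_and_add()` is the GCC/Clang builtin the sample also uses.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_ether.h>  /* ETH_HLEN */
#include <netinet/ip.h>      /* struct iphdr, IPPROTO_* */

/* Hypothetical stand-in for the eBPF array map: one counter per protocol. */
static long proto_count[256];

/* Stand-in for bpf_map_lookup_elem() on an array map: NULL if out of range. */
static long *map_lookup(unsigned int index)
{
	return index < 256 ? &proto_count[index] : NULL;
}

/* Same logic as the kernel-side program, applied to a raw Ethernet frame
 * that is assumed to carry IPv4. */
static void count_packet(const unsigned char *frame, size_t len)
{
	size_t off = ETH_HLEN + offsetof(struct iphdr, protocol);
	long *value;

	if (len <= off)
		return;  /* too short to hold the protocol byte */
	value = map_lookup(frame[off]);
	if (value)
		__sync_fetch_and_add(value, 1);
}
```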
2014-12-05	samples: bpf: elf_bpf file loader	Alexei Starovoitov	3	-0/+267
	Simple .o parser and loader using the BPF syscall. The .o is a standard ELF file generated by the LLVM backend. It parses the ELF file compiled by llvm (.c->.o):
	- parses 'maps' section and creates maps via BPF syscall
	- parses 'license' section and passes it to syscall
	- parses ELF relocations for BPF maps and adjusts BPF_LD_IMM64 insns by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
	- loads eBPF programs via BPF syscall
	One ELF file can contain multiple BPF programs. int load_bpf_file(char *path); populates prog_fd[] and map_fd[] with FDs received from the bpf syscall. bpf_helpers.h: helper functions available to eBPF programs written in C. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
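The relocation step can be sketched with the `struct bpf_insn` definition from `<linux/bpf.h>`. The sketch assumes the caller has already turned a relocation entry into an instruction index (the relocation offset divided by `sizeof(struct bpf_insn)`) and looked up the matching map FD. Under those assumptions, patching the instruction looks roughly like this. The function name is hypothetical.

```c
#include <assert.h>
#include <linux/bpf.h>

/* Patch the instruction at @insn_idx, found via an ELF relocation against
 * a map symbol, so the kernel resolves it to map @map_fd. Returns 0 on
 * success, -1 if the instruction is not a 64-bit immediate load. */
static int fixup_map_insn(struct bpf_insn *insns, int insn_idx, int map_fd)
{
	struct bpf_insn *insn = &insns[insn_idx];

	if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
		return -1;
	insn->src_reg = BPF_PSEUDO_MAP_FD;  /* marks imm as holding a map fd */
	insn->imm = map_fd;
	return 0;
}
```

A `BPF_LD_IMM64` occupies two instruction slots. Only the first carries the map FD in `imm`, so the sketch patches a single index.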