path: root/net/core
Age         Commit message  [Author; files changed, lines -/+]
2013-11-03  net: skb_checksum: allow custom update/combine for walking skb  [Daniel Borkmann; 1 file, -10/+21]

Currently, skb_checksum walks over 1) linearized, 2) frags[], and 3) frag_list
data and calculates the one's complement, a 32 bit result suitable for feeding
into itself or csum_tcpudp_magic(), but unsuitable for SCTP as we're
calculating CRC32c there. Hence, in order to not re-implement the very same
function in SCTP (and maybe other protocols) over and over again, use an
update() + combine() callback internally to allow for walking over the skb
with different algorithms.

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
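For reference, a minimal sketch of the resulting interface: __skb_checksum()
walks the skb and applies caller-supplied callbacks. The struct and prototype
follow this commit; the CRC32c callbacks below are hypothetical stand-ins for
what an SCTP-style user would plug in.

	struct skb_checksum_ops {
		__wsum (*update)(const void *mem, int len, __wsum wsum);
		__wsum (*combine)(__wsum csum, __wsum csum2, int offset, int len);
	};

	__wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
			      __wsum csum, const struct skb_checksum_ops *ops);

	/* hypothetical CRC32c user */
	static const struct skb_checksum_ops crc32c_ops = {
		.update  = my_crc32c_update,	/* stand-in name */
		.combine = my_crc32c_combine,	/* stand-in name */
	};

	csum = __skb_checksum(skb, offset, len, seed, &crc32c_ops);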
2013-11-02  net: flow_dissector: fail on evil iph->ihl  [Jason Wang; 1 file, -1/+1]

We don't validate iph->ihl, which may lead to a dead loop if we meet an IPIP
skb whose iph->ihl is zero. Fix this by failing immediately when iph->ihl is
evil (less than 5). This issue was introduced by commit
ec5efe7946280d1e84603389a1030ccec0a767ae (rps: support IPIP encapsulation).

Cc: Eric Dumazet <[email protected]>
Cc: Petr Matousek <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
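The shape of the fix, as a sketch of the flow_dissector IPIP path (context
abbreviated; the ihl test is the point here):

	const struct iphdr *iph;
	struct iphdr _iph;

	iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
	if (!iph || iph->ihl < 5)	/* ihl < 5 is malformed; 0 would loop */
		return false;
	nhoff += iph->ihl * 4;		/* only safe to advance after the check */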
2013-10-29  net, mc: fix the incorrect comments in two mc-related functions  [Zhi Yong Wu; 1 file, -2/+2]

Signed-off-by: Zhi Yong Wu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-29  net, iovec: fix the incorrect comment in memcpy_fromiovecend()  [Zhi Yong Wu; 1 file, -1/+1]

Signed-off-by: Zhi Yong Wu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-29  net, datagram: fix the incorrect comment in zerocopy_sg_from_iovec()  [Zhi Yong Wu; 1 file, -1/+1]

Signed-off-by: Zhi Yong Wu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-25  netpoll: fix rx_hook() interface by passing the skb  [Antonio Quartulli; 1 file, -13/+18]

Right now skb->data is passed to rx_hook() even if the skb has not been
linearised and without giving rx_hook() a way to linearise it.

Change the rx_hook() interface and make it accept the skb and the offset to
the UDP payload as arguments. rx_hook() is also renamed to rx_skb_hook() to
ensure that out-of-tree users notice the API change.

In this way any rx_skb_hook() implementation can perform all the needed
operations to properly (and safely) access the skb data.

Signed-off-by: Antonio Quartulli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
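For reference, the reworked hook member in struct netpoll looks roughly like
this (reconstructed from the description above; the exact parameter set may
differ):

	struct netpoll {
		...
		/* was: void (*rx_hook)(struct netpoll *, int, char *, int); */
		void (*rx_skb_hook)(struct netpoll *np, int source,
				    struct sk_buff *skb, int offset, int len);
		...
	};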
2013-10-25  net: fix rtnl notification in atomic context  [Alexei Starovoitov; 2 files, -12/+13]

commit 991fb3f74c "dev: always advertise rx_flags changes via netlink"
introduced rtnl notification from __dev_set_promiscuity(), which can be
called in atomic context.

Steps to reproduce:

	ip tuntap add dev tap1 mode tap
	ifconfig tap1 up
	tcpdump -nei tap1 &
	ip tuntap del dev tap1 mode tap

[ 271.627994] device tap1 left promiscuous mode
[ 271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
[ 271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
[ 271.677525] INFO: lockdep is turned off.
[ 271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G W 3.12.0-rc3+ #73
[ 271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 271.731254]  ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
[ 271.760261]  ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
[ 271.790683]  0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
[ 271.822332] Call Trace:
[ 271.838234]  [<ffffffff817544e5>] dump_stack+0x55/0x76
[ 271.854446]  [<ffffffff8108bad1>] __might_sleep+0x181/0x240
[ 271.870836]  [<ffffffff81110ff8>] ? rcu_irq_exit+0x68/0xb0
[ 271.887076]  [<ffffffff811a80be>] kmem_cache_alloc_node+0x4e/0x2a0
[ 271.903368]  [<ffffffff810b4ddc>] ? vprintk_emit+0x1dc/0x5a0
[ 271.919716]  [<ffffffff81614d67>] ? __alloc_skb+0x57/0x2a0
[ 271.936088]  [<ffffffff810b4de0>] ? vprintk_emit+0x1e0/0x5a0
[ 271.952504]  [<ffffffff81614d67>] __alloc_skb+0x57/0x2a0
[ 271.968902]  [<ffffffff8163a0b2>] rtmsg_ifinfo+0x52/0x100
[ 271.985302]  [<ffffffff8162ac6d>] __dev_notify_flags+0xad/0xc0
[ 272.001642]  [<ffffffff8162ad0c>] __dev_set_promiscuity+0x8c/0x1c0
[ 272.017917]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
[ 272.033961]  [<ffffffff8162b109>] dev_set_promiscuity+0x29/0x50
[ 272.049855]  [<ffffffff8172e937>] packet_dev_mc+0x87/0xc0
[ 272.065494]  [<ffffffff81732052>] packet_notifier+0x1b2/0x380
[ 272.080915]  [<ffffffff81731ea5>] ? packet_notifier+0x5/0x380
[ 272.096009]  [<ffffffff81761c66>] notifier_call_chain+0x66/0x150
[ 272.110803]  [<ffffffff8108503e>] __raw_notifier_call_chain+0xe/0x10
[ 272.125468]  [<ffffffff81085056>] raw_notifier_call_chain+0x16/0x20
[ 272.139984]  [<ffffffff81620190>] call_netdevice_notifiers_info+0x40/0x70
[ 272.154523]  [<ffffffff816201d6>] call_netdevice_notifiers+0x16/0x20
[ 272.168552]  [<ffffffff816224c5>] rollback_registered_many+0x145/0x240
[ 272.182263]  [<ffffffff81622641>] rollback_registered+0x31/0x40
[ 272.195369]  [<ffffffff816229c8>] unregister_netdevice_queue+0x58/0x90
[ 272.208230]  [<ffffffff81547ca0>] __tun_detach+0x140/0x340
[ 272.220686]  [<ffffffff81547ed6>] tun_chr_close+0x36/0x60

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-25  net: initialize hashrnd in flow_dissector with net_get_random_once  [Hannes Frederic Sowa; 1 file, -13/+21]

We can also defer the initialization of hashrnd in flow_dissector to its
first use. Since net_get_random_once is irq safe now, we don't have to audit
the call paths if one of these functions gets called by an interrupt handler.

Cc: David S. Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-25  net: make net_get_random_once irq safe  [Hannes Frederic Sowa; 1 file, -3/+4]

I initially built a non-irq-safe version of net_get_random_once because I
wanted the freedom to defer even the extraction process of get_random_bytes
until the nonblocking pool is fully seeded.

I don't think this is a good idea anymore, and thus this patch makes
net_get_random_once irq safe. Now a user of net_get_random_once does not need
to care from where it is called.

Cc: David S. Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-25  net: add missing dev_put() in __netdev_adjacent_dev_insert  [Nikolay Aleksandrov; 1 file, -0/+1]

I think that a dev_put() is needed in the error path to preserve the proper
dev refcount.

CC: Veaceslav Falico <[email protected]>
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Acked-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller; 1 file, -0/+2]

Conflicts:
	drivers/net/usb/qmi_wwan.c
	include/net/dst.h

Trivial merge conflicts, both were overlapping changes.

Signed-off-by: David S. Miller <[email protected]>
2013-10-23  net: always inline net_secret_init  [Hannes Frederic Sowa; 1 file, -1/+1]

Currently net_secret_init does not get inlined, so we always have a call to
net_secret_init even in the fast path.

Let's specify net_secret_init as __always_inline so we have the nop in the
fast path without the call to net_secret_init and the unlikely path at the
epilogue of the function.

jump_labels handle the inlining correctly.

Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-22  net: remove function sk_reset_txq()  [ZHAO Gang; 1 file, -6/+0]

What sk_reset_txq() does is just call sk_tx_queue_reset(), and sk_reset_txq()
is used only in sock.h, by dst_negative_advice(). Let dst_negative_advice()
call sk_tx_queue_reset() directly so we can remove the unneeded
sk_reset_txq().

Signed-off-by: ZHAO Gang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-21  ipv6: sit: add GSO/TSO support  [Eric Dumazet; 1 file, -0/+1]

Now that ipv6_gso_segment() is stackable, it's relatively easy to implement
GSO/TSO support for SIT tunnels.

Performance results, when segmentation is done after the tunnel device (as no
NIC is yet enabled for TSO SIT support):

Before patch:

lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      3168.31   4.81     4.64     2.988   2.877

After patch:

lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      5525.00   7.76     5.17     2.763   1.840

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-19  net: switch net_secret key generation to net_get_random_once  [Hannes Frederic Sowa; 1 file, -12/+2]

Cc: Eric Dumazet <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-19  net: introduce new macro net_get_random_once  [Hannes Frederic Sowa; 1 file, -0/+48]

net_get_random_once is a new macro which handles the initialization of secret
keys. It is possible to call it in the fast path. Only the initialization
depends on the spinlock and is rather slow. Otherwise it should get used just
before the key is used, to delay the entropy extraction as late as possible
to get better randomness. It returns true if the key got initialized.

The usage of static_keys for net_get_random_once is a bit uncommon, so it
needs some further explanation why this actually works:

=== In the simple non-HAVE_JUMP_LABEL case we actually have ===
no constraints to use static_key_(true|false) on keys initialized with
STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of the
likely case that the initialization is already done. The key is initialized
like this:

	___done_key = { .enabled = ATOMIC_INIT(0) }

The check

	if (!static_key_true(&___done_key))			\

expands into (pseudo code)

	if (!likely(___done_key > 0))

, so we take the fast path as soon as ___done_key is increased from the
helper function.

=== If HAVE_JUMP_LABELs are available this depends ===
on patching of jumps into the prepared NOPs, which is done in jump_label_init
at boot-up time (from start_kernel). It is forbidden and dangerous to use
net_get_random_once in functions which are called before that!

At compilation time NOPs are generated at the call sites of
net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we need
to call net_get_random_once two times in inet6_ehashfn, so two NOPs):

	71: 0f 1f 44 00 00	nopl 0x0(%rax,%rax,1)
	76: 0f 1f 44 00 00	nopl 0x0(%rax,%rax,1)

Both will be patched to the actual jumps to the end of the function to call
__net_get_random_once at boot time, as explained above.

arch_static_branch is optimized and inlined for false as return value, and
actually also returns false in case the NOP is placed in the instruction
stream. So in the fast case we get a "return false". But because we
initialize ___done_key with (enabled != (entries & 1)) this call-site will
get patched up at boot, thus returning true. The final check looks like this:

	if (!static_key_true(&___done_key))			\
		___ret = __net_get_random_once(buf,		\

expands to

	if (!!static_key_false(&___done_key))			\
		___ret = __net_get_random_once(buf,		\

So we get true at boot time, and as soon as static_key_slow_inc is called on
the key it will invert the logic and return false for the fast path.

static_key_slow_inc will change the branch because it got initialized with
.enabled == 0. After static_key_slow_inc is called on the key, the branch is
replaced with a nop again.

=== Misc: ===
The helper defers the increment into a workqueue so we don't have problems
calling this code from atomic sections. A separate boolean (___done) guards
the case where we enter net_get_random_once again before the increment
happened.

Cc: Ingo Molnar <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
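A minimal usage sketch (the caller is illustrative; the hashrnd pattern
matches how the flow_dissector commit above uses the macro):

	static u32 hashrnd __read_mostly;

	static u32 compute_flow_hash(u32 a, u32 b, u32 c)	/* illustrative */
	{
		/* after boot this is a patched static-key test, nearly free;
		 * only the very first call takes the slow init path */
		net_get_random_once(&hashrnd, sizeof(hashrnd));
		return jhash_3words(a, b, c, hashrnd);
	}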
2013-10-19  ipip: add GSO/TSO support  [Eric Dumazet; 1 file, -0/+1]

Now that inet_gso_segment() is stackable, it's relatively easy to implement
GSO/TSO support for IPIP.

Performance results, when segmentation is done after the tunnel device (as no
NIC is yet enabled for TSO IPIP support):

Before patch:

lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      3357.88   5.09     3.70     2.983   2.167

After patch:

lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      7710.19   4.52     6.62     1.152   1.687

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-19  ipv4: gso: make inet_gso_segment() stackable  [Eric Dumazet; 1 file, -0/+2]

In order to support GSO on IPIP, we need to make inet_gso_segment() stackable.
It should not assume the network header starts right after the mac header.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-19  net: generalize skb_segment()  [Eric Dumazet; 1 file, -17/+5]

While implementing GSO/TSO support for IPIP, I found skb_segment() was
assuming the network header was immediately following the mac header.

It's not really true in the case inet_gso_segment() is stacked: by the time
tcp_gso_segment() is called, the network header points to the inner IP header.

Let's instead assume nothing and pick the current offsets found in the
original skb; we have the skb_headers_offset_update() helper for that.

Also move the csum_start update inside skb_headers_offset_update().

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
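A sketch of the helper after this change (details hedged; the exact guards
may differ by version, but the idea is that every recorded header offset,
including csum_start, moves by the same delta):

	static void skb_headers_offset_update(struct sk_buff *skb, int off)
	{
		/* csum_start now moves together with the headers */
		if (skb->ip_summed == CHECKSUM_PARTIAL)
			skb->csum_start += off;
		skb->transport_header += off;
		skb->network_header += off;
		if (skb_mac_header_was_set(skb))
			skb->mac_header += off;
		skb->inner_transport_header += off;
		skb->inner_network_header += off;
		skb->inner_mac_header += off;
	}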
2013-10-19  Merge 3.12-rc6 into driver-core-next  [Greg Kroah-Hartman; 3 files, -6/+74]

We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-10-18  net: refactor sk_page_frag_refill()  [Eric Dumazet; 1 file, -4/+23]

While working on virtio_net's new allocation strategy to increase the
payload/truesize ratio, we found that refactoring sk_page_frag_refill() was
needed.

This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for sk_page_frag_refill().

Michael will either use netdev_alloc_frag() from softirq context, or
skb_page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations).

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Michael Dalton <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
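The resulting split, roughly (the wrapper body is a sketch; the 32-byte
minimum is the one mentioned above):

	/* socket-independent half, usable without a socket */
	bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag,
				  gfp_t gfp);

	/* sk_page_frag_refill() becomes a thin wrapper, roughly: */
	bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
	{
		if (skb_page_frag_refill(32U, pfrag, sk->sk_allocation))
			return true;

		sk_enter_memory_pressure(sk);
		sk_stream_moderate_sndbuf(sk);
		return false;
	}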
2013-10-10  net: gro: allow to build full sized skb  [Eric Dumazet; 1 file, -17/+26]

skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb, typically
24616 bytes, because it fills up to MAX_SKB_FRAGS frags.

It's relatively easy to extend the skb using frag_list to allow more frags to
be appended into the last sk_buff.

This still builds very efficient skbs, and allows reaching 45 MSS per skb.
(A 45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buffs.)

High speed TCP flows benefit from this extension by lowering TCP stack cpu
usage (less packets stored in receive queue, less ACK packets processed).

Forwarding setups could be hurt, as such skbs will need to be linearized,
although it's not a new problem, as GRO could already provide skbs with a
frag_list.

We could make the 65536 bytes threshold a tunable to mitigate this.
(The first time we need to linearize an skb in skb_needs_linearize(), we
could lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs.)

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-09  net: secure_seq: Fix warning when CONFIG_IPV6 and CONFIG_INET are not selected  [Fabio Estevam; 1 file, -0/+2]

net_secret() is only used when CONFIG_IPV6 or CONFIG_INET are selected.

Building a defconfig with both of these symbols unselected (using the ARM
at91sam9rl_defconfig, for example) leads to the following build warning:

$ make at91sam9rl_defconfig
#
# configuration written to .config
#
$ make net/core/secure_seq.o
scripts/kconfig/conf --silentoldconfig Kconfig
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
make[1]: `include/generated/mach-types.h' is up to date.
  CALL    scripts/checksyscalls.sh
  CC      net/core/secure_seq.o
net/core/secure_seq.c:17:13: warning: 'net_secret_init' defined but not used [-Wunused-function]

Fix this warning by protecting the definition of net_secret() with these
symbols.

Reported-by: Olof Johansson <[email protected]>
Signed-off-by: Fabio Estevam <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-08  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller; 3 files, -5/+7]

Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" for function declarations in netdevice.h at the same time
"const" was added to an argument.

Two parallel line additions in net/core/sock.c.

Signed-off-by: David S. Miller <[email protected]>
2013-10-08  pkt_sched: fq: fix non TCP flows pacing  [Eric Dumazet; 1 file, -0/+1]

Steinar reported FQ pacing was not working for UDP flows.

It looks like the initial sk->sk_pacing_rate value of 0 was a wrong choice.
We should init it to ~0U (unlimited).

Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes no real
sense. The default rate is really unlimited, and we need to avoid a zero
divide.

Reported-by: Steinar H. Gunderson <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-08  cgroup: netprio: remove unnecessary task_netprioidx  [Gao feng; 1 file, -2/+1]

Since the tasks have been migrated to the cgroup, there is no need to call
task_netprioidx to get the task's cgroup id.

Signed-off-by: Gao feng <[email protected]>
Acked-by: Neil Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-07  net: Separate the close_list and the unreg_list v2  [Eric W. Biederman; 1 file, -11/+14]

Separate the unreg_list and the close_list in dev_close_many, preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed after
it has been freed in code such as dst_ifdown, resulting in subtle memory
corruption.

This is the second bug from sharing the storage between the close_list and
the unreg_list. The issues that crop up with sharing are apparently too
subtle to show up in normal testing or usage, so let's forget about being
clever and use two separate lists.

v2: Make all callers pass in a close_list to dev_close_many.

Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
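Callers now look roughly like this (a sketch; rollback_registered_many is one
such caller):

	LIST_HEAD(close_list);

	list_for_each_entry(dev, head, unreg_list)
		list_add_tail(&dev->close_list, &close_list);

	dev_close_many(&close_list);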
2013-10-07  net: fix unsafe set_memory_rw from softirq  [Alexei Starovoitov; 1 file, -4/+4]

on x86 system with net.core.bpf_jit_enable = 1

	sudo tcpdump -i eth1 'tcp port 22'

causes the warning:

[ 56.766097] Possible unsafe locking scenario:
[ 56.766097]
[ 56.780146]       CPU0
[ 56.786807]       ----
[ 56.793188]  lock(&(&vb->lock)->rlock);
[ 56.799593]  <Interrupt>
[ 56.805889]    lock(&(&vb->lock)->rlock);
[ 56.812266]
[ 56.812266] *** DEADLOCK ***
[ 56.812266]
[ 56.830670] 1 lock held by ksoftirqd/1/13:
[ 56.836838]  #0: (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
[ 56.849757]
[ 56.849757] stack backtrace:
[ 56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[ 56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
[ 56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
[ 56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
[ 56.923006] Call Trace:
[ 56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
[ 56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
[ 56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
[ 56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
[ 56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
[ 56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
[ 56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
[ 56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
[ 56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
[ 56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
[ 57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
[ 57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
[ 57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
[ 57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
[ 57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
[ 57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
[ 57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
[ 57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
[ 57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
[ 57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
[ 57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
[ 57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70

cannot reuse jited filter memory, since it's readonly,
so use original bpf insns memory to hold work_struct

defer kfree of sk_filter until jit completed freeing

tested on x86_64 and i386

Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-10-07  netif_set_xps_queue: make cpu mask const  [Michael S. Tsirkin; 1 file, -1/+2]

virtio wants to pass in cpumask_of(cpu), so make the parameter const to avoid
build warnings.

Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
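The resulting prototype (the const qualifier is the change):

	int netif_set_xps_queue(struct net_device *dev,
				const struct cpumask *mask, u16 index);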
2013-10-03  flow_dissector: factor out the ports extraction in skb_flow_get_ports  [Nikolay Aleksandrov; 1 file, -11/+28]

Factor out the code that extracts the ports from skb_flow_dissect and add a
new function skb_flow_get_ports which can be re-used.

Suggested-by: Veaceslav Falico <[email protected]>
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Reviewed-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
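The factored-out helper's prototype, per this commit; it returns the src/dst
port pair packed into a single __be32, matching how struct flow_keys stores
ports:

	__be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff,
				  u8 ip_proto);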
2013-10-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller; 3 files, -6/+74]

Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/usb/qmi_wwan.c
	drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
	include/net/netfilter/nf_conntrack_synproxy.h
	include/net/secure_seq.h

The conflicts are of two varieties:

1) Conflicts with Joe Perches's 'extern' removal from header file function
   declarations. Usually it's an argument signature change or a function
   being added/removed. The resolutions are trivial.

2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds a new
   value, another changes an existing value. That sort of thing.

Signed-off-by: David S. Miller <[email protected]>
2013-09-30  net: flow_dissector: fix thoff for IPPROTO_AH  [Eric Dumazet; 1 file, -2/+2]

In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for later
usage"), we missed that existing code was using nhoff as a temporary variable
that could not always contain the transport header offset.

This is not a problem for TCP/UDP because the port offset (@poff) is 0 for
these protocols.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Nikolay Aleksandrov <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-30  dev: always advertise rx_flags changes via netlink  [Nicolas Dichtel; 1 file, -23/+37]

When flags IFF_PROMISC and IFF_ALLMULTI are changed, netlink messages are not
consistent. For example, if a multicast daemon is running (flag IFF_ALLMULTI
set in dev->flags but not dev->gflags, i.e. not exported to userspace) and
then a user sets it via netlink (flag IFF_ALLMULTI set in dev->flags and
dev->gflags, i.e. exported to userspace), no netlink message is sent.

Same for IFF_PROMISC, and because dev->promiscuity is exported via
IFLA_PROMISCUITY, we may send a netlink message after each change of this
counter.

Signed-off-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-30  dev: update __dev_notify_flags() to send rtnl msg  [Nicolas Dichtel; 2 files, -7/+7]

This patch only prepares the next one, there is no functional change. Now,
__dev_notify_flags() can also be used to notify flags changes via rtnetlink.

Signed-off-by: Nicolas Dichtel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-28  net: introduce SO_MAX_PACING_RATE  [Eric Dumazet; 1 file, -0/+12]

As mentioned in commit afe4fd062416b ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.

SO_MAX_PACING_RATE offers the application the ability to cap the rate
computed by the transport layer. Value is in bytes per second.

	u32 val = 1000000;
	setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));

To be effectively paced, a flow must use the FQ packet scheduler.

Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits,
if any.

I chose to make this pacing rate a SOL_SOCKET option instead of a TCP one
because this can be used by other protocols.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Steinar H. Gunderson <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-28  net: net_secret should not depend on TCP  [Eric Dumazet; 1 file, -3/+24]

A host might need net_secret[] and never open a single socket.

Problem added in commit aebda156a570782 ("net: defer net_secret[]
initialization").

Based on a prior patch from Hannes Frederic Sowa.

Reported-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-28  net: Delay default_device_exit_batch until no devices are unregistering v2  [Eric W. Biederman; 1 file, -1/+48]

There is currently no serialization between network namespaces exiting and
network devices exiting, as the final part of netdev_run_todo does not happen
under the rtnl_lock. This is compounded by the fact that the only list of
devices unregistering in netdev_run_todo is local to netdev_run_todo.

This lack of serialization in extreme cases results in network devices
unregistering in netdev_run_todo after the loopback device of their network
namespace has been freed (making dst_ifdown unsafe), and after their network
namespace has exited (making the NETDEV_UNREGISTER and
NETDEV_UNREGISTER_FINAL callbacks unsafe).

Add the missing serialization by a per network namespace count of how many
network devices are unregistering, and a wait queue that is woken up whenever
the count is decreased. The count and wait queue allow
default_device_exit_batch to wait until all of the unregistration activity
for a network namespace has finished before proceeding to unregister the
loopback device and then allowing the network namespace to exit.

Only a single global wait queue is used because there is a single global
lock, and there is a single waiter; per network namespace wait queues would
be a waste of resources.

The per network namespace count of unregistering devices gives a progress
guarantee because the number of network devices unregistering in an exiting
network namespace must ultimately drop to zero (assuming network device
unregistration completes).

The basic logic remains the same as in v1. This patch is now half comment,
and half rtnl_lock_unregistering, an expanded version of wait_event that
performs no extra work in the common case where no network devices are
unregistering when we get to default_device_exit_batch.

Reported-by: Francesco Ruggeri <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
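The waiting side, close to the shape described above (a sketch; the real
rtnl_lock_unregistering re-checks the per-namespace counts under the lock to
avoid races):

	static void rtnl_lock_unregistering(struct list_head *net_list)
	{
		struct net *net;
		bool unregistering;
		DEFINE_WAIT(wait);

		for (;;) {
			prepare_to_wait(&netdev_unregistering_wq, &wait,
					TASK_UNINTERRUPTIBLE);
			unregistering = false;
			rtnl_lock();
			list_for_each_entry(net, net_list, exit_list) {
				if (net->dev_unreg_count > 0) {
					unregistering = true;
					break;
				}
			}
			if (!unregistering)
				break;		/* keep rtnl held */
			__rtnl_unlock();
			schedule();
		}
		finish_wait(&netdev_unregistering_wq, &wait);
	}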
2013-09-26  sysfs: make attr namespace interface less convoluted  [Tejun Heo; 1 file, -6/+8]

sysfs ns (namespace) implementation became more convoluted than necessary
while trying to hide ns information from the visible interface. The
relatively recent attr ns support is a good example.

* attr ns tag is determined by the sysfs_ops->namespace() callback while the
  dir tag is determined by kobj_type->namespace(). The placement is
  arbitrary.

* Instead of performing operations with an explicit ns tag, the namespace
  callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(),
  class_attr_namespace(), class_attr->namespace(). It's not simpler in any
  sense. The only thing this convolution does is traversing the whole stack
  backwards.

The namespace callbacks are unnecessary because the operations involved are
inherently synchronous. The information can be provided in a straightforward
top-down direction, and reversing that direction is unnecessary and against
basic design principles.

This backward interface is unnecessarily convoluted and hinders properly
separating out sysfs from the driver model / kobject for proper layering.

This patch updates attr ns support such that

* sysfs_ops->namespace() and class_attr->namespace() are dropped.

* sysfs_{create|remove}_file_ns(), which take an explicit @ns param, are
  added and sysfs_{create|remove}_file() are now simple wrappers around the
  ns aware functions.

* ns handling is dropped from sysfs_chmod_file(). Nobody uses it at this
  point. sysfs_chmod_file_ns() can be added later if necessary.

* Explicit @ns is propagated through class_{create|remove}_file_ns() and
  netdev_class_{create|remove}_file_ns().

* driver/net/bonding, which is currently the only user of attr namespace, is
  updated to use netdev_class_{create|remove}_file_ns() with @bh->net as the
  ns tag instead of using the namespace callback.

This patch should be an equivalent conversion without any functional
difference. It makes the code easier to follow, reduces lines of code a bit
and helps proper separation and layering.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Kay Sievers <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
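The new ns-aware entry points, with the plain versions as wrappers (a sketch
of the resulting interface):

	int sysfs_create_file_ns(struct kobject *kobj,
				 const struct attribute *attr, const void *ns);
	void sysfs_remove_file_ns(struct kobject *kobj,
				  const struct attribute *attr, const void *ns);

	static inline int sysfs_create_file(struct kobject *kobj,
					    const struct attribute *attr)
	{
		return sysfs_create_file_ns(kobj, attr, NULL);
	}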
2013-09-26  net: create sysfs symlinks for neighbour devices  [Veaceslav Falico; 1 file, -1/+34]

Also, remove the same functionality from bonding - it will already be done
for any device that links to its lower/upper neighbour.

The links will be created for the dev's kobject, and will look like
lower_eth0 for the lower device eth0 and upper_bridge0 for the upper device
bridge0.

CC: Jay Vosburgh <[email protected]>
CC: Andy Gospodarek <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: expose the master link to sysfs, and remove it from bond  [Veaceslav Falico; 1 file, -2/+17]

Currently, we can have only one master upper neighbour, so it would be useful
to create a symlink to it in the sysfs device directory, the way that bonding
now does it, for every device. Lower devices from bridge/team/etc will
automagically get it, so we could rely on it.

Also, remove the same functionality from bonding.

CC: Jay Vosburgh <[email protected]>
CC: Andy Gospodarek <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: add a possibility to get private from netdev_adjacent->list  [Veaceslav Falico; 1 file, -0/+10]

It will be useful to get the first/last element.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: add for_each iterators through neighbour lower link's private  [Veaceslav Falico; 1 file, -1/+59]

Add a possibility to iterate through netdev_adjacent's private, currently
only for lower neighbours.

Add both RCU and RTNL/other locking variants of iterators, and make the
non-rcu variant safe from removal.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: add netdev_adjacent->private and allow to use it  [Veaceslav Falico; 1 file, -11/+57]

Currently, even though we can access any linked device, we can't attach
anything to it, which is vital to properly manage them.

To fix this, add a new void *private to netdev_adjacent and functions
setting/getting it (per link), so that we can save, for example, bonding's
slave structures there, per slave device.

netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
upper_dev and populates the neighbour link only with private.

netdev_lower_dev_get_private{,_rcu}() returns the private, if found.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
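Plausible prototypes, reconstructed from the names quoted above (the
signatures are my reading of the description, not verified against the
patch):

	int netdev_master_upper_dev_link_private(struct net_device *dev,
						 struct net_device *upper_dev,
						 void *private);
	void *netdev_lower_dev_get_private(struct net_device *dev,
					   struct net_device *lower_dev);
	void *netdev_lower_dev_get_private_rcu(struct net_device *dev,
					       struct net_device *lower_dev);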
2013-09-26  net: add RCU variant to search for netdev_adjacent link  [Veaceslav Falico; 1 file, -0/+13]

Currently we have only the RTNL flavour, however we can traverse it while
holding only RCU, so add the RCU search. Add an RCU variant that uses
list_head * as an argument, so that it can be universally used afterwards.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Cong Wang <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: add adj_list to save only neighbours  [Veaceslav Falico; 1 file, -100/+103]

Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in struct netdev_adjacent.

This could be quite time-consuming in case we would like to traverse *only*
through neighbours, because we'd have to traverse through all devices and
check for this flag. In a (quite common) scenario where we have lots of vlans
on top of a bridge, which is on top of a bond, the bonding would have to go
through all those vlans to get its upper neighbour linked devices.

This situation is really unpleasant, because there are already a lot of cases
when a device with slaves needs to go through them in the hot path.

To fix this, introduce a new upper/lower device lists structure - adj_list -
which contains only the neighbours. It always works in pair with the
all_adj_list structure (renamed from upper/lower_dev_list), i.e. both of them
contain the same links, only that all_adj_list also contains non-neighbour
device links. It's really a small change, visible, currently, only for
__netdev_adjacent_dev_insert/remove(), and doesn't change the main linked
logic at all.

Also, add some comments, fix a name collision in
netdev_for_each_upper_dev_rcu(), and rework the naming by the following rule:
netdev_(all_)(upper|lower)_*. If "all_" is present, then we work with the
whole list of upper/lower devices, otherwise - only with direct neighbours.
Uninline functions - to get better stack traces.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Cong Wang <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
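Illustratively, the naming rule plays out like this (iterator names follow
the rule above; the exact iterator set in the merged patch may differ):

	struct net_device *upper;
	struct list_head *iter;

	/* direct neighbours only, walking adj_list */
	netdev_for_each_upper_dev_rcu(dev, upper, iter) {
		/* first-level upper devices */
	}

	/* the whole list, non-neighbours included, walking all_adj_list */
	netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
		/* every upper device */
	}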
2013-09-26  net: use lists as arguments instead of bool upper  [Veaceslav Falico; 1 file, -32/+22]

Currently we make use of bool upper when we want to specify if we want to
work with the upper/lower list. It is, however, harder to read and debug,
and occupies a lot more code.

Fix this by just passing the correct upper/lower_dev_list list_head pointer
instead of bool upper, and work internally with it.

CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Jiri Pirko <[email protected]>
CC: Alexander Duyck <[email protected]>
CC: Cong Wang <[email protected]>
Signed-off-by: Veaceslav Falico <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-26  net: neighbour: use source address of last enqueued packet for solicitation  [Hannes Frederic Sowa; 1 file, -1/+1]

Currently we always use the first member of the arp_queue to determine the
sender ip address of the arp packet (or in case of IPv6 - the source address
of the ndisc packet). This skb is fixed as long as the queue is not drained
by a complete purge because of a timeout or by a successful response.

If the first packet enqueued on the arp_queue is from a local application
with a manually set source address, and the to-be-discovered system does some
kind of uRPF checks on the source address in the arp packet, the resolving
process hangs until a timeout and restarts. This hurts communication with the
participating network node.

This could be mitigated a bit if we use the latest enqueued skb's source
address for the resolving process, which is not as static as the arp_queue's
head. This change of the source address could result in better recovery of a
failed solicitation.

Cc: "David S. Miller" <[email protected]>
Cc: Julian Anastasov <[email protected]>
Reviewed-by: Julian Anastasov <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-19  netpoll: fix NULL pointer dereference in netpoll_cleanup  [Nikolay Aleksandrov; 1 file, -5/+4]

I've been hitting a NULL ptr deref while using netconsole because the np->dev
check and the pointer manipulation in netpoll_cleanup are done without rtnl,
and the following sequence happens when having a netconsole over a vlan and
we remove the vlan while disabling the netconsole:

	CPU 1					CPU 2
	removes vlan and calls
	the notifier
						enters store_enabled(), calls
						netdev_cleanup which checks
						np->dev and then waits for rtnl
	executes the netconsole netdev
	release notifier making
	np->dev == NULL and releases rtnl
						continues to dereference a
						member of np->dev which at
						this point is == NULL

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-12  netpoll: Should handle ETH_P_ARP other than ETH_P_IP in netpoll_neigh_reply  [Sonic Zhang; 1 file, -1/+1]

The received ARP request type in the Ethernet packet header is ETH_P_ARP
rather than ETH_P_IP.

[ Bug introduced by commit b7394d2429c198b1da3d46ac39192e891029ec0f
  ("netpoll: prepare for ipv6") ]

Signed-off-by: Sonic Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2013-09-11  net: fix multiqueue selection  [Eric Dumazet; 1 file, -1/+1]

commit 416186fbf8c5b4e4465 ("net: Split core bits of netdev_pick_tx into
__netdev_pick_tx") added a bug that disables caching of the queue index in
the socket.

This is the source of packet reorders for TCP flows, and again this is
happening more often when using FQ pacing.

Old code was doing:

	if (queue_index != old_index)
		sk_tx_queue_set(sk, queue_index);

Alexander renamed the variables but forgot to change the 2nd parameter of
sk_tx_queue_set():

	if (queue_index != new_index)
		sk_tx_queue_set(sk, queue_index);

This means we store -1 over and over in sk->sk_tx_queue_mapping.

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Alexander Duyck <[email protected]>
Acked-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>