|
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they can be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
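This lets ip_encap_hlen() live as a static inline in ip_tunnels.h; a
minimal sketch of its shape, with the body treated as illustrative
rather than the exact patch:
        static inline int ip_encap_hlen(struct ip_tunnel_encap *e)
        {
                const struct ip_tunnel_encap_ops *ops;
                int hlen = -EINVAL;

                if (e->type == TUNNEL_ENCAP_NONE)
                        return 0;
                if (e->type >= MAX_IPTUN_ENCAP_OPS)
                        return -EINVAL;
                rcu_read_lock();
                ops = rcu_dereference(iptun_encaps[e->type]);
                if (likely(ops && ops->encap_hlen))
                        hlen = ops->encap_hlen(e);
                rcu_read_unlock();
                return hlen;
        }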
Signed-off-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler, which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
when hitting a final protocol, in particular skb_postpull_rcsum,
can only be performed once.
This patch adds a check of whether a final protocol has already been
seen. The rules are as follows (see the sketch after this list):
- If a final protocol has not been seen, any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum).
- If a final protocol has been seen (e.g. an encapsulating UDP
header), then no further non-final protocols are allowed
(e.g. extension headers). For subsequent final protocols the
final actions are not taken again (e.g. skb_postpull_rcsum).
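A minimal sketch of the dispatch-loop rule, assuming a local
have_final flag (illustrative, not the verbatim patch):
        if (have_final) {
                /* no further non-final (e.g. extension header) protocols */
                if (!(ipprot->flags & INET6_PROTO_FINAL))
                        goto discard;
        } else if (ipprot->flags & INET6_PROTO_FINAL) {
                /* take the final actions exactly once */
                skb_postpull_rcsum(skb, skb_network_header(skb),
                                   skb_network_header_len(skb));
                have_final = true;
        }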
Signed-off-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).
This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final, then
resubmission is performed on the nhoff value. If the protocol is final,
then the nexthdr is taken to be the return value.
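A minimal sketch of the return-value handling (labels and variable
names are illustrative):
        ret = ipprot->handler(skb);
        if (ret > 0) {
                if (ipprot->flags & INET6_PROTO_FINAL) {
                        /* e.g. a UDP encap handler returned the inner protocol */
                        nexthdr = ret;
                        goto resubmit_final;
                }
                /* non-final: re-dispatch on the nexthdr stored at nhoff */
                goto resubmit;
        }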
Signed-off-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch adds two new GSO type definitions, SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6, along with the corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6 features. These are used to describe IP-in-IP
tunnels and indicate what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types SKB_GSO_IPIP and SKB_GSO_SIT
are removed (both are instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
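For illustration, a segmentation path can now derive the outer
protocol directly from the GSO type (a sketch, not taken from the
patch itself):
        /* pick the outer protocol from the new GSO types */
        if (skb_shinfo(skb)->gso_type & SKB_GSO_IPXIP6)
                outer_proto = htons(ETH_P_IPV6);
        else if (skb_shinfo(skb)->gso_type & SKB_GSO_IPXIP4)
                outer_proto = htons(ETH_P_IP);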
Signed-off-by: Tom Herbert <[email protected]>
Acked-by: Jeff Kirsher <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place, the checks seem
unnecessary. If a combination isn't valid in the first place, the
stack should not allow setting it.
This is a code simplification, especially for adding new GSO types;
the pattern being removed is sketched below.
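The kind of pattern being removed looks roughly like this (a
representative sketch, not an exact hunk):
        /* removed: defensive gso_type whitelist in a gso_segment handler */
        if (unlikely(skb_shinfo(skb)->gso_type &
                     ~(SKB_GSO_TCPV4 |
                       SKB_GSO_UDP |
                       SKB_GSO_DODGY |
                       SKB_GSO_TCP_ECN)))
                goto out;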
Signed-off-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Primary 4.7 merge window changes
- Updates to the new Intel X722 iWARP driver
- Updates to the hfi1 driver
- Fixes for the iw_cxgb4 driver
- Misc core fixes
- Generic RDMA READ/WRITE API addition
- SRP updates
- Misc ipoib updates
- Minor mlx5 updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
IB/mlx5: Fire the CQ completion handler from tasklet
net/mlx5_core: Use tasklet for user-space CQ completion events
IB/core: Do not require CAP_NET_ADMIN for packet sniffing
IB/mlx4: Fix unaligned access in send_reply_to_slave
IB/mlx5: Report Scatter FCS device capability when supported
IB/mlx5: Add Scatter FCS support for Raw Packet QP
IB/core: Add Scatter FCS create flag
IB/core: Add Raw Scatter FCS device capability
IB/core: Add extended device capability flags
i40iw: pass hw_stats by reference rather than by value
i40iw: Remove unnecessary synchronize_irq() before free_irq()
i40iw: constify i40iw_vf_cqp_ops structure
IB/mlx5: Add UARs write-combining and non-cached mapping
IB/mlx5: Allow mapping the free running counter on PROT_EXEC
IB/mlx4: Use list_for_each_entry_safe
IB/SA: Use correct free function
IB/core: Fix a potential array overrun in CMA and SA agent
IB/core: Remove unnecessary check in ibnl_rcv_msg
IB/IWPM: Fix a potential skb leak
RDMA/nes: replace custom print_hex_dump()
...
|
|
Overlayfs uses separate inodes even in the case of hard links on the
underlying filesystems. This is a problem for the AF_UNIX socket
implementation, which indexes sockets based on the inode. This resulted
in hard-linked sockets not working.
The fix is to use the real, underlying inode.
Test case follows:
-- ovl-sock-test.c --
#include <unistd.h>
#include <err.h>
#include <sys/socket.h>
#include <sys/un.h>
#define SOCK "test-sock"
#define SOCK2 "test-sock2"
int main(void)
{
        int fd, fd2;
        struct sockaddr_un addr = {
                .sun_family = AF_UNIX,
                .sun_path = SOCK,
        };
        struct sockaddr_un addr2 = {
                .sun_family = AF_UNIX,
                .sun_path = SOCK2,
        };

        unlink(SOCK);
        unlink(SOCK2);

        if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
                err(1, "bind");
        if (listen(fd, 0) == -1)
                err(1, "listen");
        if (link(SOCK, SOCK2) == -1)
                err(1, "link");
        if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
                err(1, "connect");
        return 0;
}
----
Reported-by: Alexander Morozov <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
|
|
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <[email protected]>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
|
|
page_reference manipulation functions are introduced to track
reference count changes of the page. Use them instead of directly
modifying _count.
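The conversion itself is mechanical; a representative (illustrative)
hunk:
        - atomic_inc(&page->_count);
        + page_ref_inc(page);
        - atomic_read(&page->_count)
        + page_ref_count(page)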
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Sunil Goutham <[email protected]>
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
struct timespec is not y2038 safe. Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec references in the kernel.
The patch transitions the common functions poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64. All the syscalls
that use these functions are transitioned in the same patch.
The restart block parameters for poll use monotonic time. Use
timespec64 here as well to assign the timeout value. The parameter in
the restart block itself need not change, because it only holds the
monotonic timestamp at which the timeout should occur, and an unsigned
long data type is big enough for that timestamp.
The system call interfaces will be handled in a separate series.
Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.
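For illustration, a converted caller now carries the timeout as a
timespec64 (a minimal sketch, assuming the post-patch
poll_select_set_timeout() signature):
        struct timespec64 end_time, *to = NULL;

        if (timeout_msecs >= 0) {
                to = &end_time;
                /* seconds / nanoseconds split, now y2038-safe */
                poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
                        NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
        }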
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Deepa Dinamani <[email protected]>
Acked-by: John Stultz <[email protected]>
Acked-by: David S. Miller <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Antonio Quartulli says:
====================
During the Wireless Battle Mesh v9 in Porto (PT) at the beginning of
May, we managed to uncover and fix some important bugs in our
new B.A.T.M.A.N. V algorithm. These are the fixes we came up with
together with others that I collected in the past weeks:
- avoid potential crash due to NULL pointer dereference in
B.A.T.M.A.N. V routine when a neigh_ifinfo object is not found, by
Sven Eckelmann
- avoid use-after-free of skb when counting outgoing bytes, by Florian
Westphal
- fix neigh_ifinfo object reference counting imbalance when using
B.A.T.M.A.N. V, by Sven Eckelmann. Such imbalance may lead to the
impossibility of releasing the related netdev object on shutdown
- avoid invalid memory access in case of error while allocating
bcast_own_sum when a new hard-interface is added, by Sven Eckelmann
- ensure originator address is updated in OGM/ELP packet content upon
primary interface address change, by Antonio Quartulli
- fix integer overflow when computing TQ metric (B.A.T.M.A.N. IV), by
Sven Eckelmann
- avoid race condition while adding new neigh_node which would result
in having two objects mapping to the same physical neighbour, by
Linus Lüssing
- ensure originator address is initialized in ELP packet content on
secondary interfaces, by Marek Lindner
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
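The change itself is a one-line substitution at each callback site
(illustrative hunk):
        - read_lock(&sk->sk_callback_lock);
        + read_lock_bh(&sk->sk_callback_lock);
          ...
        - read_unlock(&sk->sk_callback_lock);
        + read_unlock_bh(&sk->sk_callback_lock);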
Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jon Maloy <[email protected]>
Cc: Ying Xue <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
skb_splice_bits() returns int, kcm_splice_read() returns ssize_t,
both are signed.
We may need another patch to make them all ssize_t, but that
deserves a separate patch.
Fixes: 91687355b927 ("kcm: Splice support")
Reported-by: David Binderman <[email protected]>
Cc: Tom Herbert <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
|
|
This fix prevents nodes from wrongly creating a 00:00:00:00:00:00
originator, which can potentially interfere with the rest of the
neighbor statistics.
Fixes: d6f94d91f766 ("batman-adv: ELP - adding basic infrastructure")
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
Two parallel calls to batadv_neigh_node_new() might race to create
and add the same neigh_node. Fix this by performing the check for an
already existing, identical neigh_node within the spin-lock, as
sketched below.
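A minimal sketch of the pattern, assuming a
batadv_neigh_node_get()-style lookup helper and the orig_node's
neigh_list_lock:
        spin_lock_bh(&orig_node->neigh_list_lock);
        /* re-check under the lock: a parallel caller may have won the race */
        neigh_node = batadv_neigh_node_get(orig_node, hard_iface, neigh_addr);
        if (!neigh_node) {
                neigh_node = new_node;  /* the freshly allocated object */
                hlist_add_head_rcu(&neigh_node->list, &orig_node->neigh_list);
        }
        spin_unlock_bh(&orig_node->neigh_list_lock);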
This fixes splats like the following:
[ 739.535069] ------------[ cut here ]------------
[ 739.535079] WARNING: CPU: 0 PID: 0 at /usr/src/batman-adv/git/batman-adv/net/batman-adv/bat_iv_ogm.c:1004 batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]()
[ 739.535092] too many matching neigh_nodes
[ 739.535094] Modules linked in: dm_mod tun ip6table_filter ip6table_mangle ip6table_nat nf_nat_ipv6 ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS xt_mark iptable_mangle xt_tcpudp xt_conntrack iptable_filter ip_tables x_tables ip_gre ip_tunnel gre bridge stp llc thermal_sys kvm_intel kvm crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev pcspkr ip6_gre ip6_tunnel tunnel6 batman_adv(O) libcrc32c nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crc32c_intel
[ 739.535177] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W O 4.2.0-0.bpo.1-amd64 #1 Debian 4.2.6-3~bpo8+2
[ 739.535186] 0000000000000000 ffffffffa013b050 ffffffff81554521 ffff88007d003c18
[ 739.535201] ffffffff8106fa01 0000000000000000 ffff8800047a087a ffff880079c3a000
[ 739.735602] ffff88007b82bf40 ffff88007bc2d1c0 ffffffff8106fa7a ffffffffa013aa8e
[ 739.735624] Call Trace:
[ 739.735639] <IRQ> [<ffffffff81554521>] ? dump_stack+0x40/0x50
[ 739.735677] [<ffffffff8106fa01>] ? warn_slowpath_common+0x81/0xb0
[ 739.735692] [<ffffffff8106fa7a>] ? warn_slowpath_fmt+0x4a/0x50
[ 739.735715] [<ffffffffa012448f>] ? batadv_iv_ogm_process_per_outif+0xe3f/0xe60 [batman_adv]
[ 739.735740] [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
[ 739.735762] [<ffffffffa0124813>] ? batadv_iv_ogm_receive+0x363/0x380 [batman_adv]
[ 739.735783] [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 739.735804] [<ffffffffa012cb39>] ? batadv_batman_skb_recv+0xc9/0x110 [batman_adv]
[ 739.735825] [<ffffffff81464891>] ? __netif_receive_skb_core+0x841/0x9a0
[ 739.735838] [<ffffffff810b0841>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 739.735853] [<ffffffff81465681>] ? process_backlog+0xa1/0x140
[ 739.735864] [<ffffffff81464f1a>] ? net_rx_action+0x20a/0x320
[ 739.735878] [<ffffffff81073aa7>] ? __do_softirq+0x107/0x270
[ 739.735891] [<ffffffff81073d82>] ? irq_exit+0x92/0xa0
[ 739.735905] [<ffffffff8137e0d1>] ? xen_evtchn_do_upcall+0x31/0x40
[ 739.735924] [<ffffffff8155b8fe>] ? xen_do_hypervisor_callback+0x1e/0x40
[ 739.735939] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 739.735965] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 739.735979] [<ffffffff8100a39c>] ? xen_safe_halt+0xc/0x20
[ 739.735991] [<ffffffff8101da6c>] ? default_idle+0x1c/0xa0
[ 739.736004] [<ffffffff810abf6b>] ? cpu_startup_entry+0x2eb/0x350
[ 739.736019] [<ffffffff81b2af5e>] ? start_kernel+0x480/0x48b
[ 739.736032] [<ffffffff81b2d116>] ? xen_start_kernel+0x507/0x511
[ 739.736048] ---[ end trace c106bb901244bc8c ]---
Fixes: f987ed6ebd99 ("batman-adv: protect neighbor list with rcu locks")
Reported-by: Martin Weinelt <[email protected]>
Signed-off-by: Linus Lüssing <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
The undefined behavior sanitizer detected a signed integer overflow in
a setup with near perfect link quality:
UBSAN: Undefined behaviour in net/batman-adv/bat_iv_ogm.c:1246:25
signed integer overflow:
8713350 * 255 cannot be represented in type 'int'
The problem happens because the calculation mixes unsigned and signed
integers, resulting in a signed integer multiplication:
batadv_ogm_packet::tq (u8 255)
* tq_own (u8 255)
* tq_asym_penalty (int 134; max 255)
* tq_iface_penalty (int 255; max 255)
The observed product 255 * 255 * 134 * 255 = 8,713,350 * 255 =
2,221,904,250 exceeds INT_MAX (2,147,483,647). tq_iface_penalty,
tq_asym_penalty and inv_asym_penalty can simply be changed to unsigned
int because they are not expected to become negative.
Fixes: c039876892e3 ("batman-adv: add WiFi penalty")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
When the MAC address of the primary interface is changed,
update the originator address in the ELP and OGM skb buffers as
well in order to reflect the change.
Fixes: d6f94d91f766 ("batman-adv: ELP - adding basic infrastructure")
Reported-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
The function batadv_iv_ogm_orig_add_if allocates new buffers for bcast_own
and bcast_own_sum. It is expected that these buffers are unchanged in case
either bcast_own or bcast_own_sum couldn't be resized.
But the error handling of this function frees the already resized buffer
for bcast_own when the allocation of the new bcast_own_sum buffer failed.
This will lead to an invalid memory access when some code tries to
access bcast_own.
Instead the resized new bcast_own buffer has to be kept. This will not lead
to problems because the size of the buffer was only increased and therefore
no user of the buffer will try to access bytes outside of the new buffer.
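A minimal sketch of the corrected error path (variable names are
illustrative):
        data_ptr = kmalloc_array(max_if, sizeof(*data_ptr), GFP_ATOMIC);
        if (!data_ptr)
                /* keep the already-resized bcast_own buffer: it only grew,
                 * so every existing user still stays within bounds
                 */
                return -ENOMEM;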
Fixes: d0015fdd3d2c ("batman-adv: provide orig_node routing API")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
The function batadv_neigh_ifinfo_get increases the reference counter
of the batadv_neigh_ifinfo. The counter has to be decreased again when
the reference is no longer used so the objects can be freed correctly.
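The corresponding pattern, assuming a _put()-style release helper (the
exact helper name in this series is an assumption):
        neigh_ifinfo = batadv_neigh_ifinfo_get(neigh, if_outgoing);
        if (!neigh_ifinfo)
                return;
        /* ... use neigh_ifinfo ... */
        batadv_neigh_ifinfo_put(neigh_ifinfo); /* drop the reference taken above */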
Fixes: 9786906022eb ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
batadv_neigh_ifinfo_get can return NULL when it cannot find the
neigh_ifinfo in the list neigh->ifinfo_list (even if only
temporarily). This has to be checked to avoid kernel oopses when the
ifinfo is dereferenced. This is a situation which isn't expected, but
it is already handled by functions like batadv_v_neigh_cmp. The same
kind of warning is therefore emitted before the function returns
without dereferencing the pointers.
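A minimal sketch of the added guard (variable names are illustrative):
        ifinfo = batadv_neigh_ifinfo_get(neigh, if_outgoing);
        if (WARN_ON(!ifinfo))
                return 0;       /* bail out without dereferencing */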
Fixes: 9786906022eb ("batman-adv: B.A.T.M.A.N. V - implement neighbor comparison API calls")
Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
batadv_send_skb_to_orig() calls dev_queue_xmit() so we can't use skb->len.
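The fix pattern is to sample the length before handing the skb off (a
minimal sketch; the accounting destination is illustrative):
        unsigned int len = skb->len;    /* the xmit path may free the skb */

        ret = batadv_send_skb_to_orig(skb, orig_node, NULL);
        if (ret != NET_XMIT_DROP)
                tx_bytes += len;        /* count what was actually handed down */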
Fixes: 953324776d6d ("batman-adv: network coding - buffer unicast packets before forward")
Signed-off-by: Florian Westphal <[email protected]>
Reviewed-by: Sven Eckelmann <[email protected]>
Signed-off-by: Marek Lindner <[email protected]>
Signed-off-by: Antonio Quartulli <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
gitignore: fix wording
mfd: ab8500-debugfs: fix "between" in printk
memstick: trivial fix of spelling mistake on management
cpupowerutils: bench: fix "average"
treewide: Fix typos in printk
IB/mlx4: printk fix
pinctrl: sirf/atlas7: fix printk spelling
serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
w1: comment spelling s/minmum/minimum/
Blackfin: comment spelling s/divsor/divisor/
metag: Fix misspellings in comments.
ia64: Fix misspellings in comments.
hexagon: Fix misspellings in comments.
tools/perf: Fix misspellings in comments.
cris: Fix misspellings in comments.
c6x: Fix misspellings in comments.
blackfin: Fix misspelling of 'register' in comment.
avr32: Fix misspelling of 'definitions' in comment.
treewide: Fix typos in printk
Doc: treewide : Fix typos in DocBook/filesystem.xml
...
|
|
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct path' constification update from Al Viro:
"'struct path' is passed by reference to a bunch of Linux security
methods; in theory, there's nothing to stop them from modifying the
damn thing and LSM community being what it is, sooner or later some
enterprising soul is going to decide that it's a good idea.
Let's remove the temptation and constify all of those..."
* 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify ima_d_path()
constify security_sb_pivotroot()
constify security_path_chroot()
constify security_path_{link,rename}
apparmor: remove useless checks for NULL ->mnt
constify security_path_{mkdir,mknod,symlink}
constify security_path_{unlink,rmdir}
apparmor: constify common_perm_...()
apparmor: constify aa_path_link()
apparmor: new helper - common_path_perm()
constify chmod_common/security_path_chmod
constify security_sb_mount()
constify chown_common/security_path_chown
tomoyo: constify assorted struct path *
apparmor_path_truncate(): path->mnt is never NULL
constify vfs_truncate()
constify security_path_truncate()
[apparmor] constify struct path * in a bunch of helpers
|
|
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up.
After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
In a cluster failover scenario, it is desirable for the client to
attempt to reconnect quickly, as an alternate NFS server is already
waiting to take over for the down server. The client can't see that
a server IP address has moved to a new server until the existing
connection is gone.
For fabrics and devices where it is meaningful, set a definite upper
bound on the amount of time before it is determined that a
connection is no longer valid. This allows the RPC client to detect
connection loss in a timely manner, then perform a fresh resolution
of the server GUID in case it has changed (cluster failover).
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:
+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS
ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.
The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.
Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.
This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Move the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.
This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.
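The replacement collapses to a single core call (a sketch; ri_id
follows xprtrdma's struct rpcrdma_ia):
        /* wait for all posted Send and Receive WRs to flush */
        ib_drain_qp(ia->ri_id->qp);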
Signed-off-by: Chuck Lever <[email protected]>
Reviewed-By: Leon Romanovsky <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
rpcrdma_create_chunks() has been replaced, and can be removed.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
                dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
                        __func__);
                return -EIO;
        }
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Update documenting comments to reflect code changes over the past
year.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.
This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7c1 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7c1 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.
Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.
Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.
Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7c1 is applied to your
NFS server.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.
Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.
The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.
The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.
As more segments are added to the chunk lists, the header increases
in size. Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.
To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.
For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.
The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.
For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Currently the sysctls that allow setting the inline threshold allow
any value to be set.
Small values only make the transport run slower. The default 1KB
setting is as low as is reasonable. And the logic that decides how
to divide a Send buffer between RPC-over-RDMA header and RPC message
assumes (but does not check) that the lower bound is not crazy (say,
57 bytes).
Send and receive buffers share a page with some control information.
Values larger than about 3KB can't be supported, currently.
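A minimal sketch of the implied bounds enforcement (the constant names
are assumptions, not necessarily the exact patch):
        if (xprt_rdma_max_inline_read < RPCRDMA_MIN_INLINE)
                xprt_rdma_max_inline_read = RPCRDMA_MIN_INLINE;
        if (xprt_rdma_max_inline_read > RPCRDMA_MAX_INLINE)
                xprt_rdma_max_inline_read = RPCRDMA_MAX_INLINE;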
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.
Signed-off-by: Chuck Lever <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Merge tag 'qcom-soc-for-4.7-2' into net-next
This merges the Qualcomm SOC tree with the net-next, solving the
merge conflict in the SMD API between the two.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro.
This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.
The inode mutex is replaced by an rwsem, and serialization of lookups
of a single name is done by an "in-progress" dentry marker.
The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).
A more detailed explanation of the process from Al Viro:
"The xattr work starts with some acl fixes, then switches ->getxattr to
passing inode and dentry separately. This is the point where the
things start to get tricky - that got merged into the very beginning
of the -rc3-based #work.lookups, to allow untangling the
security_d_instantiate() mess. The xattr work itself proceeds to
switch a lot of filesystems to generic_...xattr(); no complications
there.
After that initial xattr work, the series then does the following:
- untangle security_d_instantiate()
- convert a bunch of open-coded lookup_one_len_unlocked() to calls of
that thing; one such place (in overlayfs) actually yields a trivial
conflict with overlayfs fixes later in the cycle - overlayfs ended
up switching to a variant of lookup_one_len_unlocked() sans the
permission checks. I would've dropped that commit (it gets
overridden on merge from #ovl-fixes in #for-next; proper resolution
is to use the variant in mainline fs/overlayfs/super.c), but I
didn't want to rebase the damn thing - it was fairly late in the
cycle...
- some filesystems had managed to depend on lookup/lookup exclusion
for *fs-internal* data structures in a way that would break if we
relaxed the VFS exclusion. Fixing hadn't been hard, fortunately.
- core of that series - parallel lookup machinery, replacing
->i_mutex with rwsem, making lookup_slow() take it only shared. At
that point lookups happen in parallel; lookups on the same name
wait for the in-progress one to be done with that dentry.
Surprisingly little code, at that - almost all of it is in
fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
making it use the new primitive and actually switching to locking
shared.
- parallel readdir stuff - first of all, we provide the exclusion on
per-struct file basis, same as we do for read() vs lseek() for
regular files. That takes care of most of the needed exclusion in
readdir/readdir; however, these guys are trickier than lookups, so
I went for switching them one-by-one. To do that, a new method
'->iterate_shared()' is added and filesystems are switched to it
as they are either confirmed to be OK with shared lock on directory
or fixed to be OK with that. I hope to kill the original method
come next cycle (almost all in-tree filesystems are switched
already), but it's still not quite finished.
- several filesystems get switched to parallel readdir. The
interesting part here is dealing with dcache preseeding by readdir;
that needs minor adjustment to be safe with directory locked only
shared.
Most of the filesystems doing that got switched to in those
commits. Important exception: NFS. Turns out that NFS folks, with
their, er, insistence on VFS getting the fuck out of the way of the
Smart Filesystem Code That Knows How And What To Lock(tm) have
grown the locking of their own. They had their own homegrown
rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
is the reader there). Of course, with VFS getting the fuck out of
the way, as requested, the actual smarts of the smart filesystem
code etc. had become exposed...
- do_last/lookup_open/atomic_open cleanups. As the result, open()
without O_CREAT locks the directory only shared. Including the
->atomic_open() case. Backmerge from #for-linus in the middle of
that - atomic_open() fix got brought in.
- then comes NFS switch to saner (VFS-based ;-) locking, killing the
homegrown "lookup and readdir are writers" kinda-sorta rwsem. All
exclusion for sillyunlink/lookup is done by the parallel lookups
mechanism. Exclusion between sillyunlink and rmdir is a real rwsem
now - rmdir being the writer.
Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
now.
- the rest of the series consists of switching a lot of filesystems
to parallel readdir; in a lot of cases ->llseek() gets simplified
as well. One backmerge in there (again, #for-linus - rockridge
fix)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
ext4: switch to ->iterate_shared()
hfs: switch to ->iterate_shared()
hfsplus: switch to ->iterate_shared()
hostfs: switch to ->iterate_shared()
hpfs: switch to ->iterate_shared()
hpfs: handle allocation failures in hpfs_add_pos()
gfs2: switch to ->iterate_shared()
f2fs: switch to ->iterate_shared()
afs: switch to ->iterate_shared()
befs: switch to ->iterate_shared()
befs: constify stuff a bit
isofs: switch to ->iterate_shared()
get_acorn_filename(): deobfuscate a bit
btrfs: switch to ->iterate_shared()
logfs: no need to lock directory in lseek
switch ecryptfs to ->iterate_shared
9p: switch to ->iterate_shared()
fat: switch to ->iterate_shared()
romfs, squashfs: switch to ->iterate_shared()
more trivial ->iterate_shared conversions
...
|
|
The problem is that fib_info->nh is a trailing zero-length array
([0]), so the struct fib_info allocation size depends on the number of
nexthops. If we just copy the fib_info, we do not copy the nexthop
info, and the driver accesses memory which is not ours.
Given the fact that fib4 does not defer operations and therefore does
not need the copy, just pass the pointer down to drivers as it was
done before.
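The root cause is the trailing variable-length array; a shallow copy
of the struct misses it (layout sketch):
        struct fib_info {
                /* ... */
                int             fib_nhs;
                struct fib_nh   fib_nh[0];      /* allocated as:
                                                 * sizeof(struct fib_info) +
                                                 * nhs * sizeof(struct fib_nh)
                                                 */
        };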
Fixes: 850d0cbc91 ("switchdev: remove pointers from switchdev objects")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
We saw the following extra refcount release on veth device:
kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:
CPU0:
tcf_mirred_release(): (in RCU callback)
        struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);

CPU1:
mirred_device_event():
        spin_lock_bh(&mirred_list_lock);
        list_for_each_entry(m, &mirred_list, tcfm_list) {
                if (rcu_access_pointer(m->tcfm_dev) == dev) {
                        dev_put(dev);
                        /* Note : no rcu grace period necessary, as
                         * net_device are already rcu protected.
                         */
                        RCU_INIT_POINTER(m->tcfm_dev, NULL);
                }
        }
        spin_unlock_bh(&mirred_list_lock);

CPU0:
tcf_mirred_release():
        spin_lock_bh(&mirred_list_lock);
        list_del(&m->tcfm_list);
        spin_unlock_bh(&mirred_list_lock);
        if (dev)              // <======== Still refers to the old m->tcfm_dev
                dev_put(dev); // <======== dev_put() is called on it again
The action init code path is good because it is impossible to modify
an action that is being removed.
So, fix this by moving everything under the spinlock.
Fixes: 2ee22a90c7af ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b850635 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|