aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-02-21Merge branch 'bpf-unlocking-fix'David S. Miller9-5/+22
Daniel Borkmann says: ==================== BPF fix with regards to unlocking This set fixes the issue Eric was reporting recently [1]. First patch is a prerequisite discussed with Laura that is needed for the later fix in the second one. I've tested this extensively and it does not reproduce anymore on my side after the fix. Thanks & sorry about that! [1] https://www.spinics.net/lists/netdev/msg421877.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21bpf: fix unlocking of jited image when module ronx not setDaniel Borkmann4-5/+14
Eric and Willem reported that they recently saw random crashes when JIT was in use and bisected this to 74451e66d516 ("bpf: make jited programs visible in traces"). Issue was that the consolidation part added bpf_jit_binary_unlock_ro() that would unlock previously made read-only memory back to read-write. However, DEBUG_SET_MODULE_RONX cannot be used for this to test for presence of set_memory_*() functions. We need to use ARCH_HAS_SET_MEMORY instead to fix this; also add the corresponding bpf_jit_binary_lock_ro() to filter.h. Fixes: 74451e66d516 ("bpf: make jited programs visible in traces") Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Bisected-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21arch: add ARCH_HAS_SET_MEMORY configDaniel Borkmann5-0/+8
Currently, there's no good way to test for the presence of set_memory_ro/rw/x/nx() helpers implemented by archs such as x86, arm, arm64 and s390. There's DEBUG_SET_MODULE_RONX and DEBUG_RODATA, however both don't really reflect that: set_memory_*() are also available even when DEBUG_SET_MODULE_RONX is turned off, and DEBUG_RODATA is set by parisc, but doesn't implement above functions. Thus, add ARCH_HAS_SET_MEMORY that is selected by mentioned archs, where generic code can test against this. This also allows later on to move DEBUG_SET_MODULE_RONX out of the arch specific Kconfig to define it only once depending on ARCH_HAS_SET_MEMORY. Suggested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: napi_watchdog() can use napi_schedule_irqoff()Eric Dumazet1-1/+1
hrtimer handlers run with masked hard IRQ, we can therefore use napi_schedule_irqoff() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"Eric Dumazet1-2/+2
This reverts commit e70ac171658679ecf6bea4bbd9e9325cd6079d2b. jtcp_rcv_established() is in fact called with hard irq being disabled. Initial bug report from Ricardo Nabinger Sanchez [1] still needs to be investigated, but does not look like a TCP bug. [1] https://www.spinics.net/lists/netdev/msg420960.html Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <xiaolong.ye@intel.com> Cc: Ricardo Nabinger Sanchez <rnsanchez@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net/hsr: use eth_hw_addr_random()Tobias Klauser1-1/+1
Use eth_hw_addr_random() to set a random MAC address in order to make sure dev->addr_assign_type will be properly set to NET_ADDR_RANDOM. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21Merge branch 'mvpp2-next'David S. Miller2-87/+128
Thomas Petazzoni says: ==================== net: mvpp2: misc improvements and preparation patches This series contains a number of fixes, misc improvements and preparation patches for an upcoming series that adds support for the new PPv2.2 network controller to the mvpp2 driver. The most significant improvements are: - Switching to using build_skb(), which is necessary for the upcoming PPv2.2 support, but anyway a good improvement to the current mvpp2 driver (supporting PPv2.1). - Making the driver build on 64-bit platforms. Changes since v3: - Addition of a patch "net: mvpp2: fix DMA address calculation in mvpp2_txq_inc_put()", which fixes a bug in the driver in the calculation of DMA addresses. This bug was found using DMA_API_DEBUG. - Modify the "net: mvpp2: switch to build_skb() in the RX path" patch to recalculate the fragment size when the MTU is changed in mvpp2_bm_update_mtu(). - Added Acked-by from Russell King on all patches, except: * "net: mvpp2: fix DMA address calculation in mvpp2_txq_inc_put()", because it's a new patch * "net: mvpp2: switch to build_skb() in the RX path" because I modified it since the v3. - Rebased on top of 4.10. Changes since v2: - Fix remaining 64-bit build warning, reported by David Miller. - Adjust how bit mask related definitions are done in "net: mvpp2: simplify MVPP2_PRS_RI_* definitions" according to Russell King suggestions. - Add a patch "net: mvpp2: remove useless arguments in mvpp2_rx_{pkts,time}_coal_set", suggested by Russell King. - Rework mvpp2_rx_time_coal_set() implementation to avoid overflows and rounding errors. I've used the implementation suggested by Russell King. Changes since v1: - This series is split as a separate series from the larger patch set adding support for PPv2.2 in the mvpp2 driver, as requested by David Miller. - Rebased on top of v4.10-rc1. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: enable building on 64-bit platformsThomas Petazzoni2-15/+19
The mvpp2 is going to be extended to support the Marvell Armada 7K/8K platform, which is ARM64. As a preparation to this work, this commit enables building the mvpp2 driver on ARM64, by: - Adjusting the Kconfig dependency - Fixing the types used in the driver so that they are 32/64-bits compliant. We use dma_addr_t for DMA addresses, and unsigned long for virtual addresses. It is worth mentioning that after this commit, the driver is for now still only used on 32-bits platforms, and will only work on 32-bits platforms. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: switch to build_skb() in the RX pathThomas Petazzoni1-22/+57
This commit adapts the mvpp2 RX path to use the build_skb() method. Not only build_skb() is now the recommended mechanism, but it also simplifies the addition of support for the PPv2.2 variant. Indeed, without build_skb(), we have to keep track for each RX descriptor of the physical address of the packet buffer, and the virtual address of the SKB. However, in PPv2.2 running on 64 bits platform, there is not enough space in the descriptor to store the virtual address of the SKB. So having to take care only of the address of the packet buffer, and building the SKB upon reception helps in supporting PPv2.2. The implementation is fairly straightforward: - mvpp2_skb_alloc() is renamed to mvpp2_buf_alloc() and no longer allocates a SKB. Instead, it allocates a buffer using the new mvpp2_frag_alloc() function, with enough space for the data and SKB. - The initialization of the RX buffers in mvpp2_bm_bufs_add() as well as the refill of the RX buffers in mvpp2_rx_refill() is adjusted accordingly. - Finally, the mvpp2_rx() is modified to use build_skb(). Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: simplify MVPP2_PRS_RI_* definitionsThomas Petazzoni1-8/+8
Some of the MVPP2_PRS_RI_* definitions use the ~(value) syntax, which doesn't compile nicely on 64-bit. Moreover, those definitions are in fact unneeded, since they are always used in combination with a bit mask that ensures only the appropriate bits are modified. Therefore, such definitions should just be set to 0x0. In addition, as suggested by Russell King, we change the _MASK definitions to also use the BIT() macro so that it is clear they are related to the values defined afterwards. For example: #define MVPP2_PRS_RI_L2_CAST_MASK 0x600 #define MVPP2_PRS_RI_L2_UCAST ~(BIT(9) | BIT(10)) #define MVPP2_PRS_RI_L2_MCAST BIT(9) #define MVPP2_PRS_RI_L2_BCAST BIT(10) becomes #define MVPP2_PRS_RI_L2_CAST_MASK (BIT(9) | BIT(10)) #define MVPP2_PRS_RI_L2_UCAST 0x0 #define MVPP2_PRS_RI_L2_MCAST BIT(9) #define MVPP2_PRS_RI_L2_BCAST BIT(10) Because the values (MVPP2_PRS_RI_L2_UCAST, MVPP2_PRS_RI_L2_MCAST and MVPP2_PRS_RI_L2_BCAST) are always applied with MVPP2_PRS_RI_L2_CAST_MASK, and therefore there is no need for MVPP2_PRS_RI_L2_UCAST to be defined as ~(BIT(9) | BIT(10)). It fixes the following warnings when building the driver on a 64-bit platform (which is not possible as of this commit, but will be enabled in a follow-up commit): drivers/net/ethernet/marvell/mvpp2.c: In function ‘mvpp2_prs_mac_promisc_set’: drivers/net/ethernet/marvell/mvpp2.c:524:33: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define MVPP2_PRS_RI_L2_UCAST ~(BIT(9) | BIT(10)) ^ drivers/net/ethernet/marvell/mvpp2.c:1459:33: note: in expansion of macro ‘MVPP2_PRS_RI_L2_UCAST’ mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_L2_UCAST, Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULTThomas Petazzoni1-1/+1
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: remove unused register definitionsThomas Petazzoni1-4/+0
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: simplify mvpp2_bm_bufs_add()Thomas Petazzoni1-3/+1
The mvpp2_bm_bufs_add() currently creates a fake cookie by calling mvpp2_bm_cookie_pool_set(), just to be able to call mvpp2_pool_refill(). But all what mvpp2_pool_refill() does is extract the pool ID from the cookie, and call mvpp2_bm_pool_put() with this ID. Instead of doing this convoluted thing, just call mvpp2_bm_pool_put() directly, since we have the BM pool ID. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: drop useless fields in mvpp2_bm_pool and related codeThomas Petazzoni1-15/+3
This commit drops dead code from the mvpp2 driver. The 'in_use' and 'in_use_thresh' fields of 'struct mvpp2_bm_pool' are incremented/decremented/initialized in various places. But they are only used in one place: if (is_recycle && (atomic_read(&bm_pool->in_use) < bm_pool->in_use_thresh)) return 0; However 'is_recycle', passed as argument to mvpp2_rx_refill() is always false. So in fact, this code is never reached, and the 'is_recycle' argument is useless. So let's drop this code. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'Thomas Petazzoni1-3/+0
This commit remove a field of 'struct mvpp2_tx_queue' that is not used anywhere. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: release reference to txq_cpu[] entry after unmappingThomas Petazzoni1-5/+4
The mvpp2_txq_bufs_free() function is called upon TX completion to DMA unmap TX buffers, and free the corresponding SKBs. It gets the references to the SKB to free and the DMA buffer to unmap from a per-CPU txq_pcpu data structure. However, the code currently increments the pointer to the next entry before doing the DMA unmap and freeing the SKB. It does not cause any visible problem because for a given SKB the TX completion is guaranteed to take place on the CPU where the TX was started. However, it is much more logical to increment the pointer to the next entry once the current entry has been completely unmapped/released. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()Thomas Petazzoni1-2/+29
When configuring the MVPP2_ISR_RX_THRESHOLD_REG with the RX coalescing time threshold, we do not check for the maximum allowed value supported by the driver, which means we might overflow and use a bogus value. This commit adds a check for this situation, and if a value higher than what is supported by the hardware is provided, then we use the maximum value supported by the hardware. In order to achieve this in a way that avoids overflow and rounding errors, we introduce two utility functions mvpp2_usec_to_cycles() and cycles_to_usec(). Many thanks to Russell King for suggesting this implementation. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()Thomas Petazzoni1-3/+4
Currently, mvpp2_rx_pkts_coal_set() does the following to avoid setting a too large value for the RX coalescing by packet number: val = (pkts & MVPP2_OCCUPIED_THRESH_MASK); This means that if you set a value that is slightly higher the the maximum number of packets, you in fact get a very low value. It makes a lot more sense to simply check if the value is too high, and if it's too high, limit it to the maximum possible value. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_setThomas Petazzoni1-12/+8
As noticed by Russell King, the last argument of mvpp2_rx_{pkts,time}_coal_set() is useless, since the packet/time coalescing value is already stored in the 'struct mvpp2_rx_queue *' passed as argument to these functions. So passing the packet/time value as an additional argument, and setting them again in the mvpp2_rx_queue structure is useles. This commit therefore gets rid of this additional argument, assuming the caller has assigned the appropriate value to rxq->pkts_coal or rxq->time_coal before calling the respective functions. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: mvpp2: fix DMA address calculation in mvpp2_txq_inc_put()Thomas Petazzoni1-1/+1
When TX descriptors are filled in, the buffer DMA address is split between the tx_desc->buf_phys_addr field (high-order bits) and tx_desc->packet_offset field (5 low-order bits). However, when we re-calculate the DMA address from the TX descriptor in mvpp2_txq_inc_put(), we do not take tx_desc->packet_offset into account. This means that when the DMA address is not aligned on a 32 bytes boundary, we end up calling dma_unmap_single() with a DMA address that was not the one returned by dma_map_single(). This inconsistency is detected by the kernel when DMA_API_DEBUG is enabled. We fix this problem by properly calculating the DMA address in mvpp2_txq_inc_put(). Cc: <stable@vger.kernel.org> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21macsec: fix validation failed in asynchronous operation.Lee Ryder1-0/+3
MACSec test failed when asynchronous crypto operations is used. It encounters packet validation failed since macsec_skb_cb(skb)->valid is always 'false'. This patch adds missing "macsec_skb_cb(skb)->valid = true" in macsec_decrypt_done() when "err == 0". Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21net: sock: Use USEC_PER_SEC macro instead of literal 1000000Gao Feng1-3/+3
The USEC_PER_SEC is used once in sock_set_timeout as the max value of tv_usec. But there are other similar codes which use the literal 1000000 in this file. It is minor cleanup to keep consitent. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21virtio-net: switch to use build_skb() for small bufferJason Wang1-75/+63
This patch switch to use build_skb() for small buffer which can have better performance for both TCP and XDP (since we can work at page before skb creation). It also remove lots of XDP codes since both mergeable and small buffer use page frag during refill now. Before | After XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21ip: fix IP_CHECKSUM handlingPaolo Abeni1-4/+4
The skbs processed by ip_cmsg_recv() are not guaranteed to be linear e.g. when sending UDP packets over loopback with MSGMORE. Using csum_partial() on [potentially] the whole skb len is dangerous; instead be on the safe side and use skb_checksum(). Thanks to syzkaller team to detect the issue and provide the reproducer. v1 -> v2: - move the variable declaration in a tighter scope Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21vxlan: remove unused variable saddr in neigh_reduceRoopa Prabhu1-2/+1
silences the below warning: drivers/net/vxlan.c: In function ‘neigh_reduce’: drivers/net/vxlan.c:1599:25: warning: variable ‘saddr’ set but not used [-Wunused-but-set-variable] Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21vxlan: add changelink supportRoopa Prabhu1-113/+270
This patch adds changelink rtnl op support for vxlan netdevs. code changes involve: - refactor vxlan_newlink into vxlan_nl2conf to be used by vxlan_newlink and vxlan_changelink - vxlan_nl2conf and vxlan_dev_configure take a changelink argument to isolate changelink checks and updates. - Allow changing only a few attributes: - return -EOPNOTSUPP for attributes that cannot be changed for now. Incremental patches can make the non-supported one available in the future if needed. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21rtnl: simplify error return path in rtnl_create_link()Tobias Klauser1-6/+1
There is only one possible error path which reaches the err label, so return ERR_PTR(-ENOMEM) directly if alloc_netdev_mqs() fails. This also allows to omit the err variable. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21Merge tag 'gfs2-4.11.fixes' of ↵Linus Torvalds8-60/+105
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Robert Peterson: "We've got eight GFS2 patches for this merge window: - Andy Price submitted a patch to make gfs2_write_full_page a static function. - Dan Carpenter submitted a patch to fix a ERR_PTR thinko. Three patches fix bugs related to deleting very large files, which cause GFS2 to run out of journal space: - The first one prevents GFS2 delete operation from requesting too much journal space. - The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations. - The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available. The final three patches are a performance improvement related to spin_lock contention between multiple writers: - The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races. - Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient. - Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal. Many thanks to Steve Whitehouse for the ideas and principles behind these patches" * tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Make gfs2_write_full_page static GFS2: Reduce contention on gfs2_log_lock GFS2: Inline function meta_lo_add GFS2: Switch tr_touched to flag in transaction GFS2: Wake up io waiters whenever a flush is done GFS2: Made logd daemon take into account log demand GFS2: Limit number of transaction blocks requested for truncates GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
2017-02-21Merge branch 'topic/zx' into for-linusVinod Koul3-5/+7
2017-02-21Merge branch 'topic/stm32-dma' into for-linusVinod Koul3-37/+63
2017-02-21Merge branch 'topic/ste' into for-linusVinod Koul1-2/+5
2017-02-21Merge branch 'for_linus' of ↵Linus Torvalds11-193/+205
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and cleanups from Jan Kara: "Several small UDF fixes and cleanups and a small cleanup of fanotify code" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: simplify the code of fanotify_merge udf: simplify udf_ioctl() udf: fix ioctl errors udf: allow implicit blocksize specification during mount udf: check partition reference in udf_read_inode() udf: atomically read inode size udf: merge module informations in super.c udf: remove next_epos from udf_update_extent_cache() udf: Factor out trimming of crtime udf: remove empty condition udf: remove unneeded line break udf: merge bh free udf: use pointer for kernel_long_ad argument udf: use __packed instead of __attribute__ ((packed)) udf: Make stat on symlink report symlink length as st_size fs/udf: make #ifdef UDF_PREALLOCATE unconditional fs: udf: Replace CURRENT_TIME with current_time()
2017-02-21Merge branch 'topic/intel' into for-linusVinod Koul6-67/+227
2017-02-21x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64Waiman Long2-0/+33
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socket 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patch with patch -------- --------- ---------- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/paravirt: Change vcp_is_preempted() arg type to longWaiman Long4-4/+4
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21KVM: VMX: use correct vmcs_read/write for guest segment selector/baseChao Peng1-2/+2
Guest segment selector is 16 bit field and guest segment base is natural width field. Fix two incorrect invocations accordingly. Without this patch, build fails when aggressive inlining is used with ICC. Cc: stable@vger.kernel.org Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/kvm/vmx: Defer TR reload after VM exitAndy Lutomirski4-14/+72
Intel's VMX is daft and resets the hidden TSS limit register to 0x67 on VMX reload, and the 0x67 is not configurable. KVM currently reloads TR using the LTR instruction on every exit, but this is quite slow because LTR is serializing. The 0x67 limit is entirely harmless unless ioperm() is in use, so defer the reload until a task using ioperm() is actually running. Here's some poorly done benchmarking using kvm-unit-tests: Before: cpuid 1313 vmcall 1195 mov_from_cr8 11 mov_to_cr8 17 inl_from_pmtimer 6770 inl_from_qemu 6856 inl_from_kernel 2435 outl_to_kernel 1402 After: cpuid 1291 vmcall 1181 mov_from_cr8 11 mov_to_cr8 16 inl_from_pmtimer 6457 inl_from_qemu 6209 inl_from_kernel 2339 outl_to_kernel 1391 Signed-off-by: Andy Lutomirski <luto@kernel.org> [Force-reload TR in invalidate_tss_limit. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tssAndy Lutomirski1-1/+1
Historically, the entire TSS + io bitmap structure was cacheline aligned, but commit ca241c75037b ("x86: unify tss_struct") changed it (presumably inadvertently) so that the fixed-layout hardware part is cacheline-aligned and the io bitmap is after the padding. This wastes 24 bytes (the hardware part should be 104 bytes, but this pads it to 128 bytes) and, serves no purpose, and causes sizeof(struct x86_hw_tss) to have a confusing value. Drop the pointless alignment. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/kvm/vmx: Simplify segment_base()Andy Lutomirski1-8/+7
Use actual pointer types for pointers (instead of unsigned long) and replace hardcoded constants with the appropriate self-documenting macros. The function is still a bit messy, but this seems a lot better than before to me. This is mostly borrowed from a patch by Thomas Garnier. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/kvm/vmx: Get rid of segment_base() on 64-bit kernelsAndy Lutomirski1-4/+7
It was a bit buggy (it didn't list all segment types that needed 64-bit fixups), but the bug was irrelevant because it wasn't called in any interesting context on 64-bit kernels and was only used for data segents on 32-bit kernels. To avoid confusion, make it explicitly 32-bit only. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/kvm/vmx: Don't fetch the TSS base from the GDTAndy Lutomirski1-10/+4
The current CPU's TSS base is a foregone conclusion, so there's no need to parse it out of the segment tables. This should save a couple cycles (as STR is surely microcoded and poorly optimized) but, more importantly, it's a cleanup and it means that segment_base() will never be called on 64-bit kernels. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21x86/asm: Define the kernel TSS limit in a macroAndy Lutomirski2-9/+11
Rather than open-coding the kernel TSS limit in set_tss_desc(), make it a real macro near the TSS layout definition. This is purely a cleanup. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-20virito-net: set queues after reset during xdp_setJason Wang1-20/+8
We set queues before reset which will cause a crash[1]. This is because is_xdp_raw_buffer_queue() depends on the old xdp queue pairs number to do the correct detection. So fix this by - passing xdp queue pairs and current queue pairs to virtnet_reset() - change vi->xdp_qp after reset but before refill, to make sure both free_unused_bufs() and refill can make correct detection of XDP. - remove the duplicated queue pairs setting before virtnet_reset() since we will do it inside virtnet_reset() [1] [ 74.328168] general protection fault: 0000 [#1] SMP [ 74.328625] Modules linked in: nfsd xfs libcrc32c virtio_net virtio_pci [ 74.329117] CPU: 0 PID: 2849 Comm: xdp2 Not tainted 4.10.0-rc7+ #499 [ 74.329577] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014 [ 74.330424] task: ffff88007a894000 task.stack: ffffc90004388000 [ 74.330844] RIP: 0010:skb_release_head_state+0x28/0x80 [ 74.331298] RSP: 0018:ffffc9000438b8d0 EFLAGS: 00010206 [ 74.331676] RAX: 0000000000000000 RBX: ffff88007ad96300 RCX: 0000000000000000 [ 74.332217] RDX: ffff88007fc137a8 RSI: ffff88007fc0db28 RDI: 0001bf00000001be [ 74.332758] RBP: ffffc9000438b8d8 R08: 000000000005008f R09: 00000000000005f9 [ 74.333274] R10: ffff88007d001700 R11: ffffffff820a8a4d R12: ffff88007ad96300 [ 74.333787] R13: 0000000000000002 R14: ffff880036604000 R15: 000077ff80000000 [ 74.334308] FS: 00007fc70d8a7b40(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 [ 74.334891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 74.335314] CR2: 00007fff4144a710 CR3: 000000007ab56000 CR4: 00000000003406f0 [ 74.335830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 74.336373] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 74.336895] Call Trace: [ 74.337086] skb_release_all+0xd/0x30 [ 74.337356] consume_skb+0x2c/0x90 [ 74.337607] free_unused_bufs+0x1ff/0x270 [virtio_net] [ 74.337988] ? vp_synchronize_vectors+0x3b/0x60 [virtio_pci] [ 74.338398] virtnet_xdp+0x21e/0x440 [virtio_net] [ 74.338741] dev_change_xdp_fd+0x101/0x140 [ 74.339048] do_setlink+0xcf4/0xd20 [ 74.339304] ? symcmp+0xf/0x20 [ 74.339529] ? mls_level_isvalid+0x52/0x60 [ 74.339828] ? mls_range_isvalid+0x43/0x50 [ 74.340135] ? nla_parse+0xa0/0x100 [ 74.340400] rtnl_setlink+0xd4/0x120 [ 74.340664] ? cpumask_next_and+0x30/0x50 [ 74.340966] rtnetlink_rcv_msg+0x7f/0x1f0 [ 74.341259] ? sock_has_perm+0x59/0x60 [ 74.341586] ? napi_consume_skb+0xe2/0x100 [ 74.342010] ? rtnl_newlink+0x890/0x890 [ 74.342435] netlink_rcv_skb+0x92/0xb0 [ 74.342846] rtnetlink_rcv+0x23/0x30 [ 74.343277] netlink_unicast+0x162/0x210 [ 74.343677] netlink_sendmsg+0x2db/0x390 [ 74.343968] sock_sendmsg+0x33/0x40 [ 74.344233] SYSC_sendto+0xee/0x160 [ 74.344482] ? SYSC_bind+0xb0/0xe0 [ 74.344806] ? sock_alloc_file+0x92/0x110 [ 74.345106] ? fd_install+0x20/0x30 [ 74.345360] ? sock_map_fd+0x3f/0x60 [ 74.345586] SyS_sendto+0x9/0x10 [ 74.345790] entry_SYSCALL_64_fastpath+0x1a/0xa9 [ 74.346086] RIP: 0033:0x7fc70d1b8f6d [ 74.346312] RSP: 002b:00007fff4144a708 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 74.346785] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fc70d1b8f6d [ 74.347244] RDX: 000000000000002c RSI: 00007fff4144a720 RDI: 0000000000000003 [ 74.347683] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [ 74.348544] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff4144bd90 [ 74.349082] R13: 0000000000000002 R14: 0000000000000002 R15: 00007fff4144cda0 [ 74.349607] Code: 00 00 00 55 48 89 e5 53 48 89 fb 48 8b 7f 58 48 85 ff 74 0e 40 f6 c7 01 74 3d 48 c7 43 58 00 00 00 00 48 8b 7b 68 48 85 ff 74 05 <f0> ff 0f 74 20 48 8b 43 60 48 85 c0 74 14 65 8b 15 f3 ab 8d 7e [ 74.351008] RIP: skb_release_head_state+0x28/0x80 RSP: ffffc9000438b8d0 [ 74.351625] ---[ end trace fe6e19fd11cfc80b ]--- Fixes: 2de2f7f40ef9 ("virtio_net: XDP support for adjust_head") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20dpaa_eth: implement ioctl() for PHY-related opsMichael Walle1-0/+8
This commit adds the ndo_do_ioctl() callback which allows the userspace to access PHY registers, for example. This will make mii-diag and similar tools work. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20net: qualcomm: qca: use new api ethtool_{get|set}_link_ksettingsPhilippe Reynes1-8/+10
The ethtool api {get|set}_settings is deprecated. We move this driver to new api {get|set}_link_ksettings. As I don't have the hardware, I'd be very pleased if someone may test this patch. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20Merge branch 'bnxt_en-probe-and-open-bugs'David S. Miller1-39/+43
Michael Chan says: ==================== bnxt_en: Fix probe and open bugs. Fix 3 issues related to bnxt_init_one() and bnxt_open(). Don't probe bridge devices and fixup some error code paths. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20bnxt_en: fix pci cleanup in bnxt_init_one() failure pathSathya Perla1-37/+39
In the bnxt_init_one() failure path, bar1 and bar2 are not being unmapped. This commit fixes this issue. Reorganize the code so that bnxt_init_one()'s failure path and bnxt_remove_one() can call the same function to do the PCI cleanup. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20bnxt_en: Fix NULL pointer dereference in a failure path during open.Michael Chan1-1/+3
If bnxt_hwrm_ring_free() is called during a failure path in bnxt_open(), it is possible that the completion rings have not been allocated yet. In that case, the completion doorbell has not been initialized, and calling bnxt_disable_int() will crash. Fix it by checking that the completion ring has been initialized before writing to the completion ring doorbell. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20bnxt_en: Reject driver probe against all bridge devicesRay Jui1-1/+1
There are additional SoC devices that use the same device ID for bridge and NIC devices. The bnxt driver should reject probe against all bridge devices since it's meant to be used with only endpoint devices. Signed-off-by: Ray Jui <ray.jui@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds19-479/+1582
Pull CIFS/SMB3 updates from Steve French: "Includes support for a critical SMB3 security feature: per-share encryption from Pavel, and a cleanup from Jean Delvare. Will have another cifs/smb3 merge next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Allow to switch on encryption with seal mount option CIFS: Add capability to decrypt big read responses CIFS: Decrypt and process small encrypted packets CIFS: Add copy into pages callback for a read operation CIFS: Add mid handle callback CIFS: Add transform header handling callbacks CIFS: Encrypt SMB3 requests before sending CIFS: Enable encryption during session setup phase CIFS: Add capability to transform requests before sending CIFS: Separate RFC1001 length processing for SMB2 read CIFS: Separate SMB2 sync header processing CIFS: Send RFC1001 length in a separate iov CIFS: Make send_cancel take rqst as argument CIFS: Make SendReceive2() takes resp iov CIFS: Separate SMB2 header structure CIFS: Fix splice read for non-cached files cifs: Add soft dependencies cifs: Only select the required crypto modules cifs: Simplify SMB2 and SMB311 dependencies