aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2017-07-01sctp: remove the typedef sctp_sctphdr_tXin Long1-2/+2
This patch is to remove the typedef sctp_sctphdr_t, and replace with struct sctphdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type). Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert netpoll_info.refcnt from atomic_t to refcount_tReshetova, Elena1-1/+2
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert in_device.refcnt from atomic_t to refcount_tReshetova, Elena1-5/+6
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert ip_mc_list.refcnt from atomic_t to refcount_tReshetova, Elena1-1/+2
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert sock.sk_wmem_alloc from atomic_t to refcount_tReshetova, Elena1-1/+1
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_tReshetova, Elena1-2/+2
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert sk_buff.users from atomic_t to refcount_tReshetova, Elena1-5/+5
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-07-01net: convert nf_bridge_info.use from atomic_t to refcount_tReshetova, Elena1-3/+3
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-3/+6
A set of overlapping changes in macvlan and the rocker driver, nothing serious. Signed-off-by: David S. Miller <[email protected]>
2017-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-6/+6
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. This batch contains connection tracking updates for the cleanup iteration path, patches from Florian Westphal: X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set dying bit to let the CPU release them. X) Add nf_ct_iterate_destroy() to be used on module removal, to kill conntrack from all namespace. X) Restart iteration on hashtable resizing, since both may occur at the same time. X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT mapping on module removal. X) Use nf_ct_iterate_destroy() to remove conntrack entries helper module removal, from Liping Zhang. X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension if user requests this, also from Liping. X) Add net_ns_barrier() and use it from FTP helper, so make sure no concurrent namespace removal happens at the same time while the helper module is being removed. X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce module size. Same thing in nf_tables. Updates for the nf_tables infrastructure: X) Prepare usage of the extended ACK reporting infrastructure for nf_tables. X) Remove unnecessary forward declaration in nf_tables hash set. X) Skip set size estimation if number of element is not specified. X) Changes to accomodate a (faster) unresizable hash set implementation, for anonymous sets and dynamic size fixed sets with no timeouts. X) Faster lookup function for unresizable hash table for 2 and 4 bytes key. And, finally, a bunch of asorted small updates and cleanups: X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe to device events and look up for index from the packet path, this is fixing an issue that is present since the very beginning, patch from Xin Long. X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal. X) Use ebt_invalid_target() whenever possible in the ebtables tree, from Gao Feng. X) Calm down compilation warning in nf_dup infrastructure, patch from stephen hemminger. X) Statify functions in nftables rt expression, also from stephen. X) Update Makefile to use canonical method to specify nf_tables-objs. From Jike Song. X) Use nf_conntrack_helpers_register() in amanda and H323. X) Space cleanup for ctnetlink, from linzhang. ==================== Signed-off-by: David S. Miller <[email protected]>
2017-06-29bpf: Add syscall lookup support for fd array and htabMartin KaFai Lau1-0/+3
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS. The lookup returns a prog-id or map-id to the userspace. The userspace can then use the BPF_PROG_GET_FD_BY_ID or BPF_MAP_GET_FD_BY_ID to get a fd. Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-28block: provide bio_uninit() free freeing integrity/task associationsJens Axboe1-0/+1
Wen reports significant memory leaks with DIF and O_DIRECT: "With nvme devive + T10 enabled, On a system it has 256GB and started logging /proc/meminfo & /proc/slabinfo for every minute and in an hour it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute leaking. /proc/meminfo | grep SUnreclaim... SUnreclaim: 6752128 kB SUnreclaim: 6874880 kB SUnreclaim: 7238080 kB .... SUnreclaim: 22307264 kB SUnreclaim: 22485888 kB SUnreclaim: 22720256 kB When testcases with T10 enabled call into __blkdev_direct_IO_simple, code doesn't free memory allocated by bio_integrity_alloc. The patch fixes the issue. HTX has been run with +60 hours without failure." Since __blkdev_direct_IO_simple() allocates the bio on the stack, it doesn't go through the regular bio free. This means that any ancillary data allocated with the bio through the stack is not freed. Hence, we can leak the integrity data associated with the bio, if the device is using DIF/DIX. Fix this by providing a bio_uninit() and export it, so that we can use it to free this data. Note that this is a minimal fix for this issue. Any current user of bio's that are allocated outside of bio_alloc_bioset() suffers from this issue, most notably some drivers. We will fix those in a more comprehensive patch for 4.13. This also means that the commit marked as being fixed by this isn't the real culprit, it's just the most obvious one out there. Fixes: 542ff7bf18c6 ("block: new direct I/O implementation") Reported-by: Wen Xiong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-06-27net/mlx5e: IPSec, Innova IPSec offload infrastructureIlan Tayari2-4/+18
Add Innova IPSec ESP crypto offload configuration paths. Detect Innova IPSec device and set the NETIF_F_HW_ESP flag. Configure Security Associations using the API introduced in a previous patch. Add Software-parser hardware descriptor layout Software-Parser (swp) is a hardware feature in ConnectX which allows the host software to specify protocol header offsets in the TX path, thus overriding the hardware parser. This is useful for protocols that the ASIC may not be able to parse on its own. Note that due to inline metadata, XDP is not supported in Innova IPSec. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Yossi Kuperman <[email protected]> Signed-off-by: Yevgeny Kliteynik <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: Accel, Add IPSec acceleration interfaceIlan Tayari1-0/+67
Add routines for manipulating the hardware IPSec SA database (SADB). In Innova IPSec, a Security Association (SA) is added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Add implementation for Innova IPSec (FPGA-based) hardware. These routines will be used by the IPSec offload support in a later patch However they may also be used by others such as RDMA and RoCE IPSec. mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs to work directly with mlx5_core rather than Innova FPGA or other mlx5 acceleration providers. In this patchset we add Innova IPSec support and mlx5/accel delegates IPSec offloads to Innova routines. In the future, when IPSec/TLS or any other acceleration gets integrated into ConnectX chip, mlx5/accel layer will provide the integrated acceleration, rather than the Innova one. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: FPGA, Add SBU infrastructureIlan Tayari4-0/+18
Add interface to initialize and interact with Innova FPGA SBU connections. A client driver may use these functions to set up a high-speed DMA connection with its SBU hardware logic, and send/receive messages over this connection. A later patch in this patchset will make use of these functions for Innova IPSec offload in mlx5 Ethernet driver. Add commands to retrieve Innova FPGA SBU capabilities, and to read/write Innova FPGA configuration space registers and memory, over internal I2C. At high level, the FPGA configuration space is divided such: 0x00000000 - 0x007fffff is reserved for the SBU 0x00800000 - 0xffffffff is reserved for the Shell 0x400000000 - ... is DDR memory A later patchset will add support for accessing FPGA CrSpace and memory over a high-speed connection. This is the reason for the ACCESS_TYPE enumeration, which currently only supports I2C. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: FPGA, Add SBU bypass and reset flowsIlan Tayari1-0/+9
The Innova FPGA includes shell hardware and Sandbox-Unit (SBU) hardware. The shell hardware is handled by mlx5_core itself, while the SBU is handled by a client driver. Reset the SBU to a well-known initial state when initializing a new device, and set the FPGA to bypass mode when uninitializing a device. This allows the client driver to assume that its device has been reset when a new device is detected. During SBU reset, the FPGA is put into SBU-bypass mode. In this mode packets do not pass through the SBU, so it cannot affect the network data stream at all. A factory-image does not have an SBU, so skip these flows. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: FPGA, Add FW commands for FPGA QPsIlan Tayari2-0/+204
The FPGA QP is a high-bandwidth communication channel between the host CPU and the FPGA device. It allows performing DMA operations between host memory and the FPGA logic via the ConnectX chip. Add ConnectX FW commands which create and manipulate FPGA QPs. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: Add support for multiple RoCE enableIlan Tayari1-0/+1
Previously, only mlx5_ib enabled RoCE on the port, but FPGA needs it as well. Add support for counting number of enables, so that FPGA and IB can work in parallel and independently. Program the HW to enable RoCE on the first enable call, and program to disable RoCE on the last disable call. Signed-off-by: Ilan Tayari <[email protected]> Reviewed-by: Boris Pismenny <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-27net/mlx5: Add reserved-gids supportIlan Tayari1-0/+17
Reserved GIDs are entries in the GID table in use by the mlx5_core and its submodules (e.g. FPGA, SRIOV, E-Swtich, netdev). The entries are reserved at the high indexes of the GID table. A mlx5 submodule may reserve a certain amount of GIDs for its own use during the load sequence by calling mlx5_core_reserve_gids, and must also take care to un-reserve these GIDs when it closes. Reservation is only allowed during the load sequence and before any interfaces (e.g. mlx5_ib or mlx5_en) are up. After reservation, a submodule may call mlx5_core_reserved_gid_alloc/ free to allocate entries from the reserved GIDs pool. Reserve a GID table entry for every supported FPGA QP. A later patch in the patchset will remove them from being reported to IB core. Another such patch will make use of these for FPGA QPs in Innova NIC. Added lib/mlx5.h to serve as a library for mlx5 submodlues, and to expose only public mlx5 API, more mlx5 library files will be added in future submissions. Signed-off-by: Ilan Tayari <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-25Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A few fixes for timekeeping and timers: - Plug a subtle race due to a missing READ_ONCE() in the timekeeping code where reloading of a pointer results in an inconsistent callback argument being supplied to the clocksource->read function. - Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the time keeping core code, to prevent a possible discontuity. - Apply a similar fix to the arm64 vdso clock_gettime() implementation - Add missing includes to clocksource drivers, which relied on indirect includes which fails in certain configs. - Use the proper iomem pointer for read/iounmap in a probe function" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting time: Fix clock->read(clock) race around clocksource changes clocksource: Explicitly include linux/clocksource.h when needed clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
2017-06-25net: Remove ndo_dfwd_start_xmitMintz, Yuval1-9/+0
Looks like commit f663dd9aaf9e ("net: core: explicitly select a txq before doing l2 forwarding") has removed the need for this dedicated xmit function [it even explicitly states so in its commit log message] but it hasn't removed the definition of the ndo. Signed-off-by: Yuval Mintz <[email protected]> CC: Jason Wang <[email protected]> CC: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-23slub: make sysfs file removal asynchronousTejun Heo1-0/+1
Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") made slub sysfs file removals synchronous to kmem_cache shutdown. Unfortunately, this created a possible ABBA deadlock between slab_mutex and sysfs draining mechanism triggering the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 4.10.0-test+ #48 Not tainted ------------------------------------------------------- rmmod/1211 is trying to acquire lock: (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40 but task is already holding lock: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x75/0x950 mutex_lock_nested+0x1b/0x20 slab_attr_store+0x75/0xd0 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x13c/0x1c0 __vfs_write+0x28/0x120 vfs_write+0xc8/0x1e0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #0 (s_active#120){++++.+}: __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#120); lock(slab_mutex); lock(s_active#120); *** DEADLOCK *** 2 locks held by rmmod/1211: #0: (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80 #1: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 stack backtrace: CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: print_circular_bug+0x1be/0x210 __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 ? SyS_delete_module+0x5/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 It'd be the cleanest to deal with the issue by removing sysfs files without holding slab_mutex before the rest of shutdown; however, given the current code structure, it is pretty difficult to do so. This patch punts sysfs file removal to a work item. Before commit bf5eb3de3847, the removal was punted to a RCU delayed work item which is executed after release. Now, we're punting to a different work item on shutdown which still maintains the goal removing the sysfs files earlier when destroying kmem_caches. Link: http://lkml.kernel.org/r/[email protected] Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") Signed-off-by: Tejun Heo <[email protected]> Reported-by: Steven Rostedt (VMware) <[email protected]> Tested-by: Steven Rostedt (VMware) <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-06-23net: phy: Support "internal" PHY interfaceFlorian Fainelli1-0/+3
Now that the Device Tree binding has been updated, update the PHY library phy_interface_t and phy_modes to support the "internal" PHY interface type. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-23Merge tag 'mlx5-updates-2017-06-23' of ↵David S. Miller3-3/+106
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-06-23 This series provides some updates to the mlx5 core and netdevice drivers. Three patches from Tariq, Introduces page reuse mechanism in non-Striding RQ RX datapath, we allow the the RX descriptor to reuse its allocated page as much as it could, until the page is fully consumed. RX page reuse reduces the stress on page allocator and improves RX performance especially with high speeds (100Gb/s). Next four patches of the series from Or allows to offload tc flower matching on ttl/hoplimit and header re-write of hoplimit. The rest of the series from Yotam and Or enhances mlx5 to support FW flashing through the mlxfw module, in a similar manner done by the mlxsw driver. Currently, only ethtool based flashing is implemented, where both Eth and IB ports are supported. ==================== Signed-off-by: David S. Miller <[email protected]>
2017-06-23bpf: possibly avoid extra masking for narrower load in verifierYonghong Song2-2/+12
Commit 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") permits narrower load for certain ctx fields. The commit however will already generate a masking even if the prog-specific ctx conversion produces the result with narrower size. For example, for __sk_buff->protocol, the ctx conversion loads the data into register with 2-byte load. A narrower 2-byte load should not generate masking. For __sk_buff->vlan_present, the conversion function set the result as either 0 or 1, essentially a byte. The narrower 2-byte or 1-byte load should not generate masking. To avoid unnecessary masking, prog-specific *_is_valid_access now passes converted_op_size back to verifier, which indicates the valid data width after perceived future conversion. Based on this information, verifier is able to avoid unnecessary marking. Since we want more information back from prog-specific *_is_valid_access checking, all of them are packed into one data structure for more clarity. Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-23xdp: add reporting of offload modeJakub Kicinski1-3/+4
Extend the XDP_ATTACHED_* values to include offloaded mode. Let drivers report whether program is installed in the driver or the HW by changing the prog_attached field from bool to u8 (type of the netlink attribute). Exploit the fact that the value of XDP_ATTACHED_DRV is 1, therefore since all drivers currently assign the mode with double negation: mode = !!xdp_prog; no drivers have to be modified. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-23xdp: add HW offload mode flag for installing programsJakub Kicinski1-0/+1
Add an installation-time flag for requesting that the program be installed only if it can be offloaded to HW. Internally new command for ndo_xdp is added, this way we avoid putting checks into drivers since they all return -EINVAL on an unknown command. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-23xdp: pass XDP flags into install handlersJakub Kicinski1-0/+1
Pass XDP flags to the xdp ndo. This will allow drivers to look at the mode flags and make decisions about offload. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-22net/mlx5: Fix offset of hca cap reserved fieldOr Gerlitz1-1/+1
The offending commit pushed fwd the field by two bits but didn't increment the offset, fix that. Currently, no damage was done b/c this is just a field name, but lets have it right. Fixes: f32f5bd2eb7e ('net/mlx5: Configure cache line size for start and end padding') Signed-off-by: Or Gerlitz <[email protected]> Reported-by: Saeed Mahameed <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-22net/mlx5: Enhance MCAM reg to allow query on access reg supportOr Gerlitz2-0/+16
Enhance MCAM to allow the driver to query which access regs are supported. For now, expose the regs needed for FW flashing. Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Gal Pressman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-22net/mlx5: Add MCC (Management Component Control) register definitionsOr Gerlitz2-0/+85
MCC (Management Component Control) allows to control a firmware component update. MCDA (Management Component Data Access) allows to read and write a firmware component. MCQI (Management Component Query Information) allows to query information about firmware components. Signed-off-by: Or Gerlitz <[email protected]> Signed-off-by: Yotam Gigi <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-22net/mlx5e: Add header re-write offloading of IPv6 hop-limitOr Gerlitz1-0/+1
For environments where flow-based ipv6 router is offloaded. Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Paul Blakey <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-22net/mlx5e: Offload TC matching on ip ttlOr Gerlitz1-2/+3
Enable offloading of TC matching on ip ttl / hop-limit As matching on ttl is supported only by newer HW brands (ConnectX-5), we should do capability check before attempting to offload that. Signed-off-by: Or Gerlitz <[email protected]> Reviewed-by: Paul Blakey <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2017-06-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+2
Pull block fixes from Jens Axboe: "This contains a set of fixes for xen-blkback by way of Konrad, and a performance regression fix for blk-mq for shared tags. The latter could account for as much as a 50x reduction in performance, with the test case from the user with 500 name spaces. A more realistic setup on my end with 32 drives showed a 3.5x drop. The fix has been thoroughly tested before being committed" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix performance regression with shared tags xen-blkback: don't leak stack data via response ring xen/blkback: don't use xen_blkif_get() in xen-blkback kthread xen/blkback: don't free be structure too early xen/blkback: fix disconnect while I/Os in flight
2017-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-30/+30
Two entries being added at the same time to the IFLA policy table, whilst parallel bug fixes to decnet routing dst handling overlapping with the dst gc removal in net-next. Signed-off-by: David S. Miller <[email protected]>
2017-06-21qed*: Rename qed_roce_if.h to qed_rdma_if.hKalderon, Michal1-2/+2
Rename the qed_roce_if file to qed_rdma_if as it represents a common interface for RoCE and iWARP. this commit affects RDMA/qedr as well. Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Yuval Mintz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-21blk-mq: fix performance regression with shared tagsJens Axboe1-0/+2
If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices. Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to. Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS. Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Reported-by: Max Gurtovoy <[email protected]> Tested-by: Max Gurtovoy <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Tested-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-06-21qede: Fix compilation without QED_RDMAChad Dupuis1-1/+1
When CONFIG_QED_RDMA isn't defined, we'd hit the following: /include/linux/qed/qede_rdma.h:84:19: warning: ‘qede_rdma_dev_add’ used but never defined [enabled by default] static inline int qede_rdma_dev_add(struct qede_dev *dev); Fixes: bbfcd1e8e167 ("qed*: Set rdma generic functions prefix") Signed-off-by: Chad Dupuis <[email protected]> Signed-off-by: Yuval Mintz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-20net: introduce __skb_put_[zero, data, u8]yuan linyu1-0/+22
follow Johannes Berg, semantic patch file as below, @@ identifier p, p2; expression len; expression skb; type t, t2; @@ ( -p = __skb_put(skb, len); +p = __skb_put_zero(skb, len); | -p = (t)__skb_put(skb, len); +p = __skb_put_zero(skb, len); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, len); | -memset(p, 0, len); ) @@ identifier p; expression len; expression skb; type t; @@ ( -t p = __skb_put(skb, len); +t p = __skb_put_zero(skb, len); ) ... when != p ( -memset(p, 0, len); ) @@ type t, t2; identifier p, p2; expression skb; @@ t *p; ... ( -p = __skb_put(skb, sizeof(t)); +p = __skb_put_zero(skb, sizeof(t)); | -p = (t *)__skb_put(skb, sizeof(t)); +p = __skb_put_zero(skb, sizeof(t)); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, sizeof(*p)); | -memset(p, 0, sizeof(*p)); ) @@ expression skb, len; @@ -memset(__skb_put(skb, len), 0, len); +__skb_put_zero(skb, len); @@ expression skb, len, data; @@ -memcpy(__skb_put(skb, len), data, len); +__skb_put_data(skb, data, len); @@ expression SKB, C, S; typedef u8; identifier fn = {__skb_put}; fresh identifier fn2 = fn ## "_u8"; @@ - *(u8 *)fn(SKB, S) = C; + fn2(SKB, C); Signed-off-by: yuan linyu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-20qed*: Set rdma generic functions prefixMichal Kalderon1-18/+19
Rename the functions common to both iWARP and RoCE to have a prefix of _rdma_ instead of _roce_. Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Yuval Mintz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-20qed*: qede_roce.[ch] -> qede_rdma.[ch]Michal Kalderon1-0/+5
Once we have iWARP support, the qede portion of the qedr<->qede would serve all the RDMA protocols - so rename the file to be appropriate to its function. While we're at it, we're also moving a couple of inclusions to it into .h files and adding includes to make sure it contains all type definitions it requires. Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Yuval Mintz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-20qed: Chain support for external PBLMintz, Yuval2-1/+9
iWARP would require the chains to allocate/free their PBL memory independently, so add the infrastructure to provide it externally. Signed-off-by: Yuval Mintz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-20time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accountingJohn Stultz1-2/+2
Due to how the MONOTONIC_RAW accumulation logic was handled, there is the potential for a 1ns discontinuity when we do accumulations. This small discontinuity has for the most part gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen failures with the inconsistency-check test in kselftest. This patch addresses the issue by using the same sub-ns accumulation handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel users. Since the ARM64 vDSO implementation has its own clock_gettime calculation logic, this patch reduces the frequency of errors, but failures are still seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond xtime_nsec values in its calculation for this issue to be completely fixed. Signed-off-by: John Stultz <[email protected]> Tested-by: Daniel Mentz <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Will Deacon <[email protected]> Cc: "stable #4 . 8+" <[email protected]> Cc: Miroslav Lichvar <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-06-20time: Fix clock->read(clock) race around clocksource changesJohn Stultz1-1/+0
In tests, which excercise switching of clocksources, a NULL pointer dereference can be observed on AMR64 platforms in the clocksource read() function: u64 clocksource_mmio_readl_down(struct clocksource *c) { return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask; } This is called from the core timekeeping code via: cycle_now = tkr->read(tkr->clock); tkr->read is the cached tkr->clock->read() function pointer. When the clocksource is changed then tkr->clock and tkr->read are updated sequentially. The code above results in a sequential load operation of tkr->read and tkr->clock as well. If the store to tkr->clock hits between the loads of tkr->read and tkr->clock, then the old read() function is called with the new clock pointer. As a consequence the read() function dereferences a different data structure and the resulting 'reg' pointer can point anywhere including NULL. This problem was introduced when the timekeeping code was switched over to use struct tk_read_base. Before that, it was theoretically possible as well when the compiler decided to reload clock in the code sequence: now = tk->clock->read(tk->clock); Add a helper function which avoids the issue by reading tk_read_base->clock once into a local variable clk and then issue the read function via clk->read(clk). This guarantees that the read() function always gets the proper clocksource pointer handed in. Since there is now no use for the tkr.read pointer, this patch also removes it, and to address stopping the fast timekeeper during suspend/resume, it introduces a dummy clocksource to use rather then just a dummy read function. Signed-off-by: John Stultz <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: stable <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Daniel Mentz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-06-19netfilter: nfnetlink: extended ACK reportingPablo Neira Ayuso1-4/+6
Pass down struct netlink_ext_ack as parameter to all of our nfnetlink subsystem callbacks, so we can work on follow up patches to provide finer grain error reporting using the new infrastructure that 2d4bc93368f5 ("netlink: extended ACK reporting") provides. No functional change, just pass down this new object to callbacks. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-06-19netfilter: ebt: Use new helper ebt_invalid_target to check targetGao Feng1-2/+0
Use the new helper function ebt_invalid_target instead of the old macro INVALID_TARGET and other duplicated codes to enhance the readability. Signed-off-by: Gao Feng <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-06-19mm: larger stack guard gap, between vmasHugh Dickins1-28/+25
Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <[email protected]> Original-patch-by: Michal Hocko <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Tested-by: Helge Deller <[email protected]> # parisc Signed-off-by: Linus Torvalds <[email protected]>
2017-06-16Merge tag 'mlx5-updates-2017-06-16' of ↵David S. Miller3-7/+16
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== Mellanox mlx5 updates and cleanups 2017-06-16 mlx5-updates-2017-06-16 This series provide some updates and cleanups for mlx5 core and netdevice driver. From Eli Cohen, add a missing event string. From Or Gerlitz, some checkpatch cleanups. From Moni, Disalbe HW level LAG when SRIOV is enabled. From Tariq, A code reuse cleanup in aRFS flow. From Itay Aveksis, Typo fix. From Gal Pressman, ethtool statistics updates and "update stats" deferred work optimizations. From Majd Dibbiny, Fast unload support on kernel shutdown. ==================== Signed-off-by: David S. Miller <[email protected]>
2017-06-16net: Add IFLA_XDP_PROG_IDMartin KaFai Lau1-2/+5
Expose prog_id through IFLA_XDP_PROG_ID. This patch makes modification to generic_xdp. The later patches will modify other xdp-supported drivers. prog_id is added to struct net_dev_xdp. iproute2 patch will be followed. Here is how the 'ip link' will look like: > ip link show eth0 3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000 Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-06-16networking: add and use skb_put_u8()Johannes Berg1-0/+5
Joe and Bjørn suggested that it'd be nicer to not have the cast in the fairly common case of doing *(u8 *)skb_put(skb, 1) = c; Add skb_put_u8() for this case, and use it across the code, using the following spatch: @@ expression SKB, C, S; typedef u8; identifier fn = {skb_put}; fresh identifier fn2 = fn ## "_u8"; @@ - *(u8 *)fn(SKB, S) = C; + fn2(SKB, C); Note that due to the "S", the spatch isn't perfect, it should have checked that S is 1, but there's also places that use a sizeof expression like sizeof(var) or sizeof(u8) etc. Turns out that nobody ever did something like *(u8 *)skb_put(skb, 2) = c; which would be wrong anyway since the second byte wouldn't be initialized. Suggested-by: Joe Perches <[email protected]> Suggested-by: Bjørn Mork <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: David S. Miller <[email protected]>