aboutsummaryrefslogtreecommitdiff
path: root/drivers/infiniband/hw
AgeCommit message (Collapse)AuthorFilesLines
2018-09-11RDMA/mlx5: Enable decap and packet reformat on flow tablesMark Bloch1-4/+13
If NIC RX flow tables support decap operation, enable it on creation, This allows to perform decapsulation of tunnelled packets by steering rules. If NIC TX flow tables support reformat operation, enable it on creation. We don't enable those capabilities on representors as the E-Switch should handle packet modification (can be configured via TC) and as current hardware can't handle both FDB and NIC flow tables with decap/packet reformat support. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-11RDMA/mlx5: Enable attaching modify header to steering flowsMark Bloch1-0/+8
When creating a flow steering rule, allow the user to attach a modify header action. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-11RDMA/mlx5: Add NIC TX steering supportMark Bloch2-10/+20
Just like ingress steering, allow a user to create steering rules that match egress vport traffic. We expose the same number of priorities as the bypass (NIC RX) steering. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-10infiniband: nes: Use skb_peek_next() and skb_queue_walk().David S. Miller1-5/+3
Instead of direct SKB list accesses. Signed-off-by: David S. Miller <[email protected]>
2018-09-06IB/mlx5: Don't hold spin lock while checking device stateParav Pandit1-14/+12
mdev->state device state is not protected by the QP for which WRs are being processed. Therefore, there is no need to hold spin lock while checking mdev state. Given that device fatal error is unlikely situation, wrap the condition check with unlikely(). Additionally, kernel QP1 is also a kernel ULP for which soft CQEs needs to be generated. Therefore, check for device fatal error before processing QP1 work requests. Fixes: 89ea94a7b6c4 ("IB/mlx5: Reset flow support for IB kernel ULPs") Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Daniel Jurgens <[email protected]> Reviewed-by: Maor Gottlieb <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-06RDMA/mlx4: Ensure that maximal send/receive SGE less than supported by HWLeon Romanovsky1-3/+5
In calculating the global maximum number of the Scatter/Gather elements supported, the following four maximum parameters must be taken into consideration: max_sg_rq, max_sg_sq, max_desc_sz_rq and max_desc_sz_sq. However instead of bringing this complexity to query_device, which still won't be sufficient anyway (the calculations are dependent on QP type), the safer approach will be to restore old code, which will give us 32 SGEs. Fixes: 33023fb85a42 ("IB/core: add max_send_sge and max_recv_sge attributes") Reported-by: Chuck Lever <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05Merge branch 'uverbs_dev_cleanups' into rdma.git for-nextJason Gunthorpe3-1/+9
For dependencies, branch based on rdma.git 'for-rc' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/ Pull 'uverbs_dev_cleanups' from Leon Romanovsky: ==================== Reuse the char device code interfaces to simplify ib_uverbs_device creation and destruction. As part of this series, we are sending fix to cleanup path, which was discovered during internal review, The fix definitely can go to -rc, but it means that this series will be dependent on rdma-rc. ==================== * branch 'uverbs_dev_cleanups': RDMA/uverbs: Use device.groups to initialize device attributes RDMA/uverbs: Use cdev_device_add() instead of cdev_add() RDMA/core: Depend on device_add() to add device attributes RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one() Resolved conflict in ib_device_unregister_sysfs() Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05bnxt_re: Fix couple of memory leaks that could lead to IOMMU call tracesSomnath Kotur2-1/+3
1. DMA-able memory allocated for Shadow QP was not being freed. 2. bnxt_qplib_alloc_qp_hdr_buf() had a bug wherein the SQ pointer was erroneously pointing to the RQ. But since the corresponding free_qp_hdr_buf() was correct, memory being free was less than what was allocated. Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Somnath Kotur <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/nes: Delete impossible debug printsLeon Romanovsky3-14/+0
The pci-core and net-core logic ensure that parameters provided to nes_probe() and nes_netdev_open() are valid, hence the assert print are not possible. Cc: Faisal Latif <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Dennis Dalessandro <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/qedr: remove set but not used variable 'ctx'YueHaibing1-2/+0
Fixes gcc '-Wunused-but-set-variable' warning: drivers/infiniband/hw/qedr/verbs.c: In function 'qedr_create_srq': drivers/infiniband/hw/qedr/verbs.c:1450:24: warning: variable 'ctx' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <[email protected]> Acked-by: Rahul Verma <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/bnxt_re: QPLIB: Add and use #define dev_fmt(fmt) "QPLIB: " fmtJoe Perches4-151/+116
Consistently use the "QPLIB: " prefix for dev_<level> logging. Miscellanea: o Add missing newlines to avoid possible message interleaving o Coalesce consecutive dev_<level> uses that emit a message header to avoid < 80 column lengths and mistakenly output on multiple lines o Reflow modified lines to use 80 columns where appropriate o Consistently use "%s: " where __func__ is output o QPLIB: is now always output immediately after the dev_<level> header Signed-off-by: Joe Perches <[email protected]> Acked-by: Selvin Xavier <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05Merge branch 'mlx5-flow-mutate' into rdma.git for-nextJason Gunthorpe4-4/+325
For dependencies, branch based on 'mellanox/mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git Pull Flow actions to mutate packets from Leon Romanovsky: ==================== This series exposes the ability to create flow actions which can mutate packet headers. We do that by exposing two new verbs: * modify header - can change existing packet headers. packet * reformat - can encapsulate or decapsulate a packet. Once created a flow action must be attached to a steering rule for it to take effect. The first 10 patches refactor mlx5_core code, rename internal structures to better reflect their operation and export needed functions so the RDMA side can allocate the action. The last 5 patches expose via the IOCTL infrastructure mlx5_ib methods which do the actual allocation of resources and return an handle to the user. A user of this API is expected to know how to work with the device's spec as the input to those function is HW depended. An example usage of the modify header action is routing, A user can create an action which edits the L2 header and decrease the TTL. An example usage of the packet reformat action is VXLAN encap/decap which is done by the HW. ==================== * branch 'mlx5-flow-mutate': RDMA/mlx5: Extend packet reformat verbs RDMA/mlx5: Add new flow action verb - packet reformat RDMA/uverbs: Add generic function to fill in flow action object RDMA/mlx5: Add a new flow action verb - modify header RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language net/mlx5: Export packet reformat alloc/dealloc functions net/mlx5: Pass a namespace for packet reformat ID allocation net/mlx5: Expose new packet reformat capabilities {net, RDMA}/mlx5: Rename encap to reformat packet net/mlx5: Move header encap type to IFC header file net/mlx5: Break encap/decap into two separated flow table creation flags net/mlx5: Add support for more namespaces when allocating modify header net/mlx5: Export modify header alloc/dealloc functions net/mlx5: Add proper NIC TX steering flow tables support net/mlx5: Cleanup flow namespace getter switch logic Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/mlx5: Extend packet reformat verbsMark Bloch2-0/+96
We expose new actions: L2_TO_L2_TUNNEL - A generic encap from L2 to L2, the data passed should be the encapsulating headers. L3_TUNNEL_TO_L2 - Will do decap where the inner packet starts from L3, the data should be mac or mac + vlan (14 or 18 bytes). L2_TO_L3_TUNNEL - Will do encap where is L2 of the original packet will not be included, the data should be the encapsulating header. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/mlx5: Add new flow action verb - packet reformatMark Bloch2-1/+76
For now, only add L2_TUNNEL_TO_L2 option. This will allow to perform generic decap operation if the encapsulating protocol is L2 based, and the inner packet is also L2 based. For example this can be used to decap VXLAN packets. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/uverbs: Add generic function to fill in flow action objectMark Bloch1-5/+3
Refactor the initialization of a flow action object to a common function. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05RDMA/mlx5: Add a new flow action verb - modify headerMark Bloch3-1/+153
Expose the ability to create a flow action which changes packet headers. The data passed from userspace should be modify header actions as defined by HW specification. Signed-off-by: Mark Bloch <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-05{net, RDMA}/mlx5: Rename encap to reformat packetMark Bloch1-3/+3
Renames all encap mlx5_{core,ib} code to use the new naming of packet reformat. This change doesn't introduce any function change and is needed to properly reflect the operation being done by this action. For example not only can we encapsulate a packet, but also decapsulate it. Signed-off-by: Mark Bloch <[email protected]> Reviewed-by: Saeed Mahameed <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]>
2018-09-04IB/mlx5: Change TX affinity assignment in RoCE LAG modeMajd Dibbiny3-5/+44
In the current code, the TX affinity is per RoCE device, which can cause unfairness between different contexts. e.g. if we open two contexts, and each open 10 QPs concurrently, all of the QPs of the first context might end up on the first port instead of distributed on the two ports as expected To overcome this unfairness between processes, we maintain per device TX affinity, and per process TX affinity. The allocation algorithm is as follow: 1. Hold two tx_port_affinity atomic variables, one per RoCE device and one per ucontext. Both initialized to 0. 2. In mlx5_ib_alloc_ucontext do: 2.1. ucontext.tx_port_affinity = device.tx_port_affinity 2.2. device.tx_port_affinity += 1 3. In modify QP INIT2RST: 3.1. qp.tx_port_affinity = ucontext.tx_port_affinity % MLX5_PORT_NUM 3.2. ucontext.tx_port_affinity += 1 Signed-off-by: Majd Dibbiny <[email protected]> Reviewed-by: Moni Shoua <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-04iw_cxgb4: only allow 1 flush on user qpsSteve Wise1-0/+6
Once the qp has been flushed, it cannot be flushed again. The user qp flush logic wasn't enforcing it however. The bug can cause touch-after-free crashes like: Unable to handle kernel paging request for data at address 0x000001ec Faulting instruction address: 0xc008000016069100 Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c008000016069100] flush_qp+0x80/0x480 [iw_cxgb4] LR [c00800001606cd6c] c4iw_modify_qp+0x71c/0x11d0 [iw_cxgb4] Call Trace: [c00800001606cd6c] c4iw_modify_qp+0x71c/0x11d0 [iw_cxgb4] [c00800001606e868] c4iw_ib_modify_qp+0x118/0x200 [iw_cxgb4] [c0080000119eae80] ib_security_modify_qp+0xd0/0x3d0 [ib_core] [c0080000119c4e24] ib_modify_qp+0xc4/0x2c0 [ib_core] [c008000011df0284] iwcm_modify_qp_err+0x44/0x70 [iw_cm] [c008000011df0fec] destroy_cm_id+0xcc/0x370 [iw_cm] [c008000011ed4358] rdma_destroy_id+0x3c8/0x520 [rdma_cm] [c0080000134b0540] ucma_close+0x90/0x1b0 [rdma_ucm] [c000000000444da4] __fput+0xe4/0x2f0 So fix flush_qp() to only flush the wq once. Cc: [email protected] Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-01IB/hfi1: Move URGENT IRQ enable to hfi1_rcvctrl()Michael J. Ruhl3-8/+12
User contexts use the receive URGENT interrupt. However, enabling the IRQ SRC in the file_ops module is not as clean as it could be. Augment the _rcvctl() function to be able to enable/disable the IRQ source. Use the new interface from file_ops to enable/disable the IRQ. Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Rework the IRQ API to be more flexibleMichael J. Ruhl6-104/+162
The current IRQ API is an all or nothing interface. This has two problems: 1. All IRQs are enabled regardless of use 2. Moving from general interrupt to MSIx handling is difficult Introduce a new API to enable/disable specific IRQs or a range of IRQs. Do not enable and disable all IRQs in one step. Rework various modules to enable/disable IRQs when needed. Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: PCIe bus width retryKamenee Arumugam1-3/+8
Retry the PCIe link training up to 'pcie_retry' times if the PCIe link width is narrower than the previous width. Reviewed-by: Michael J. Ruhl <[email protected]> Reviewed-by: Mitko Haralanov <[email protected]> Signed-off-by: Kamenee Arumugam <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Make the MSIx resource allocation a bit more flexibleMichael J. Ruhl8-243/+295
The current method of allocating MSIx resources is a bit cumbersome, and not very easily added to. Refactor and re-order the code paths into a more consistent interface. Update the interface so that allocations are not order dependent. Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Prepare for new HFI1 MSIx APIMichael J. Ruhl7-295/+414
The current HFI1 MSIx API is difficult to follow, change, or add to. In anticipation of moving to an more flexible API, move the current MSIx functionality to the new msix.c module. Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Get the hfi1_devdata structure as early as possibleMichael J. Ruhl4-86/+59
Currently several things occur before the hfi1_devdata structure is allocated. This leads to an inconsistent logging ability and makes it more difficult to restructure some code paths. Allocate (and do a minimal init) the structure as soon as possible. Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: tune_pcie_caps is arbitrarily placed, poorlyMichael J. Ruhl3-10/+15
The tune_pcie_caps needs to occur sometime after PCI is enabled, but before the HFI is enabled. Currently it is placed in the MSIx allocation code which doesn't really fit. Moving it to just after the gen3 bump. Clean up the associated code (modules, etc.). Reviewed-by: Mike Marciniszyn <[email protected]> Reviewed-by: Sadanand Warrier <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Remove duplicated definesMichael J. Ruhl1-10/+0
TXREQ defines are duplicated, incompletely, in the sdma header file. Remove duplicate defines. Reviewed-by: Mike Marciniszyn <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-01IB/hfi1: Rework file list in MakefileDennis Dalessandro1-6/+34
We want to keep files in alphabetical order in our makefile, however this just makes for messy diffs when adding (or removing) files. Let's just clean this up and make it line by line. Reviewed-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-08-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-3/+21
Pull more rdma updates from Jason Gunthorpe: "This is the SMC cleanup promised, a randconfig regression fix, and kernel oops fix. Summary: - Switch SMC over to rdma_get_gid_attr and remove the compat - Fix a crash in HFI1 with some BIOS's - Fix a randconfig failure" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/ucm: fix UCM link error IB/hfi1: Invalid NUMA node information can cause a divide by zero RDMA/smc: Replace ib_query_gid with rdma_get_gid_attr
2018-08-22mm, oom: distinguish blockable mode for mmu notifiersMichal Hocko2-5/+8
There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [[email protected]: coding style fixes] [[email protected]: minor code simplification] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Christian König <[email protected]> # AMD notifiers Acked-by: Leon Romanovsky <[email protected]> # mlx and umem_odp Reported-by: David Rientjes <[email protected]> Cc: "David (ChunMing) Zhou" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Alex Deucher <[email protected]> Cc: David Airlie <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: Sudeep Dutt <[email protected]> Cc: Ashutosh Dixit <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Juergen Gross <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Felix Kuehling <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-20IB/hfi1: Invalid NUMA node information can cause a divide by zeroMichael J. Ruhl1-3/+21
If the system BIOS does not supply NUMA node information to the PCI devices, the NUMA node is selected by choosing the current node. This can lead to the following crash: divide error: 0000 SMP CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G IOE ------------ 3.10.0-693.21.1.el7.x86_64 #1 Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.01.01.0005.101720141054 10/17/2014 Workqueue: events work_for_cpu_fn task: ffff880174480fd0 ti: ffff880174488000 task.ti: ffff880174488000 RIP: 0010: [<ffffffffc020ac69>] hfi1_dev_affinity_init+0x129/0x6a0 [hfi1] RSP: 0018:ffff88017448bbf8 EFLAGS: 00010246 RAX: 0000000000000011 RBX: ffff88107ffba6c0 RCX: ffff88085c22e130 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880824ad0000 RBP: ffff88017448bc48 R08: 0000000000000011 R09: 0000000000000002 R10: ffff8808582b6ca0 R11: 0000000000003151 R12: ffff8808582b6ca0 R13: ffff8808582b6518 R14: ffff8808582b6010 R15: 0000000000000012 FS: 0000000000000000(0000) GS:ffff88085ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007efc707404f0 CR3: 0000000001a02000 CR4: 00000000001607f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: hfi1_init_dd+0x14b3/0x27a0 [hfi1] ? pcie_capability_write_word+0x46/0x70 ? hfi1_pcie_init+0xc0/0x200 [hfi1] do_init_one+0x153/0x4c0 [hfi1] ? sched_clock_cpu+0x85/0xc0 init_one+0x1b5/0x260 [hfi1] local_pci_probe+0x4a/0xb0 work_for_cpu_fn+0x1a/0x30 process_one_work+0x17f/0x440 worker_thread+0x278/0x3c0 ? manage_workers.isra.24+0x2a0/0x2a0 kthread+0xd1/0xe0 ? insert_kthread_work+0x40/0x40 ret_from_fork+0x77/0xb0 ? insert_kthread_work+0x40/0x40 If the BIOS is not supplying NUMA information: - set the default table count to 1 for all possible nodes - select node 0 (instead of current NUMA) node to get consistent performance - generate an error indicating that the BIOS should be upgraded Reviewed-by: Gary Leshner <[email protected]> Reviewed-by: Mike Marciniszyn <[email protected]> Signed-off-by: Michael J. Ruhl <[email protected]> Signed-off-by: Dennis Dalessandro <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-16Merge branch 'linus/master' into rdma.git for-nextJason Gunthorpe3-10/+6
rdma.git merge resolution for the 4.19 merge window Conflicts: drivers/infiniband/core/rdma_core.c - Use the rdma code and revise with the new spelling for atomic_fetch_add_unless drivers/nvme/host/rdma.c - Replace max_sge with max_send_sge in new blk code drivers/nvme/target/rdma.c - Use the blk code and revise to use NULL for ib_post_recv when appropriate - Replace max_sge with max_recv_sge in new blk code net/rds/ib_send.c - Use the net code and revise to use NULL for ib_post_recv when appropriate Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-16Merge tag 'v4.18' into rdma.git for-nextJason Gunthorpe10-34/+52
Resolve merge conflicts from the -rc cycle against the rdma.git tree: Conflicts: drivers/infiniband/core/uverbs_cmd.c - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next - Merge removal of file->ucontext in for-next with new code in -rc drivers/infiniband/core/uverbs_main.c - for-next removed code from ib_uverbs_write() that was modified in for-rc Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-16Merge tag 'pci-v4.19-changes' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci updates from Bjorn Helgaas: - Decode AER errors with names similar to "lspci" (Tyler Baicar) - Expose AER statistics in sysfs (Rajat Jain) - Clear AER status bits selectively based on the type of recovery (Oza Pawandeep) - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru Gagniuc) - Don't clear AER status bits if we're using the "Firmware-First" strategy where firmware owns the registers (Alexandru Gagniuc) - Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy Shevchenko) - Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas) - Defer DPC event handling to work queue (Keith Busch) - Use threaded IRQ for DPC bottom half (Keith Busch) - Print AER status while handling DPC events (Keith Busch) - Work around IDT switch ACS Source Validation erratum (James Puthukattukaran) - Emit diagnostics for all cases of PCIe Link downtraining (Links operating slower than they're capable of) (Alexandru Gagniuc) - Skip VFs when configuring Max Payload Size (Myron Stowe) - Reduce Root Port Max Payload Size if necessary when hot-adding a device below it (Myron Stowe) - Simplify SHPC existence/permission checks (Bjorn Helgaas) - Remove hotplug sample skeleton driver (Lukas Wunner) - Convert pciehp to threaded IRQ handling (Lukas Wunner) - Improve pciehp tolerance of missed events and initially unstable links (Lukas Wunner) - Clear spurious pciehp events on resume (Lukas Wunner) - Add pciehp runtime PM support, including for Thunderbolt controllers (Lukas Wunner) - Support interrupts from pciehp bridges in D3hot (Lukas Wunner) - Mark fall-through switch cases before enabling -Wimplicit-fallthrough (Gustavo A. R. Silva) - Move DMA-debug PCI init from arch code to PCI core (Christoph Hellwig) - Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is supplied (Heiner Kallweit) - Unify PCI and DMA direction #defines (Shunyong Yang) - Add PCI_DEVICE_DATA() macro (Andy Shevchenko) - Check for VPD completion before checking for timeout (Bert Kenward) - Limit Netronome NFP5000 config space size to work around erratum (Jakub Kicinski) - Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit) - Document ACPI description of PCI host bridges (Bjorn Helgaas) - Add "pci=disable_acs_redir=" parameter to disable ACS redirection for peer-to-peer DMA support (we don't have the peer-to-peer support yet; this is just one piece) (Logan Gunthorpe) - Clean up devm_of_pci_get_host_bridge_resources() resource allocation (Jan Kiszka) - Fixup resizable BARs after suspend/resume (Christian König) - Make "pci=earlydump" generic (Sinan Kaya) - Fix ROM BAR access routines to stay in bounds and check for signature correctly (Rex Zhu) - Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer) - Expand documentation for pci_add_dma_alias() (Logan Gunthorpe) - To avoid bus errors, enable PASID only if entire path supports End-End TLP prefixes (Sinan Kaya) - Unify slot and bus reset functions and remove hotplug knowledge from callers (Sinan Kaya) - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to fix guest reboot issues (Alex Williamson) - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD Controller (Bjorn Helgaas) - Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt) - Remove Aardvark outbound window configuration (Evan Wang) - Fix Aardvark bridge window sizing issue (Zachary Zhang) - Convert Aardvark to use pci_host_probe() to reduce code duplication (Thomas Petazzoni) - Correct the Cadence cdns_pcie_writel() signature (Alan Douglas) - Add Cadence support for optional generic PHYs (Alan Douglas) - Add Cadence power management ops (Alan Douglas) - Remove redundant variable from Cadence driver (Colin Ian King) - Add Kirin MSI support (Xiaowei Song) - Drop unnecessary root_bus_nr setting from exynos, imx6, keystone, armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn Guo) - Move link notification settings from DesignWare core to individual drivers (Gustavo Pimentel) - Add endpoint library MSI-X interfaces (Gustavo Pimentel) - Correct signature of endpoint library IRQ interfaces (Gustavo Pimentel) - Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel) - Add endpoint library MSI-X test support (Gustavo Pimentel) - Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation (Jia-Ju Bai) - Add more devices to Broadcom PAXC quirk (Ray Jui) - Work around corrupted Broadcom PAXC config space to enable SMMU and GICv3 ITS (Ray Jui) - Disable MSI parsing to work around broken Broadcom PAXC logic in some devices (Ray Jui) - Hide unconfigured functions to work around a Broadcom PAXC defect (Ray Jui) - Lower iproc log level to reduce console output during boot (Ray Jui) - Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi) - Fix mobiveil missing include file (Lorenzo Pieralisi) - Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi) - Fix mvebu I/O space remapping issues (Thomas Petazzoni) - Use generic pci_host_bridge in mvebu instead of ARM-specific API (Thomas Petazzoni) - Whitelist VMD devices with fast interrupt handlers to avoid sharing vectors with slow handlers (Keith Busch) * tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits) PCI/AER: Don't clear AER bits if error handling is Firmware-First PCI: Limit config space size for Netronome NFP5000 PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips PCI/VPD: Check for VPD access completion before checking for timeout PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry PCI: Match Root Port's MPS to endpoint's MPSS as necessary PCI: Skip MPS logic for Virtual Functions (VFs) PCI: Add function 1 DMA alias quirk for Marvell 88SS9183 PCI: Check for PCIe Link downtraining PCI: Add ACS Redirect disable quirk for Intel Sunrise Point PCI: Add device-specific ACS Redirect disable infrastructure PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support PCI: Allow specifying devices using a base bus and path of devfns PCI: Make specifying PCI devices in kernel parameters reusable PCI: Hide ACS quirk declarations inside PCI core PCI: Delay after FLR of Intel DC P3700 NVMe PCI: Disable Samsung SM961/PM961 NVMe before FLR PCI: Export pcie_has_flr() PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers() ...
2018-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds4-2/+18
Pull networking updates from David Miller: "Highlights: - Gustavo A. R. Silva keeps working on the implicit switch fallthru changes. - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From Luca Coelho. - Re-enable ASPM in r8169, from Kai-Heng Feng. - Add virtual XFRM interfaces, which avoids all of the limitations of existing IPSEC tunnels. From Steffen Klassert. - Convert GRO over to use a hash table, so that when we have many flows active we don't traverse a long list during accumluation. - Many new self tests for routing, TC, tunnels, etc. Too many contributors to mention them all, but I'm really happy to keep seeing this stuff. - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu. - Lots of cleanups and fixes in L2TP code from Guillaume Nault. - Add IPSEC offload support to netdevsim, from Shannon Nelson. - Add support for slotting with non-uniform distribution to netem packet scheduler, from Yousuk Seung. - Add UDP GSO support to mlx5e, from Boris Pismenny. - Support offloading of Team LAG in NFP, from John Hurley. - Allow to configure TX queue selection based upon RX queue, from Amritha Nambiar. - Support ethtool ring size configuration in aquantia, from Anton Mikaev. - Support DSCP and flowlabel per-transport in SCTP, from Xin Long. - Support list based batching and stack traversal of SKBs, this is very exciting work. From Edward Cree. - Busyloop optimizations in vhost_net, from Toshiaki Makita. - Introduce the ETF qdisc, which allows time based transmissions. IGB can offload this in hardware. From Vinicius Costa Gomes. - Add parameter support to devlink, from Moshe Shemesh. - Several multiplication and division optimizations for BPF JIT in nfp driver, from Jiong Wang. - Lots of prepatory work to make more of the packet scheduler layer lockless, when possible, from Vlad Buslov. - Add ACK filter and NAT awareness to sch_cake packet scheduler, from Toke Høiland-Jørgensen. - Support regions and region snapshots in devlink, from Alex Vesker. - Allow to attach XDP programs to both HW and SW at the same time on a given device, with initial support in nfp. From Jakub Kicinski. - Add TLS RX offload and support in mlx5, from Ilya Lesokhin. - Use PHYLIB in r8169 driver, from Heiner Kallweit. - All sorts of changes to support Spectrum 2 in mlxsw driver, from Ido Schimmel. - PTP support in mv88e6xxx DSA driver, from Andrew Lunn. - Make TCP_USER_TIMEOUT socket option more accurate, from Jon Maxwell. - Support for templates in packet scheduler classifier, from Jiri Pirko. - IPV6 support in RDS, from Ka-Cheong Poon. - Native tproxy support in nf_tables, from Máté Eckl. - Maintain IP fragment queue in an rbtree, but optimize properly for in-order frags. From Peter Oskolkov. - Improvde handling of ACKs on hole repairs, from Yuchung Cheng" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits) bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT" hv/netvsc: Fix NULL dereference at single queue mode fallback net: filter: mark expected switch fall-through xen-netfront: fix warn message as irq device name has '/' cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0 net: dsa: mv88e6xxx: missing unlock on error path rds: fix building with IPV6=m inet/connection_sock: prefer _THIS_IP_ to current_text_addr net: dsa: mv88e6xxx: bitwise vs logical bug net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd() ieee802154: hwsim: using right kind of iteration net: hns3: Add vlan filter setting by ethtool command -K net: hns3: Set tx ring' tc info when netdev is up net: hns3: Remove tx ring BD len register in hns3_enet net: hns3: Fix desc num set to default when setting channel net: hns3: Fix for phy link issue when using marvell phy driver net: hns3: Fix for information of phydev lost problem when down/up net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero net: hns3: Add support for serdes loopback selftest bnxt_en: take coredump_record structure off stack ...
2018-08-15RDMA/hns: Fix usage of bitmap allocation functions return valuesGal Pressman2-2/+5
hns bitmap allocation functions return 0 on success and -1 on failure. Callers of these functions wrongly used their return value as an errno, fix that by making a proper conversion. Fixes: a598c6f4c5a8 ("IB/hns: Simplify function of pd alloc and qp alloc") Signed-off-by: Gal Pressman <[email protected]> Acked-by: Lijun Ou <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-14qedr: Add user space support for SRQYuval Bason2-44/+191
This patch adds support for SRQ's created in user space and update qedr_affiliated_event to deal with general SRQ events. Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Yuval Bason <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-14qedr: Add support for kernel mode SRQ'sYuval Bason5-13/+458
Implement the SRQ specific verbs and update the poll_cq verb to deal with SRQ completions. Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Yuval Bason <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-14qedr: Add wrapping generic structure for qpidr and adjust idr routines.Yuval Bason4-29/+31
Today, we are using idr mechanism for QP's only. This patch prepares the qedr_idr stuctures and the idr routines for both QP's and SRQ's. Signed-off-by: Yuval Bason <[email protected]> Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-14IB/mlx5: Fix leaking stack memory to userspaceJason Gunthorpe1-1/+1
mlx5_ib_create_qp_resp was never initialized and only the first 4 bytes were written. Fixes: 41d902cb7c32 ("RDMA/mlx5: Fix definition of mlx5_ib_create_qp_resp") Cc: <[email protected]> Acked-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-13IB/uverbs: Use uverbs_alloc for allocationsJason Gunthorpe1-53/+30
Several handlers need temporary allocations for the life of the method, switch them to use the uverbs_alloc allocator. Signed-off-by: Jason Gunthorpe <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]>
2018-08-10IB/uverbs: Have the core code create the uverbs_root_specJason Gunthorpe2-29/+17
There is no reason for drivers to do this, the core code should take of everything. The drivers will provide their information from rodata to describe their modifications to the core's base uapi specification. The core uses this to build up the runtime uapi for each device. Signed-off-by: Jason Gunthorpe <[email protected]> Reviewed-by: Michael J. Ruhl <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]>
2018-08-08iw_cxgb4: pass window scale in flowc work requestPotnuri Bharat Teja2-24/+28
This will allow FW to not send more data to TP (which would then need to be buffered). Pass the negotiated TCP window scale to FW in the FLOWC WR. Also refactor send_flowc() a bit to clean it up. Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Potnuri Bharat Teja <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-08RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wqLeon Romanovsky1-1/+3
[ 61.182439] UBSAN: Undefined behaviour in drivers/infiniband/hw/mlx5/qp.c:5366:34 [ 61.183673] shift exponent 4294967288 is too large for 32-bit type 'unsigned int' [ 61.185530] CPU: 0 PID: 639 Comm: qp Not tainted 4.18.0-rc1-00037-g4aa1d69a9c60-dirty #96 [ 61.186981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 [ 61.188315] Call Trace: [ 61.188661] dump_stack+0xc7/0x13b [ 61.190427] ubsan_epilogue+0x9/0x49 [ 61.190899] __ubsan_handle_shift_out_of_bounds+0x1ea/0x22f [ 61.197040] mlx5_ib_create_wq+0x1c99/0x1d50 [ 61.206632] ib_uverbs_ex_create_wq+0x499/0x820 [ 61.213892] ib_uverbs_write+0x77e/0xae0 [ 61.248018] vfs_write+0x121/0x3b0 [ 61.249831] ksys_write+0xa1/0x120 [ 61.254024] do_syscall_64+0x7c/0x2a0 [ 61.256178] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 61.259211] RIP: 0033:0x7f54bab70e99 [ 61.262125] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 [ 61.268678] RSP: 002b:00007ffe1541c318 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 61.271076] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54bab70e99 [ 61.273795] RDX: 0000000000000070 RSI: 0000000020000240 RDI: 0000000000000003 [ 61.276982] RBP: 00007ffe1541c330 R08: 00000000200078e0 R09: 0000000000000002 [ 61.280035] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005c0 [ 61.283279] R13: 00007ffe1541c420 R14: 0000000000000000 R15: 0000000000000000 Cc: <[email protected]> # 4.7 Fixes: 79b20a6c3014 ("IB/mlx5: Add receive Work Queue verbs") Cc: syzkaller <[email protected]> Reported-by: Noa Osherovich <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-02RDMA/netdev: Use priv_destructor for netdev cleanupJason Gunthorpe1-10/+0
Now that the unregister_netdev flow for IPoIB no longer relies on external code we can now introduce the use of priv_destructor and needs_free_netdev. The rdma_netdev flow is switched to use the netdev common priv_destructor instead of the special free_rdma_netdev and the IPOIB ULP adjusted: - priv_destructor needs to switch to point to the ULP's destructor which will then call the rdma_ndev's in the right order - We need to be careful around the error unwind of register_netdev as it sometimes calls priv_destructor on failure - ULPs need to use ndo_init/uninit to ensure proper ordering of failures around register_netdev Switching to priv_destructor is a necessary pre-requisite to using the rtnl new_link mechanism. The VNIC user for rdma_netdev should also be revised, but that is left for another patch. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Denis Drozdov <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]>
2018-08-02iw_cxgb4: Support FW write completion WRPotnuri Bharat Teja4-2/+183
To optimize NVME-oF READ IOPs, use a specialized WQE that combines the RDMA WRITE and SEND_INV WR chain submitted by the NVME-oF target driver. This reduces uP overhead per NVME-oF IO, and results in over 10% improvement in NVME-oF 4K READ IOPs. Signed-off-by: Potnuri Bharat Teja <[email protected]> Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-02iw_cxgb4: RDMA write with immediate supportPotnuri Bharat Teja4-15/+79
Adds iw_cxgb4 functionality to support RDMA_WRITE_WITH_IMMEDATE opcode. Signed-off-by: Potnuri Bharat Teja <[email protected]> Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-02rdma/cxgb4: fix some info leaksDan Carpenter1-4/+3
In c4iw_create_qp() there are several struct members which potentially aren't inintialized like uresp.rq_key. I've fixed this code before in in commit ae1fe07f3f42 ("RDMA/cxgb4: Fix stack info leak in c4iw_create_qp()") so this time I'm just going to take a big hammer approach and memset the whole struct to zero. Hopefully, it will stay fixed this time. In c4iw_create_srq() we don't clear uresp.reserved. Fixes: 6a0b6174d35a ("rdma/cxgb4: Add support for kernel mode SRQ's") Signed-off-by: Dan Carpenter <[email protected]> Acked-by: Raju Rangoju <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-02RDMA/hns: Support flush cqe for hip08 in kernel spaceYixian Liu4-20/+240
According to IB protocol, there are some cases that work requests must return the flush error completion status through the completion queue. Due to hardware limitation, the driver needs to assist the flush process. This patch adds the support of flush cqe for hip08 in the cases that needed, such as poll cqe, post send, post recv and aeqe handle. The patch also considered the compatibility between kernel and user space. Signed-off-by: Yixian Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-08-01IB/uverbs: Do not pass struct ib_device to the ioctl methodsJason Gunthorpe2-27/+24
This does the same as the patch before, except for ioctl. The rules are the same, but for the ioctl methods the core code handles setting up the uobject. - Retrieve the ib_dev from the uobject->context->device. This is safe under ioctl as the core has already done rdma_alloc_begin_uobject and so CREATE calls are entirely protected by the rwsem. - Retrieve the ib_dev from uobject->object - Call ib_uverbs_get_ucontext() Signed-off-by: Jason Gunthorpe <[email protected]>