Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull rdma updates from Jason Gunthorpe:
"Usual collection of small improvements and fixes:
- Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe,
hf1, qib, ocrdma
- bnxt_re support for MSN, which is a new retransmit logic
- Initial mana support for RC qps
- Use after free bug and cleanups in iwcm
- Reduce resource usage in mlx5 when RDMA verbs features are not used
- New verb to drain shared recieve queues, similar to normal recieve
queues. This is necessary to allow ULPs a clean shutdown. Used in
the iscsi rdma target
- mlx5 support for more than 16 bits of doorbell indexes
- Doorbell moderation support for bnxt_re
- IB multi-plane support for mlx5
- New EFA adaptor PCI IDs
- RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't
rename the device
- A collection of hns bugs
- Fix long standing bug in bnxt_re with incorrect endian handling of
immediate data"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
IB/hfi1: Constify struct flag_table
RDMA/mana_ib: Set correct device into ib
bnxt_re: Fix imm_data endianness
RDMA: Fix netdev tracker in ib_device_set_netdev
RDMA/hns: Fix mbx timing out before CMD execution is completed
RDMA/hns: Fix insufficient extend DB for VFs.
RDMA/hns: Fix undifined behavior caused by invalid max_sge
RDMA/hns: Fix shift-out-bounds when max_inline_data is 0
RDMA/hns: Fix missing pagesize and alignment check in FRMR
RDMA/hns: Fix unmatch exception handling when init eq table fails
RDMA/hns: Fix soft lockup under heavy CEQE load
RDMA/hns: Check atomic wr length
RDMA/ocrdma: Don't inline statistics functions
RDMA/core: Introduce "name_assign_type" for an IB device
RDMA/qib: Fix truncation compilation warnings in qib_verbs.c
RDMA/qib: Fix truncation compilation warnings in qib_init.c
RDMA/efa: Add EFA 0xefa3 PCI ID
RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
net/mlx5: mlx5_ifc update for accessing ppcnt register of plane ports
RDMA/mlx5: Add plane index support when querying PTYS registers
...
|
|
Changes the create_cq verb signature by sending the entire uverbs attr
bundle as a parameter. This allows drivers to send driver specific attrs
through ioctl for the create_cq verb and access them in their driver
specific code.
Also adds a new enum value for driver specific ioctl attributes for
methods already supporting UHW.
Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org
Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
For RDMA Send and Write with IB_SEND_INLINE, the memory buffers
specified in sge list will be placed inline in the Send Request.
The data should be copied by CPU from the virtual addresses of
corresponding sge list DMA addresses.
Cc: stable@kernel.org
Fixes: 8d7c7c0eeb74 ("RDMA: Add ib_virt_dma_to_page()")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240516095052.542767-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Replace calls to rxe_run_task() with rxe_sched_task(). This prevents the
tasks from all running on the same cpu.
This change slightly reduces performance for single qp send and write
benchmarks in loopback mode but greatly improves the performance with
multiple qps because if run task is used all the work tends to be
performed on one cpu. For actual on the wire benchmarks there is no
noticeable performance change.
Link: https://lore.kernel.org/r/20240329145513.35381-11-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently the rxe driver has three work queue tasks per qp. These are the
req.task, comp.task and resp.task which call rxe_requester(),
rxe_completer() and rxe_responder() respectively directly or on work
queues. Each of these subroutines checks to see if there is work to be
performed on the send queue or on the response packet queue or the request
packet queue and will run until there is no work remaining or yield the
cpu and reschedule itself until there is no work remaining.
This commit combines the req.task and comp.task into a single send.task
and renames the resp.task to the recv.task. The combined send.task calls
rxe_requester() and rxe_completer() serially and continues until all work
on both the send queue and the response packet queue are done.
In various benchmarks the performance is either improved or left the
same. At high scale there is a significant reduction in the load on the
cpu.
This is the first step in combining these two tasks. Once they are
serialized cross rescheduling of req.task and comp.task can be more
efficiently handled by just letting the send.task continue to run. This
will be done in the next several patches.
Link: https://lore.kernel.org/r/20240329145513.35381-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In rxe_post_send_kernel() if the qp is in the error state after posting
the work requests the rxe_completer() task is scheduled.
But, the only way to move the qp into the error state is to call
rxe_qp_error() which also schedules the rxe_completer() task to drain the
queues. Calling it a second time has no effect. This commit removes the
redundant call.
Link: https://lore.kernel.org/r/20240329145513.35381-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A previous commit incorrectly added an 'if(!err)' before scheduling the
requester task in rxe_post_send_kernel(). But if there were send wrs
successfully added to the send queue before a bad wr they might never get
executed.
This commit fixes this by scheduling the requester task if any wqes were
successfully posted in rxe_post_send_kernel() in rxe_verbs.c.
Link: https://lore.kernel.org/r/20240329145513.35381-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 5bf944f24129 ("RDMA/rxe: Add error messages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
This one is not needed since commit 954afc5a8fd8 ("RDMA/rxe:
Use members of generic struct in rxe_mr").
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20240202124144.16033-1-guoqing.jiang@linux.dev
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously rxe_{dbg,info,err}() macros are appended built-in newline,
but some users will add redundant newline sometimes. So remove the
built-in newline for these macros.
In terms of rxe_{dbg,info,err}_xxx() macros, because they don't have
built-in newline, append newline when using them.
CC: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20240109083253.3629967-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Removed unreachable break statement after return.
Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Link: https://lore.kernel.org/r/20230822091304.7312-1-roheetchavan@gmail.com
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re:
- Use the common mmap functions
- Support disassociation
- Improve FW command flow
- support for 'low latency push'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
RDMA/bnxt_re: Remove incorrect return check from slow path
RDMA/bnxt_re: Enable low latency push
RDMA/bnxt_re: Reorg the bar mapping
RDMA/bnxt_re: Move the interface version to chip context structure
RDMA/bnxt_re: Query function capabilities from firmware
RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
RDMA/bnxt_re: Add disassociate ucontext support
RDMA/bnxt_re: Use the common mmap helper functions
RDMA/bnxt_re: Initialize opcode while sending message
RDMA/cma: Remove NULL check before dev_{put, hold}
RDMA/rxe: Simplify cq->notify code
RDMA/rxe: Fixes mr access supported list
RDMA/bnxt_re: optimize the parameters passed to helper functions
RDMA/bnxt_re: remove redundant cmdq_bitmap
RDMA/bnxt_re: use firmware provided max request timeout
RDMA/bnxt_re: cancel all control path command waiters upon error
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
"Documentation updates
Miscellaneous fixes, perhaps most notably:
- Remove RCU_NONIDLE(). The new visibility of most of the idle loop
to RCU has obsoleted this API.
- Make the RCU_SOFTIRQ callback-invocation time limit also apply to
the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.
- Add a jiffies-based callback-invocation time limit to handle
long-running callbacks. (The local_clock() function is only invoked
once per 32 callbacks due to its high overhead.)
- Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs, which
fixes a bug that can occur on systems with non-contiguous CPU
numbering.
kvfree_rcu updates:
- Eliminate the single-argument variant of k[v]free_rcu() now that
all uses have been converted to k[v]free_rcu_mightsleep().
- Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks too
soon. Yes, this is closing the barn door after the horse has
escaped, but Murphy says that there will be more horses.
Callback-offloading updates:
- Fix a number of bugs involving the shrinker and lazy callbacks.
Tasks RCU updates
Torture-test updates"
* tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (32 commits)
torture: Remove duplicated argument -enable-kvm for ppc64
doc/rcutorture: Add description of rcutorture.stall_cpu_block
rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
rcutorture: Correct name of use_softirq module parameter
locktorture: Add long_hold to adjust lock-hold delays
rcu/nocb: Make shrinker iterate only over NOCB CPUs
rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
rcu: Make rcu_cpu_starting() rely on interrupts being disabled
rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
rcu: Employ jiffies-based backstop to callback time limit
rcu: Check callback-invocation time limit for rcuc kthreads
rcu: Remove RCU_NONIDLE()
rcu: Add more RCU files to kernel-api.rst
rcu-tasks: Clarify the cblist_init_generic() function's pr_info() output
rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
rcu/nocb: Recheck lazy callbacks under the ->nocb_lock from shrinker
rcu/nocb: Fix shrinker race against callback enqueuer
rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
...
|
|
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c:
https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The flags parameter to the request notify verb is a bitmask. But, rxe
driver treats cq->notify as an int. If someone ever set both the
IB_CQ_SOLICITED and the IB_CQ_NEXT_COMP bits rxe_cq_post could fail to
generate a completion event. This patch treats the notify flags as a bit
mask consistently and can handle the above case correctly.
Link: https://lore.kernel.org/r/20230612162244.20038-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Implement the two easy cases of ib_rereg_user_mr.
Link: https://lore.kernel.org/r/20230530221334.89432-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Introduce supported bit masks for setting the access attributes of MWs,
MRs, and QPs. Check these when attributes are set.
Link: https://lore.kernel.org/r/20230530221334.89432-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
spin_{lock_irqsave,unlock_irqrestore}
We need to call spin_lock_irqsave()/spin_unlock_irqrestore() for
state_lock in rxe, otherwsie the callchain:
ib_post_send_mad
-> spin_lock_irqsave
-> ib_post_send -> rxe_post_send
-> spin_lock_bh
-> spin_unlock_bh
-> spin_unlock_irqrestore
Causes below traces during run block nvmeof-mp/001 test due to mismatched
spinlock nesting:
WARNING: CPU: 0 PID: 94794 at kernel/softirq.c:376 __local_bh_enable_ip+0xc2/0x140
[ ... ]
CPU: 0 PID: 94794 Comm: kworker/u4:1 Tainted: G E 6.4.0-rc1 #9
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Workqueue: rdma_cm cma_work_handler [rdma_cm]
RIP: 0010:__local_bh_enable_ip+0xc2/0x140
Code: 48 85 c0 74 72 5b 41 5c 5d 31 c0 89 c2 89 c1 89 c6 89 c7 41 89 c0 e9 bd 0e 11 01 65 8b 05 f2 65 72 48 85 c0 0f 85 76 ff ff ff <0f> 0b e9 6f ff ff ff e8 d2 39 1c 00 eb 80 4c 89 e7 e8 68 ad 0a 00
RSP: 0018:ffffb7cf818539f0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000201 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffc0f25f79
RBP: ffffb7cf81853a00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0f25f79
R13: ffff8db1f0fa6000 R14: ffff8db2c63ff000 R15: 00000000000000e8
FS: 0000000000000000(0000) GS:ffff8db33bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559758db0f20 CR3: 0000000105124000 CR4: 00000000003506f0
Call Trace:
<TASK>
_raw_spin_unlock_bh+0x31/0x40
rxe_post_send+0x59/0x8b0 [rdma_rxe]
ib_send_mad+0x26b/0x470 [ib_core]
ib_post_send_mad+0x150/0xb40 [ib_core]
? cm_form_tid+0x5b/0x90 [ib_cm]
ib_send_cm_req+0x7c8/0xb70 [ib_cm]
rdma_connect_locked+0x433/0x940 [rdma_cm]
nvme_rdma_cm_handler+0x5d7/0x9c0 [nvme_rdma]
cma_cm_event_handler+0x4f/0x170 [rdma_cm]
cma_work_handler+0x6a/0xe0 [rdma_cm]
process_one_work+0x2a9/0x580
worker_thread+0x52/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0x109/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 0 PID: 94794 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x37/0x60
[ ... ]
CPU: 0 PID: 94794 Comm: kworker/u4:1 Tainted: G W E 6.4.0-rc1 #9
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
Workqueue: rdma_cm cma_work_handler [rdma_cm]
RIP: 0010:warn_bogus_irq_restore+0x37/0x60
Code: fb 01 77 36 83 e3 01 74 0e 48 8b 5d f8 c9 31 f6 89 f7 e9 ac ea 01 00 48 c7 c7 e0 52 33 b9 c6 05 bb 1c 69 01 01 e8 39 24 f0 fe <0f> 0b 48 8b 5d f8 c9 31 f6 89 f7 e9 89 ea 01 00 0f b6 f3 48 c7 c7
RSP: 0018:ffffb7cf81853a58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb7cf81853a60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8db2cfb1a9e8
R13: ffff8db2cfb1a9d8 R14: ffff8db2c63ff000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8db33bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559758db0f20 CR3: 0000000105124000 CR4: 00000000003506f0
Call Trace:
<TASK>
_raw_spin_unlock_irqrestore+0x91/0xa0
ib_send_mad+0x1e3/0x470 [ib_core]
ib_post_send_mad+0x150/0xb40 [ib_core]
? cm_form_tid+0x5b/0x90 [ib_cm]
ib_send_cm_req+0x7c8/0xb70 [ib_cm]
rdma_connect_locked+0x433/0x940 [rdma_cm]
nvme_rdma_cm_handler+0x5d7/0x9c0 [nvme_rdma]
cma_cm_event_handler+0x4f/0x170 [rdma_cm]
cma_work_handler+0x6a/0xe0 [rdma_cm]
process_one_work+0x2a9/0x580
worker_thread+0x52/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0x109/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
Link: https://lore.kernel.org/r/20230510035056.881196-1-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently the rxe driver makes little effort to make the changes to qp
state (which includes qp->attr.qp_state, qp->attr.sq_draining and
qp->valid) atomic between different client threads and IO threads. In
particular a common template is for an RDMA application to call
ib_modify_qp() to move a qp to ERR state and then wait until all the
packet and work queues have drained before calling ib_destroy_qp(). None
of these state changes are protected by locks to assure that the changes
are executed atomically and that memory barriers are included. This has
been observed to lead to incorrect behavior around qp cleanup.
This patch continues the work of the previous patches in this series and
adds locking code around qp state changes and lookups.
Link: https://lore.kernel.org/r/20230405042611.6467-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The rxe driver has four different QP state variables,
qp->attr.qp_state,
qp->req.state,
qp->comp.state, and
qp->resp.state.
All of these basically carry the same information.
This patch replaces uses of qp->req.state by qp->attr.qp_state and enum
rxe_qp_state. This is the third of three patches which will remove all
but the qp->attr.qp_state variable. This will bring the driver closer to
the IBA description.
Link: https://lore.kernel.org/r/20230405042611.6467-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The rxe driver has four different QP state variables,
qp->attr.qp_state,
qp->req.state,
qp->comp.state, and
qp->resp.state.
All of these basically carry the same information.
This patch replaces uses of qp->resp.state by qp->attr.qp_state. This is
the first of three patches which will remove all but the qp->attr.qp_state
variable. This will bring the driver closer to the IBA description.
Link: https://lore.kernel.org/r/20230405042611.6467-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Make it clearer what is going on by adding a function to go back from the
"virtual" dma_addr to a kva and another to a struct page. This is used in the
ib_uses_virt_dma() style drivers (siw, rxe, hfi, qib).
Call them instead of a naked casting and virt_to_page() when working with dma_addr
values encoded by the various ib_map functions.
This also fixes the virt_to_page() casting problem Linus Walleij has been
chasing.
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-05ea785520ed+10-ib_virt_page_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
There is no need to print any debug messages after failure to
allocate memory, because kernel will print OOM dumps anyway.
Together with removal of these messages, remove useless goto jumps.
Fixes: 5bf944f24129 ("RDMA/rxe: Add error messages")
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/all/ea43486f-43dd-4054-b1d5-3a0d202be621@kili.mountain
Link: https://lore.kernel.org/r/d3cedf723b84e73e8062a67b7489d33802bafba2.1680113597.git.leon@kernel.org
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Remove the tasklet call in rxe_cq.c and also the is_dying in the
cq struct. There is no reason for the rxe driver to defer the call
to the cq completion handler by scheduling a tasklet. rxe_cq_post()
is not called in a hard irq context.
The rxe driver currently is incorrect because the tasklet call is
made without protecting the cq pointer with a reference from having
the underlying memory freed before the deferred routine is called.
Executing the comp_handler inline fixes this problem.
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Link: https://lore.kernel.org/r/20230327215643.10410-1-rpearsonhpe@gmail.com
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This patch adds error and debug messages so that every interaction
with rdma-core through a verbs API call or a completion error return
will generate at least one error message backed up by debug messages
with more detail.
With dynamic debugging one can follow up after seeing an error message
by turning on the appropriate debug messages.
Link: https://lore.kernel.org/r/20230303221623.8053-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace the name rxe_dbg with rxe_dbg_dev which better matches
the remaining rxe_dbg_xxx macros for debug messages with a
rxe device parameter. Reuse the name rxe_dbg for debug messages
which do not have a rxe device parameter.
Link: https://lore.kernel.org/r/20230303221623.8053-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
An earlier patch which introduced smp_load_acquire/smp_store_release
into rxe_queue.h incorrectly assumed that surrounding spin-locks in
rxe_verbs.c around queue updates for kernel ulps was sufficient to
protect the passing of data through the queues between the ulp and
the rxe tasklets. But this was incorrect. The typical sequence was
ulp rxe requester tasklet
------------------------ ---------------------
spin_lock_irqsave() wqe = queue_head(queue)
if (!queue_full(q)) { if (!wqe)
spin_unlock_irqrestore return;
return -ENOMEM
} <process wqe>
wqe = queue_producer_addr(q)
<fill in wqe> queue_advance_consumer(queue)
queue_advance_producer(q)
spin_unlock_irqrestore()
queue_head() calls queue_empty() which calls smp_load_acquire()
For user space apps queue_advance_producer() calls smp_store_release()
so that there is a memory barrier between the producer and the
consumer but for kernel ulps queue_advance_produce() just incremented
the producer index because the lock function is a release function.
But to work the barrier has to come between filling in the wqe and
updating the producer index. This patch adds the missing barriers.
It also changes the enum names for the ulp queue types to
QUEUE_TYPE_FROM/TO_ULP instead of QUEUE_TYPE_TO/FROM_DRIVER
which is very ambiguous. This bug is suspected as the cause of very
rare lockups in a very high scale storage application. It is a bug
in any case and should be corrected.
Fixes: 0a67c46d2e99 ("RDMA/rxe: Protect user space index loads/stores")
Link: https://lore.kernel.org/r/20230214071053.5395-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently all the object types in the rxe driver are allocated in
rdma-core except for MRs. By moving tha kzalloc() call outside of
the pool code the rxe_alloc() subroutine can be eliminated and code
checking for MR as a special case can be removed.
This patch moves the kzalloc() and kfree_rcu() calls into the mr
registration and destruction verbs. It removes that code from
rxe_pool.c including the rxe_alloc() subroutine which is no longer
used.
Link: https://lore.kernel.org/r/20230213225551.12437-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com>
Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Move rxe_map_mr_sg() to rxe_mr.c where it makes a little more sense.
Link: https://lore.kernel.org/r/20230119235936.19728-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace calls to pr_xxx() in rxe_av.c with rxe_dbg_xxx().
Link: https://lore.kernel.org/r/20221103171013.20659-13-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace calls to pr_xxx() in rxe_verbs.c with rxe_dbg_xxx().
Link: https://lore.kernel.org/r/20221103171013.20659-12-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace calls to pr_xxx() in rxe_mr.c by rxe_dbg_mr().
Link: https://lore.kernel.org/r/20221103171013.20659-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Instead of 'goto and return', just return directly to
simplify the error handling, and avoid some unnecessary
return value check.
Link: https://lore.kernel.org/r/20221028075053.3990467-1-xuhaoyue1@hisilicon.com
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Split rxe_run_task(task, sched) into rxe_run_task(task) and
rxe_sched_task(task).
Link: https://lore.kernel.org/r/20221021200118.2163-5-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In include/uapi/rdma/rdma_user_rxe.h there are redundant copies of num_sge
in the rxe_send_wr, rxe_recv_wqe, and rxe_dma_info. Only the ones in
rxe_dma_info are actually used by the rxe kernel driver.
The userspace would set these values, but the kernel never read them.
This change has no affect on the current ABI and new or old versions of
rdma-core operate correctly with new or old versions of the kernel rxe
driver.
Link: https://lore.kernel.org/r/20220913222716.18335-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Move setting of pd in mr objects ahead of any possible errors so that it
will always be set in rxe_mr_cleanup() to avoid seg faults when
rxe_put(mr_pd(mr)) is called.
Fixes: cf40367961d8 ("RDMA/rxe: Move mr cleanup code to rxe_mr_cleanup()")
Link: https://lore.kernel.org/r/20220805183153.32007-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
rxe_mr and ib_mr have interchangeable members. Remove device specific
members and use ones in the generic struct. Both 'iova' and 'length' are
filled in ib_uverbs or ib_core layer after MR registration.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220921080844.1616883-2-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Commit 1e75550648da ("Revert "RDMA/rxe: Create duplicate mapping tables for
FMRs"") brought back the member 'va' to struct rxe_mr. However, it is
actually used by nobody and thus can be removed.
Fixes: 1e75550648da ("Revert "RDMA/rxe: Create duplicate mapping tables for FMRs"")
Link: https://lore.kernel.org/r/20220829012335.1212697-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Below 2 commits will be reverted:
commit 8ff5f5d9d8cf ("RDMA/rxe: Prevent double freeing rxe_map_set()")
commit 647bf13ce944 ("RDMA/rxe: Create duplicate mapping tables for FMRs")
The community has a few bug reports which pointed this commit at last.
Some proposals are raised up in the meantime but all of them have no
follow-up operation.
The previous commit led the map_set of FMR to be not available any more if
the MR is registered again after invalidating. Although the mentioned
patch try to fix a potential race in building/accessing the same table
for fast memory regions, it broke rtrs etc ULPs. Since the latter could
be worse, revert this patch.
With previous commit, it's observed that a same MR in rnbd server will
trigger below code path:
-> rxe_mr_init_fast()
|-> alloc map_set() # map_set is uninitialized
|...-> rxe_map_mr_sg() # build the map_set
|-> rxe_mr_set_page()
|...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE that means
# we can access host memory(such rxe_mr_copy)
|...-> rxe_invalidate_mr() # mr->state change to FREE from VALID
|...-> rxe_reg_fast_mr() # mr->state change to VALID from FREE,
# but map_set was not built again
|...-> rxe_mr_copy() # kernel crash due to access wild addresses
# that lookup from the map_set
The backtraces are not always identical.
[1st]----------
RIP: 0010:lookup_iova+0x66/0xa0 [rdma_rxe]
Code: 00 00 00 48 d3 ee 89 32 c3 4c 8b 18 49 8b 3b 48 8b 47 08 48 39 c6 72 38 48 29 c6 45 31 d2 b8 01 00 00 00 48 63 c8 48 c1 e1 04 <48> 8b 4c 0f 08 48 39 f1 77 21 83 c0 01 48 29 ce 3d 00 01 00 00 75
RSP: 0018:ffffb7ff80063bf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9b9949d86800 RCX: 0000000000000000
RDX: ffffb7ff80063c00 RSI: 0000000049f6b378 RDI: 002818da00000004
RBP: 0000000000000120 R08: ffffb7ff80063c08 R09: ffffb7ff80063c04
R10: 0000000000000002 R11: ffff9b9916f7eef8 R12: ffff9b99488a0038
R13: ffff9b99488a0038 R14: ffff9b9914fb346a R15: ffff9b990ab27000
FS: 0000000000000000(0000) GS:ffff9b997dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efc33a98ed0 CR3: 0000000014f32004 CR4: 00000000001706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
rxe_mr_copy.part.0+0x6f/0x140 [rdma_rxe]
rxe_responder+0x12ee/0x1b60 [rdma_rxe]
? rxe_icrc_check+0x7e/0x100 [rdma_rxe]
? rxe_rcv+0x1d0/0x780 [rdma_rxe]
? rxe_icrc_hdr.isra.0+0xf6/0x160 [rdma_rxe]
rxe_do_task+0x67/0xb0 [rdma_rxe]
rxe_xmit_packet+0xc7/0x210 [rdma_rxe]
rxe_requester+0x680/0xee0 [rdma_rxe]
? update_load_avg+0x5f/0x690
? update_load_avg+0x5f/0x690
? rtrs_clt_recv_done+0x1b/0x30 [rtrs_client]
[2nd]----------
RIP: 0010:rxe_mr_copy.part.0+0xa8/0x140 [rdma_rxe]
Code: 00 00 49 c1 e7 04 48 8b 00 4c 8d 2c d0 48 8b 44 24 10 4d 03 7d 00 85 ed 7f 10 eb 6c 89 54 24 0c 49 83 c7 10 31 c0 85 ed 7e 5e <49> 8b 3f 8b 14 24 4c 89 f6 48 01 c7 85 d2 74 06 48 89 fe 4c 89 f7
RSP: 0018:ffffae3580063bf8 EFLAGS: 00010202
RAX: 0000000000018978 RBX: ffff9d7ef7a03600 RCX: 0000000000000008
RDX: 000000000000007c RSI: 000000000000007c RDI: ffff9d7ef7a03600
RBP: 0000000000000120 R08: ffffae3580063c08 R09: ffffae3580063c04
R10: ffff9d7efece0038 R11: ffff9d7ec4b1db00 R12: ffff9d7efece0038
R13: ffff9d7ef4098260 R14: ffff9d7f11e23c6a R15: 4c79500065708144
FS: 0000000000000000(0000) GS:ffff9d7f3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fce47276c60 CR3: 0000000003f66004 CR4: 00000000001706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
rxe_responder+0x12ee/0x1b60 [rdma_rxe]
? rxe_icrc_check+0x7e/0x100 [rdma_rxe]
? rxe_rcv+0x1d0/0x780 [rdma_rxe]
? rxe_icrc_hdr.isra.0+0xf6/0x160 [rdma_rxe]
rxe_do_task+0x67/0xb0 [rdma_rxe]
rxe_xmit_packet+0xc7/0x210 [rdma_rxe]
rxe_requester+0x680/0xee0 [rdma_rxe]
? update_load_avg+0x5f/0x690
? update_load_avg+0x5f/0x690
? rtrs_clt_recv_done+0x1b/0x30 [rtrs_client]
rxe_do_task+0x67/0xb0 [rdma_rxe]
tasklet_action_common.constprop.0+0x92/0xc0
__do_softirq+0xe1/0x2d8
run_ksoftirqd+0x21/0x30
smpboot_thread_fn+0x183/0x220
? sort_range+0x20/0x20
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
Link: https://lore.kernel.org/r/1658805386-2-1-git-send-email-lizhijian@fujitsu.com
Link: https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
Link: https://www.spinics.net/lists/linux-rdma/msg110836.html
Link: https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@linux.dev/t/
Tested-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently the rdma_rxe driver has a security weakness due to giving
objects which are partially initialized indices allowing external actors
to gain access to them by sending packets which refer to their
index (e.g. qpn, rkey, etc) causing unpredictable results.
This patch adds a new API rxe_finalize(obj) which enables looking up pool
objects from indices using rxe_pool_get_index() for AH, QP, MR, and
MW. They are added in create verbs only after the objects are fully
initialized.
It also adds wait for completion to destroy/dealloc verbs to assure that
all references have been dropped before returning to rdma_core by
implementing a new rxe_pool API rxe_cleanup() which drops a reference to
the object and then waits for all other references to be dropped. When
the last reference is dropped the object is completed by kref. After that
it cleans up the object and if locally allocated frees the memory. In the
special case of address handle objects the delay is implemented separately
if the destroy_ah call is not sleepable.
Combined with deferring cleanup code to type specific cleanup routines
this allows all pending activity referring to objects to complete before
returning to rdma_core.
Link: https://lore.kernel.org/r/20220612223434.31462-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add a counter to keep track of the number of WQs connected to a CQ and
return an error if destroy_cq() is called while the counter is non zero.
Link: https://lore.kernel.org/r/20220421014042.26985-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Move the code from rxe_qp_destroy() to rxe_qp_do_cleanup(). This allows
flows holding references to qp to complete before the qp object is torn
down.
Link: https://lore.kernel.org/r/20220421014042.26985-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Move cleanup code from rxe_destroy_srq() to rxe_srq_cleanup() which is
called after all references are dropped to allow code depending on the srq
object to complete.
Link: https://lore.kernel.org/r/20220421014042.26985-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently the #define IB_SRQ_INIT_MASK is used to distinguish the
rxe_create_srq verb from the rxe_modify_srq verb so that some code can be
shared between these two subroutines.
This commit splits rxe_srq_chk_attr into two subroutines: rxe_srq_chk_init
and rxe_srq_chk_attr which handle the create_srq and modify_srq verbs
separately.
Link: https://lore.kernel.org/r/20220421014042.26985-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently the rdma_rxe driver supports SMI type QPs in a few places which
is incorrect. RoCE devices never should support SMI QPs. This commit
removes SMI QP support from the driver.
Link: https://lore.kernel.org/r/20220407185416.16372-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Rename rxe_add_ref() to rxe_get() and rxe_drop_ref() to rxe_put().
Significantly improves readability for new readers.
Link: https://lore.kernel.org/r/20220304000808.225811-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Currently the rxe driver uses red-black trees to add indices to the rxe
object pools. Linux xarrays provide a better way to implement the same
functionality for indices. This patch replaces red-black trees by xarrays
for pool objects. Since xarrays already have a spinlock use that in place
of the pool rwlock. Make sure that all changes in the xarray(index) and
kref(ref counnt) occur atomically.
Link: https://lore.kernel.org/r/20220304000808.225811-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A previous patch replaced all irqsave locks in rxe with bh locks. This
ran into problems because rdmacm has a bad habit of calling rdma verbs
APIs while disabling irqs. This is not allowed during spin_unlock_bh()
causing programs that use rdmacm to fail. This patch reverts the changes
to locks that had this problem or got dragged into the same mess. After
this patch blktests/check -q srp now runs correctly.
Link: https://lore.kernel.org/r/20220215194448.44369-1-rpearsonhpe@gmail.com
Fixes: 21adfa7a3c4e ("RDMA/rxe: Replace irqsave locks with bh locks")
Reported-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add code to check if a QP is attached to one or more multicast groups
when destroy_qp is called and return an error if so.
Link: https://lore.kernel.org/r/20220127213755.31697-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Move rxe_mcast_attach and rxe_mcast_detach from rxe_verbs.c to rxe_mcast.c,
Make non-static and add declarations to rxe_loc.h. Make the subroutines
in rxe_mcast.c referenced by these routines static and remove their
declarations from rxe_loc.h.
Link: https://lore.kernel.org/r/20220127213755.31697-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Use the standard method to produce udp source port.
Link: https://lore.kernel.org/r/20220106180359.2915060-5-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|