| Age | Commit message | Author | Files | Lines | 
 | 
1. TDR will kick out a guilty job if its hangs exceed the
threshold given by the kernel parameter "job_hang_limit", so
that a bad command stream cannot cause GPU hangs indefinitely.
By default this threshold is 1, so a job is kicked out
after it hangs.
2. If a job times out, the TDR routine will not reset every sched/ring;
instead it will only reset the given one, which is indicated
by @job of amdgpu_sriov_gpu_reset. That way we don't need to
reset and recover each sched/ring if we already know which job
caused the GPU hang.
3. Unblock sriov_gpu_reset for the AI family.
V2:
1. Put the guilty job kickout after the scheds are parked.
2. Since parking the scheduler prior to kickout already takes a
while, we can do a last check on the job in question before
doing hw_reset.
TODO:
1. When a job is considered guilty, we should set a flag
in its fence status flags, and let the UMD side know that this
fence signaling is due to a job hang rather than job completion.
2. If a GPU reset causes all video memory to be lost, we need to introduce
a new policy to implement TDR, such as dropping all jobs not yet
signaled and having all IOCTLs on this device return ERROR
DEVICE_LOST.
This will be implemented later.
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
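The hang-limit check described in point 1 can be illustrated with a minimal userspace sketch. All names here (sketch_job, sketch_job_timed_out, hang_limit) are hypothetical and are not the amdgpu or scheduler structures; the ">=" comparison is an assumption chosen so that a limit of 1 kicks the job out after its first hang, as the message describes.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a scheduler job; not the amdgpu structure. */
struct sketch_job {
    int hang_count;   /* how many times this job has timed out */
    bool kicked_out;  /* set once the job is dropped as guilty */
};

/*
 * Called when a job times out. Returns true if the job should be
 * kicked out instead of being resubmitted after the reset.
 * hang_limit plays the role of the "job_hang_limit" parameter
 * (assumed semantics: kick out once hang_count reaches the limit).
 */
static bool sketch_job_timed_out(struct sketch_job *job, int hang_limit)
{
    job->hang_count++;
    if (job->hang_count >= hang_limit)
        job->kicked_out = true;
    return job->kicked_out;
}
```

With a limit of 1 the first timeout already marks the job guilty; with a larger limit the job would be retried until the count reaches the limit.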
 | 
 | 
We don't want to do sriov-gpu-reset in certain
cases, so split those two functions and don't invoke
the SR-IOV one from the bare-metal one.
V2:
Remove the debugfs_gpu_reset routine in the SRIOV case.
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
KIQ is used for interaction between the driver and
the CP and is not exposed to outside clients, so it
doesn't need to be handled by the GPU scheduler.
Signed-off-by: Monk Liu <[email protected]>
Signed-off-by: Xiangliang Yu <[email protected]>
Signed-off-by: Trigger Huang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Linux 4.9-rc4
This is needed for nouveau development.
 | 
 | 
I plan to usurp the short name of struct fence for a core kernel struct,
and so I need to rename the specialised fence/timeline for DMA
operations to make room.
A consensus was reached in
https://lists.freedesktop.org/archives/dri-devel/2016-July/113083.html
that making clear this fence applies to DMA operations was a good thing.
Since then the patch has grown a bit as usage increases, so hopefully it
remains a good thing!
(v2...: rebase, rerun spatch)
v3: Compile on msm, spotted a manual fixup that I broke.
v4: Try again for msm, sorry Daniel
coccinelle script:
@@
@@
- struct fence
+ struct dma_fence
@@
@@
- struct fence_ops
+ struct dma_fence_ops
@@
@@
- struct fence_cb
+ struct dma_fence_cb
@@
@@
- struct fence_array
+ struct dma_fence_array
@@
@@
- enum fence_flag_bits
+ enum dma_fence_flag_bits
@@
@@
(
- fence_init
+ dma_fence_init
|
- fence_release
+ dma_fence_release
|
- fence_free
+ dma_fence_free
|
- fence_get
+ dma_fence_get
|
- fence_get_rcu
+ dma_fence_get_rcu
|
- fence_put
+ dma_fence_put
|
- fence_signal
+ dma_fence_signal
|
- fence_signal_locked
+ dma_fence_signal_locked
|
- fence_default_wait
+ dma_fence_default_wait
|
- fence_add_callback
+ dma_fence_add_callback
|
- fence_remove_callback
+ dma_fence_remove_callback
|
- fence_enable_sw_signaling
+ dma_fence_enable_sw_signaling
|
- fence_is_signaled_locked
+ dma_fence_is_signaled_locked
|
- fence_is_signaled
+ dma_fence_is_signaled
|
- fence_is_later
+ dma_fence_is_later
|
- fence_later
+ dma_fence_later
|
- fence_wait_timeout
+ dma_fence_wait_timeout
|
- fence_wait_any_timeout
+ dma_fence_wait_any_timeout
|
- fence_wait
+ dma_fence_wait
|
- fence_context_alloc
+ dma_fence_context_alloc
|
- fence_array_create
+ dma_fence_array_create
|
- to_fence_array
+ to_dma_fence_array
|
- fence_is_array
+ dma_fence_is_array
|
- trace_fence_emit
+ trace_dma_fence_emit
|
- FENCE_TRACE
+ DMA_FENCE_TRACE
|
- FENCE_WARN
+ DMA_FENCE_WARN
|
- FENCE_ERR
+ DMA_FENCE_ERR
)
 (
 ...
 )
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Gustavo Padovan <[email protected]>
Acked-by: Sumit Semwal <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
 | 
 | 
To free fences, call_rcu() is used, which calls amdgpu_fence_free()
after a grace period. During teardown, there is no guarantee all
callbacks have finished, so amdgpu_fence_slab may be destroyed before
all fences have been freed. If we are lucky, this results in some slab
warnings; if not, we get a crash in one of the RCU threads because a callback
is called after amdgpu has already been unloaded.
Fix it with a rcu_barrier().
Fixes: b44135351a3a ("drm/amdgpu: RCU protected amdgpu_fence_release")
Acked-by: Chunming Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Grazvydas Ignotas <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
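The ordering problem can be modelled in userspace with a small sketch. This is not the kernel RCU API: model_call_deferred() and model_barrier() are hypothetical stand-ins for call_rcu() and rcu_barrier(), and live_objects stands in for objects still held by the slab.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; NOT the kernel RCU implementation. */
struct deferred {
    struct deferred *next;
    void (*func)(struct deferred *);
};

static struct deferred *pending;  /* callbacks waiting for a "grace period" */
static int live_objects;          /* objects still allocated from the pool */

/* Model of call_rcu(): queue the callback instead of running it now. */
static void model_call_deferred(struct deferred *d,
                                void (*func)(struct deferred *))
{
    d->func = func;
    d->next = pending;
    pending = d;
}

/* Model of rcu_barrier(): run every callback queued so far. */
static void model_barrier(void)
{
    while (pending) {
        struct deferred *d = pending;
        pending = d->next;
        d->func(d);
    }
}

struct model_fence {
    struct deferred rcu;  /* first member, so the cast below is valid */
};

static void model_fence_free(struct deferred *d)
{
    free((struct model_fence *)d);
    live_objects--;
}

static struct model_fence *model_fence_alloc(void)
{
    struct model_fence *f = malloc(sizeof(*f));
    if (f)
        live_objects++;
    return f;
}

/* Release a fence: the actual free is deferred, as with call_rcu(). */
static void model_fence_release(struct model_fence *f)
{
    model_call_deferred(&f->rcu, model_fence_free);
}

/* Teardown: returns how many objects would still be live when the
 * pool is destroyed. Without the barrier, deferred frees are pending. */
static int model_teardown(int use_barrier)
{
    if (use_barrier)
        model_barrier();
    return live_objects;
}
```

In this model, tearing down without the barrier still sees one live object after a release, while tearing down with the barrier sees none.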
 | 
 | 
Right now it's possible to trigger a fence_drv.fences[] dereference after
the array has been freed. While the real problem is elsewhere, this still
results in confusing errors that depend on how the freed memory was
reused (I've seen "kernel tried to execute NX-protected page"). It's
better to clear them and get a NULL dereference so that it's obvious what's
going wrong.
Signed-off-by: Grazvydas Ignotas <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
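The pattern is the usual "clear after free". A minimal sketch, using a hypothetical sketch_drv structure rather than the real fence driver:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a fence driver holding an array of fences. */
struct sketch_drv {
    void **fences;
};

/* Free the array and clear the pointer, so that any later use is a
 * NULL dereference rather than an access to reused memory. */
static void sketch_drv_fini(struct sketch_drv *drv)
{
    free(drv->fences);
    drv->fences = NULL;
}
```

Calling the teardown twice is also harmless in this sketch, because free(NULL) is a no-op.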
 | 
 | 
A little fallout from "drm/amdgpu: sanitize fence numbers", we
sometimes need to signal all fences in the ring.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Looks like the VCE block sometimes still sends nonsense
fence numbers on startup.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Using wrong counter for walking fences.  Fixes
a crash when unloading the driver.
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
This avoids problems with multiple GPUs.  For example,
if the first GPU failed before amdgpu_fence_init() was
called, amdgpu_fence_slab_ref is still 0 and it will
get decremented in amdgpu_fence_driver_fini().  This
will lead to a crash during init of the second GPU since
amdgpu_fence_slab_ref is not 0.
v2: add functions for init/exit instead of
    moving the variables into the driver.
Signed-off-by: Rex Zhu <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure. It is always
safe to use RCU_INIT_POINTER() to NULL a pointer, instead of
rcu_assign_pointer().
This results in slightly smaller/faster code.
The following semantic patch was used:
<smpl>
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)
</smpl>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Muhammad Falak R Wani <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
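The distinction can be modelled in userspace with C11 atomics. This is only an analogy, not the kernel macros: model_assign_pointer() stands in for rcu_assign_pointer() (a release store), model_clear_pointer() for RCU_INIT_POINTER(..., NULL) (a relaxed store), and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct payload {
    int value;
};

static _Atomic(struct payload *) published;

/* Model of rcu_assign_pointer(): release ordering, so the
 * initialisation of *p is visible before the pointer itself. */
static void model_assign_pointer(struct payload *p)
{
    atomic_store_explicit(&published, p, memory_order_release);
}

/* Model of RCU_INIT_POINTER(..., NULL): there is no structure to
 * initialise behind a NULL pointer, so no ordering is required. */
static void model_clear_pointer(void)
{
    atomic_store_explicit(&published, (struct payload *)NULL,
                          memory_order_relaxed);
}

/* Reader side: acquire load pairs with the release store above. */
static struct payload *model_dereference(void)
{
    return atomic_load_explicit(&published, memory_order_acquire);
}
```

A reader that sees NULL never dereferences anything, which is why the weaker store is sufficient when clearing.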
 | 
 | 
We introduced a vmid fence, so one hw submission can produce two fences.
Signed-off-by: Chunming Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
All these are compile-time constants, and the
drm_debugfs_create/remove_files functions take a const
pointer argument.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Nils Wallménius <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
The number of RCU slots is initialized to num_hw_submission.
If command submission doesn't use the scheduler, as in the UVD
test, this limit does not hold.
Signed-off-by: Chunming Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
We don't need to extend them to 64bits any more, so avoid the extra overhead.
v2: update commit message.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
It's just overhead to check the fence value
when we signal them directly anyway.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Amdgpu doesn't support using scratch registers for fences any more.
So we won't see values like 0xdeadbeef as fence value any more.
v2: reschedule timer even if no change detected
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Because of the scheduler we need to signal all fences immediately
anyway, so try to avoid the waitqueue overhead.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Just wait for last fence instead of waiting for the sequence manually.
v2: don't use amdgpu_sched_jobs for the mask
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Just keep all HW fences in a RCU protected array as a
first step to replace the wait queue.
v2: update commit message, move fixes into separate patch.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
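One way to picture such an array is a fixed set of slots indexed by the low bits of the sequence number. The sketch below is a plain single-threaded model with hypothetical names and a hypothetical slot count; it uses no RCU and does not guard against nonsense hardware sequence values.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_SLOTS 8u  /* hypothetical; must be a power of two */
#define SKETCH_SLOT_MASK (SKETCH_NUM_SLOTS - 1u)

struct sketch_fence {
    uint32_t seq;
    int signaled;
};

struct sketch_ring {
    struct sketch_fence *slots[SKETCH_NUM_SLOTS];
    uint32_t last_signaled;  /* last sequence number already processed */
};

/* Store an emitted fence in the slot selected by its sequence number.
 * Returns -1 if an older fence still occupies that slot. */
static int sketch_emit(struct sketch_ring *ring, struct sketch_fence *f)
{
    struct sketch_fence **slot = &ring->slots[f->seq & SKETCH_SLOT_MASK];

    if (*slot)
        return -1;
    *slot = f;
    return 0;
}

/* Signal every fence between last_signaled and hw_seq (inclusive) and
 * clear its slot. Returns the number of fences signaled. */
static int sketch_process(struct sketch_ring *ring, uint32_t hw_seq)
{
    int n = 0;

    while (ring->last_signaled != hw_seq) {
        uint32_t seq = ++ring->last_signaled;
        struct sketch_fence **slot = &ring->slots[seq & SKETCH_SLOT_MASK];

        if (*slot) {
            (*slot)->signaled = 1;
            *slot = NULL;
            n++;
        }
    }
    return n;
}
```

Because the slot index wraps, a new fence can only be stored once the older fence in the same slot has been signaled, which bounds the number of fences in flight.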
 | 
 | 
Make this a parameter instead of using the global variable directly.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Fences must be freed RCU protected, otherwise the reservation_object_*_rcu()
functions can run into problems.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
 | 
 | 
No need to keep the two separate any more.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
The comment about the loop counter was never valid. Even with
multiple threads, this loop only runs as long as the sequence increases.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
No need to have that in the header file any more.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Not used any more.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Try to avoid using the hardware specific fences even more.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Not used any more since we now always use the scheduler.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
wait_event() never returns before the fence was signaled.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
 | 
 | 
It's not needed any more because all access goes through the scheduler now.
v2: Update commit message.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Ported from similar code in radeon.
Reviewed-by: Junwei Zhang <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Ken Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Not used any more without semaphores
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Non-scheduler code is no longer supported.
v2: agd: rebased on upstream
Signed-off-by: Chunming Zhou <[email protected]>
Reviewed-by: Ken Wang <[email protected]>
Reviewed-by: Monk Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 | 
 | 
Change-Id: I5ad8dd156ccf27a6f18004aa0a215a0925b6e67b
Signed-off-by: Chunming Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
 | 
 | 
Less overhead than a work item and also adds proper cleanup handling.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
 | 
 | 
Mostly unused and replaced by the common trace points.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
 | 
 | 
And also add some missing function documentation. No functional change.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
 | 
 | 
Interrupts are notoriously unreliable, so enable the fallback in
a couple more places.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
 | 
 | 
Just move the remaining users to fence_put/get.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
 | 
 | 
No need to duplicate the functionality any more.
v2: fix handling if no fence is available.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]> (v1)
 | 
 | 
amdgpu_fence_default_wait isn't needed any more; the default wait does the same
thing, and amdgpu_test_signaled is dead as well.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
 | 
 | 
Signed-off-by: Junwei Zhang <[email protected]>
Reviewed-by: Christian König <[email protected]>
 | 
 | 
Finally getting rid of it.
Signed-off-by: Christian König <[email protected]>
 | 
 | 
It didn't work too well anyway.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Reviewed-by: Junwei Zhang <[email protected]>
 | 
 | 
Change-Id: I67e987db0efdca28faa80b332b75571192130d33
Signed-off-by: Junwei Zhang <[email protected]>
Reviewed-by: David Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
 | 
 | 
Embed the scheduler into the ring structure instead of allocating it.
Use the ring name directly instead of the id.
v2: rebased, whitespace cleanup
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Junwei Zhang <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
Move the fence related stuff into amdgpu_fence.c
v2: rework commit message, because this is actually not a bug
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Reviewed-by: Junwei Zhang <[email protected]>
 | 
 | 
Just to be consistent with the other members.
v2: rename the ring member as well.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Junwei Zhang <[email protected]> (v1)
Reviewed-by: Chunming Zhou <[email protected]>
 | 
 | 
amdgpu_fence_wait_multiple()" v2
That isn't used any more.
v2: rebase
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
 |