aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
AgeCommit message (Collapse)AuthorFilesLines
2020-09-03drm/amdgpu: disable gpu-sched load balance for uvdNirmoy Das1-1/+5
On hardware with multiple uvd instances, dependent uvd jobs may get scheduled to different uvd instances. Because uvd_enc jobs retain hw context, dependent jobs should always run on the same uvd instance. This patch disables GPU scheduler's load balancer for a context that binds jobs from the same context to a uvd instance. v2: Squash in uvd_enc fix Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amdgpu: drm_device to amdgpu_device by inline-f (v2)Luben Tuikov1-1/+1
Get the amdgpu_device from the DRM device by use of an inline function, drm_to_adev(). The inline function resolves a pointer to struct drm_device to a pointer to struct amdgpu_device. v2: Use a typed visible static inline function instead of an invisible macro. Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/scheduler: Remove priority macro INVALID (v2)Luben Tuikov1-3/+2
Remove DRM_SCHED_PRIORITY_INVALID. We no longer carry around an invalid priority and cut it off at the source. Backwards compatibility behaviour of AMDGPU CTX IOCTL passing in garbage for context priority from user space and then mapping that to DRM_SCHED_PRIORITY_NORMAL is preserved. v2: Revert "res" --> "r" and "prio" --> "priority". Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/scheduler: Scheduler priority fixes (v2)Luben Tuikov1-2/+2
Remove DRM_SCHED_PRIORITY_LOW, as it was used in only one place. Rename and separate by a line DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT as it represents a (total) count of said priorities and it is used as such in loops throughout the code. (0-based indexing is the the count number.) Remove redundant word HIGH in priority names, and rename *KERNEL* to *HIGH*, as it really means that, high. v2: Add back KERNEL and remove SW and HW, in lieu of a single HIGH between NORMAL and KERNEL. Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: revert "fix system hang issue during GPU reset"Christian König1-4/+0
The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929. Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-07-27drm/amdgpu: fix system hang issue during GPU resetDennis Li1-0/+4
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover, the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe that other threads access GPU, which maybe cause GPU reset failed. Therefore the new rw_semaphore adev->reset_sem is introduced, which protect GPU from being accessed by external threads during recovery. v2: 1. add rwlock for some ioctls, debugfs and file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in kfd driver. 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. v3: 1. change back to use adev->reset_sem to protect kfd callback functions, because dqm_lock couldn't protect all codes, for example: free_mqd must be called outside of dqm_lock; [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 1230.177221] Call Trace: [ 1230.178249] dump_stack+0x98/0xd5 [ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu] [ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu] [ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu] [ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu] [ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm] [ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm] [ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm] [ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm] [ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm] [ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm] [ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu] [ 1230.193833] free_mqd+0x25/0x40 [amdgpu] [ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu] [ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu] [ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu] [ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu] [ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu] [ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20 [ 1230.202831] ksys_ioctl+0x98/0xb0 [ 1230.204004] __x64_sys_ioctl+0x1a/0x20 [ 1230.205174] do_syscall_64+0x5f/0x250 [ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe 2. remove try_lock and introduce atomic hive->in_reset, to avoid re-enter GPU recovery. v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Reviewed-by: Hawking Zhang <[email protected]> Suggested-by: Andrey Grodzovsky <[email protected]> Suggested-by: Christian König <[email protected]> Suggested-by: Felix Kuehling <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Suggested-by: Luben Tukov <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-09drm/amdgpu: rework sched_list generationNirmoy Das1-126/+34
Generate HW IP's sched_list in amdgpu_ring_init() instead of amdgpu_ctx.c. This makes amdgpu_ctx_init_compute_sched(), ring.has_high_prio and amdgpu_ctx_init_sched() unnecessary. This patch also stores sched_list for all HW IPs in one big array in struct amdgpu_device which makes amdgpu_ctx_init_entity() much more leaner. v2: fix a coding style issue do not use drm hw_ip const to populate amdgpu_ring_type enum v3: remove ctx reference and move sched array and num_sched to a struct use num_scheds to detect uninitialized scheduler list v4: use array_index_nospec for user space controlled variables fix possible checkpatch.pl warnings Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-16drm/amdgpu: disable gpu_sched load balancer for vcn jobsNirmoy Das1-4/+8
VCN HW doesn't support dynamic load balance on multiple instances for a context. This patch initializes VNC entities with only one drm_gpu_scheduler picked by drm_sched_pick_best(). Picking a drm_gpu_scheduler using drm_sched_pick_best() ensures that we do load balance among multiple contexts but not among multiple jobs in a context. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-16drm/amdgpu: fix switch-case indentationNirmoy Das1-41/+41
Fix switch-case indentation in amdgpu_ctx_init_entity() Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-10drm/amdgpu: do not set nil entry in compute_prio_schedNirmoy Das1-4/+11
If there are no high priority compute queues available then set normal priority sched array to compute_prio_sched[AMDGPU_GFX_PIPE_PRIO_HIGH] Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-09drm/amdgpu: change hw sched list on ctx priority overrideNirmoy Das1-4/+25
Switch to appropriate sched list for an entity on priority override. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-09drm/amdgpu: set compute queue priority at mqd_initNirmoy Das1-7/+46
We were changing compute ring priority while rings were being used before every job submission which is not recommended. This patch sets compute queue priority at mqd initialization for gfx8, gfx9 and gfx10. Policy: make queue 0 of each pipe as high priority compute queue High/normal priority compute sched lists are generated from set of high/normal priority compute queues. At context creation, entity of compute queue get a sched list from high or normal priority depending on ctx->priority Signed-off-by: Nirmoy Das <[email protected]> Acked-by: Alex Deucher <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-30drm/amdgpu: allocate entities on demandNirmoy Das1-115/+120
Currently we pre-allocate entities and fences for all the HW IPs on context creation and some of which are might never be used. This patch tries to resolve entity/fences wastage by creating entity only when needed. v2: allocate memory for entity and fences together Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-22drm/amdgpu: individualize fence allocation per entityNirmoy Das1-17/+27
Allocate fences for each entity and remove ctx->fences reference as fences should be bound to amdgpu_ctx_entity instead amdgpu_ctx. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-23drm/amdgpu: fix ctx init failure for asics without gfx ringLe Ma1-1/+2
This workaround does not affect other asics because amdgpu only need expose one gfx sched to user for now. Signed-off-by: Le Ma <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18amd/amdgpu: add sched array to IPs with multiple run-queuesNirmoy Das1-44/+69
This sched array can be passed on to entity creation routine instead of manually creating such sched array on every context creation. v2: squash in missing break fix Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18drm/scheduler: rework entity creationNirmoy Das1-3/+4
Entity currently keeps a copy of run_queue list and modify it in drm_sched_entity_set_priority(). Entities shouldn't modify run_queue list. Use drm_gpu_scheduler list instead of drm_sched_rq list in drm_sched_entity struct. In this way we can select a runqueue based on entity/ctx's priority for a drm scheduler. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-11drm/amdgpu: fix JPEG instance checking when ctx initLeo Liu1-1/+1
Use proper structure. Fixes: 0388aee766376ed ("drm/amdgpu: use the JPEG structure for general driver support") Signed-off-by: Leo Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: James Zhu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: use the JPEG structure for general driver supportLeo Liu1-2/+2
JPEG1.0 will be functional along with VCN1.0 Signed-off-by: Leo Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-07Revert "drm/amdgpu: dont schedule jobs while in reset"Andrey Grodzovsky1-4/+1
This reverts commit 89b3d86403f1025f6b430d8f9ffc590efbadce62. We will do a proper fix in next patch. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-06drm/amdgpu: dont schedule jobs while in resetShirish S1-1/+4
[Why] doing kthread_park()/unpark() from drm_sched_entity_fini while GPU reset is in progress defeats all the purpose of drm_sched_stop->kthread_park. If drm_sched_entity_fini->kthread_unpark() happens AFTER drm_sched_stop->kthread_park nothing prevents from another (third) thread to keep submitting job to HW which will be picked up by the unparked scheduler thread and try to submit to HW but fail because the HW ring is deactivated. [How] grab the reset lock before calling drm_sched_entity_fini() Signed-off-by: Shirish S <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-23drm/amdgpu: correct ras error count typeGuchun Chen1-1/+1
Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-22drm/amdgpu: fix dma_fence_wait without referenceChristian König1-12/+15
We need to grab a reference to the fence we wait for. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-15drm/amdgpu: use exiting amdgpu_ctx_total_num_entities functionKevin Wang1-4/+1
simplify driver code. Signed-off-by: Kevin Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-15drm/amdgpu: fix typo error amdgput -> amdgpuKevin Wang1-6/+6
fix typo error: change function name from "amdgput_ctx_total_num_entities" to "amdgpu_ctx_total_num_entities". Signed-off-by: Kevin Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-18drm/amdgpu:add all VCN rings into schedule request queueJames Zhu1-8/+18
Add all VCN instances' decode/encode/jpeg decode rings into drm_sched_rq list. Signed-off-by: James Zhu <[email protected]> Reviewed-by: Leo Liu <[email protected]> Reviewed-by: Boyuan Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-18drm/amdgpu: modify amdgpu_vcn to support multiple instancesJames Zhu1-3/+3
Arcturus has dual-VCN. Need Restruct amdgpu_device::vcn to support multiple vcns. There are no any logical changes here Signed-off-by: James Zhu <[email protected]> Reviewed-by: Leo Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-06-10drm/amd: drop use of drmP.h in amdgpu/amdgpu*Sam Ravnborg1-1/+0
Drop use of drmP.h in all files named amdgpu* in drm/amd/amdgpu/ Fix fallout. Signed-off-by: Sam Ravnborg <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: "David (ChunMing) Zhou" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-03-19drm/amdgpu: wait for VM to become idle during flushChristian König1-3/+3
Make sure that not only the entities are flush, but that we also wait for the HW to finish all processing. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19drm/amdgpu: remove non-sense NULL ptr checkChristian König1-10/+0
It's a bug having a dead pointer in the IDR, silently returning is the worst we can do. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2xinhui pan1-0/+17
Add AMDGPU_CTX_QUERY2_FLAGS_RAS_CE/UE which indicate if any error happened between previous query and this query. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-02-15drm/amdgpu: Only add rqs for initialized rings.Bas Nieuwenhuizen1-3/+8
I don't see another way to figure out if a ring is initialized if the hardware block might not be initialized. Entities have been fixed up to handle num_rqs = 0. Signed-off-by: Bas Nieuwenhuizen <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-12-10drm/amdgpu: Limit vm max ctx number to 4096Rex Zhu1-1/+1
driver need to reserve resource for each ctx for some hw features. so add this limitation. Reviewed-by: Christian König <[email protected]> Signed-off-by: Rex Zhu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-11-30drm/amdgpu: add VCN JPEG support amdgpu_ctx_num_entitiesAlex Deucher1-0/+1
Looks like it was missed when setting support was added. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: amdgpu_ctx_add_fence can't failChristian König1-5/+3
No more waiting for a fence done here. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Reviewed-by: Junwei Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: rework ctx entity creationChristian König1-142/+149
Use a fixed number of entities for each hardware IP. The number of compute entities is reduced to four, SDMA keeps it two entities and all other engines just expose one entity. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: use entity instead of ring for CSChristian König1-23/+30
Further demangle ring from entity handling. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: remove the queue managerChristian König1-6/+61
Not needed any more since that is now done by the scheduler. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: use scheduler load balancing for compute CSChristian König1-1/+9
Start to use the scheduler load balancing for userspace compute command submissions. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/amdgpu: use scheduler load balancing for SDMA CSChristian König1-4/+21
Start to use the scheduler load balancing for userspace SDMA command submissions. Signed-off-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-08-27drm/scheduler: fix setting the priorty for entities (v2)Christian König1-3/+1
Since we now deal with multiple rq we need to update all of them, not just the current one. v2: Trivial: Removed unused variable (Alex) Signed-off-by: Christian König <[email protected]> Acked-by: Nayan Deshmukh <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-07-25drm/scheduler: modify API to avoid redundancyNayan Deshmukh1-8/+5
entity has a scheduler field and we don't need the sched argument in any of the functions where entity is provided. Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Eric Anholt <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-07-13drm/scheduler: modify args of drm_sched_entity_initNayan Deshmukh1-2/+2
replace run queue by a list of run queues and remove the sched arg as that is part of run queue itself Signed-off-by: Nayan Deshmukh <[email protected]> Reviewed-by: Christian König <[email protected]> Acked-by: Eric Anholt <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-07-05drm/amdgpu: Rename entity cleanup finctions.Andrey Grodzovsky1-3/+3
Everything in the flush code path (i.e. waiting for SW queue to become empty) names with *_flush() and everything in the release code path names *_fini() Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-07-05drm/scheduler: Rename cleanup functions v2.Andrey Grodzovsky1-4/+4
Everything in the flush code path (i.e. waiting for SW queue to become empty) names with *_flush() and everything in the release code path names *_fini() This patch also effect the amdgpu and etnaviv drivers which use those functions. v2: Also pplay the change to vd3. Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Christian König <[email protected]> Acked-by: Lucas Stach <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-06-15drm/amdgpu: move amdgpu_ctx_mgr_entity_fini to f_ops flush hook (V4)Andrey Grodzovsky1-6/+8
With this we can now terminate jobs enqueue into SW queue the moment the task is being killed instead of waiting for last user of drm file to release it. Also stop checking for kref_read(&ctx->refcount) == 1 when calling drm_sched_entity_do_release since other task might still hold a reference to this entity but we don't care since KILL means terminate job submission regardless of what other tasks are doing. v2: Use returned remaining timeout as parameter for the next call. Rebase. v3: Switch to working with jiffies. Streamline remainder TO usage. Rebase. v4: Rebase. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-05-18drm/amdgpu: Skip drm_sched_entity related ops for KIQ ring.Andrey Grodzovsky1-3/+18
Following change 75fbed2 we never initialize or use the GPU scheduler for KIQ and hence we need to skip KIQ ring when iterating amdgpu_ctx's scheduler entites. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-05-15drm/scheduler: remove unused parameterNayan Deshmukh1-1/+1
this patch also effect the amdgpu and etnaviv drivers which use the function drm_sched_entity_init Signed-off-by: Nayan Deshmukh <[email protected]> Suggested-by: Christian König <[email protected]> Acked-by: Lucas Stach <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-05-15drm/amdgpu: Switch to interruptable wait to recover from ring hang.Andrey Grodzovsky1-2/+4
v2: Use dma_fence_wait instead of dma_fence_wait_timeout(...,MAX_SCHEDULE_TIMEOUT) Avoid printing error message for ERESTARTSYS Originally-by: David Panariti <[email protected]> Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2018-05-15drm/gpu-sched: fix force APP kill hang(v4)Emily Deng1-8/+56
issue: there are VMC page fault occurred if force APP kill during 3dmark test, the cause is in entity_fini we manually signal all those jobs in entity's queue which confuse the sync/dep mechanism: 1)page fault occurred in sdma's clear job which operate on shadow buffer, and shadow buffer's Gart table is cleaned by ttm_bo_release since the fence in its reservation was fake signaled by entity_fini() under the case of SIGKILL received. 2)page fault occurred in gfx' job because during the lifetime of gfx job we manually fake signal all jobs from its entity in entity_fini(), thus the unmapping/clear PTE job depend on those result fence is satisfied and sdma start clearing the PTE and lead to GFX page fault. fix: 1)should at least wait all jobs already scheduled complete in entity_fini() if SIGKILL is the case. 2)if a fence signaled and try to clear some entity's dependency, should set this entity guilty to prevent its job really run since the dependency is fake signaled. v2: splitting drm_sched_entity_fini() into two functions: 1)The first one is does the waiting, removes the entity from the runqueue and returns an error when the process was killed. 2)The second one then goes over the entity, install it as completion signal for the remaining jobs and signals all jobs with an error code. v3: 1)Replace the fini1 and fini2 with better name 2)Call the first part before the VM teardown in amdgpu_driver_postclose_kms() and the second part after the VM teardown 3)Keep the original function drm_sched_entity_fini to refine the code. v4: 1)Rename entity->finished to entity->last_scheduled; 2)Rename drm_sched_entity_fini_job_cb() to drm_sched_entity_kill_jobs_cb(); 3)Pass NULL to drm_sched_entity_fini_job_cb() if -ENOENT; 4)Replace the type of entity->fini_status with "int"; 5)Remove the check about entity->finished. Signed-off-by: Monk Liu <[email protected]> Signed-off-by: Emily Deng <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>