path: root/drivers/gpu/drm/amd/amdgpu
Age | Commit message | Author | Files | Lines
2022-02-11 | drm/amdgpu: adjust register address calculation | Stanley.Yang | 1 | -0/+7
The UMC_STATUS register is not linear; adjust the offset calculation formula to get the correct address. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: skipping SDMA hw_init and hw_fini for S0ix. | Rajib Mahapatra | 1 | -0/+8
[Why] The SDMA ring buffer test fails if suspend is aborted during S0i3 resume. [How] If suspend is aborted for some reason during the S0i3 resume cycle, the SDMA ring test fails and amdgpu resume reports errors. For RN/CZN/Picasso, the SMU saves and restores the SDMA registers during the S0ix cycle, so skipping SDMA suspend and resume in the driver solves the issue. With this change, the system is able to resume gracefully even if the suspend is aborted. Reviewed-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Rajib Mahapatra <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: remove ctx->lock | Ken Xue | 3 | -8/+11
KMD reports a warning about holding a lock in drm_syncobj_find_fence when running the amdgpu_test case “syncobj timeline test”. ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence" calls and to avoid a dead reservation lock from GPU reset. Since the latest GPU reset no longer holds a reservation lock, ctx->lock can simply be removed, and concurrent "amdgpu_ctx_wait_prev_fence" calls can also be prevented by the PD root bo reservation lock.
call stacks:
=================
//hold lock
amdgpu_cs_ioctl->amdgpu_cs_parser_init->mutex_lock(&parser->ctx->lock);
…
//report warning
amdgpu_cs_dependencies->amdgpu_cs_process_syncobj_timeline_in_dep \
->amdgpu_syncobj_lookup_and_add_to_sync -> drm_syncobj_find_fence \
-> lockdep_assert_none_held_once
…
amdgpu_cs_ioctl->amdgpu_cs_parser_fini->mutex_unlock(&parser->ctx->lock);
Signed-off-by: Ken Xue <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: Reset OOB table error count info | Stanley.Yang | 1 | -0/+3
The OOB table error count info should be reset after resetting the EEPROM table. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: loose check for umc poison mode | Tao Zhou | 1 | -13/+5
There is no need to check the poison setting for each channel; checking umc0 channel0 is enough. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: add support for GC 10.1.4 | Lang Yu | 6 | -3/+23
Add basic support for GC 10.1.4; it uses the same IP blocks as GC 10.1.3. Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Huang Rui <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-11 | drm/amdgpu: Fix htmldoc warning | Andrey Grodzovsky | 1 | -1/+1
Update function name. Signed-off-by: Andrey Grodzovsky <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-02-10 | drm/amdgpu: Fix compile error. | Andrey Grodzovsky | 1 | -2/+1
Seems I forgot to add this to the relevant commit when submitting. Signed-off-by: Andrey Grodzovsky <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-02-09 | drm/amdgpu: fix gmc init fail in sriov mode | Yang Wang | 2 | -4/+6
"adev->gfx.rlc.rlcg_reg_access_supported = true;" The above variable was set too late during driver initialization, which causes the driver to fail to read/write registers during GMC hw init in sriov mode. Move the gfx_xxx_init_rlcg_reg_access_ctrl() function to the gfx early init stage to avoid this issue. Fixes: 5d447e29670148 ("drm/amdgpu: add helper for rlcg indirect reg access") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-09 | drm/amd/amdgpu/amdgpu_uvd: Fix forgotten unmap buffer object | zhanglianjie | 1 | -0/+2
After the buffer object is successfully mapped, call amdgpu_bo_kunmap before the function returns. Signed-off-by: zhanglianjie <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-09 | drm/amdkfd: Remove unused old debugger implementation | Mukul Joshi | 8 | -209/+0
Clean up the kfd code by removing the unused old debugger implementation. The address watch was only ever implemented in the upstream driver for GFXv7 (Kaveri). The user mode tools runtime using this API was never open-sourced. Work on the old debugger prototype that used this API was discontinued years ago. Only a small piece for resetting wavefronts is kept and is moved to kfd_device_queue_manager.c. Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-09 | drm/amdgpu: add utcl2_harvest to gc 10.3.1 | Aaron Liu | 1 | -1/+6
Confirmed with hardware team, there is harvesting for gc 10.3.1. Signed-off-by: Aaron Liu <[email protected]> Reviewed-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-09 | drm/amdgpu: Revert 'drm/amdgpu: annotate a false positive recursive locking' | Andrey Grodzovsky | 2 | -10/+4
Since we have a single instance of reset semaphore which we lock only once even for XGMI hive we don't need the nested locking hint anymore. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74120.html
2022-02-09 | drm/amdgpu: Rework amdgpu_device_lock_adev | Andrey Grodzovsky | 3 | -17/+43
This function needs to be split into two parts: one is called only once to lock the single instance of the reset_domain's sem and reset flag, and the other, which handles MP1 states, should still be called for each device in the XGMI hive. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
2022-02-09 | drm/amdgpu: Move in_gpu_reset into reset_domain | Andrey Grodzovsky | 6 | -12/+15
We should have a single instance per entire reset domain. Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
2022-02-09 | drm/amdgpu: Move reset sem into reset_domain | Andrey Grodzovsky | 10 | -37/+46
We want a single instance of the reset sem across all reset clients because, in the case of XGMI, we should stop cross-device MMIO access, since any of the devices could be in reset at that moment. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
2022-02-09 | drm/amdgpu: Rework reset domain to be refcounted. | Andrey Grodzovsky | 9 | -34/+140
The reset domain now contains the register access semaphore, so it needs to be present as long as each device in a hive needs it and therefore cannot be bound to the XGMI hive life cycle. Address this by making the reset domain refcounted and pointed to by each member of the hive and by the hive itself. v4: Fix crash on boot with XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if the previous one was of single device type, meaning it's the first boot. Otherwise it will take a refcount on the existing reset_domain from the amdgpu device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
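The refcounting idea in this commit can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the names `reset_domain_sketch`, `domain_create`, `domain_get` and `domain_put` are hypothetical, and C11 atomics stand in for whatever refcount primitive the driver actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reset domain shared by several devices. */
enum domain_type { SINGLE_DEVICE, XGMI_HIVE };

struct reset_domain_sketch {
    atomic_int refcount;
    enum domain_type type;
};

/* Demonstration-only bookkeeping so the lifetime can be observed. */
static int live_domains;

/* Create a domain holding one reference owned by the caller. */
static struct reset_domain_sketch *domain_create(enum domain_type type)
{
    struct reset_domain_sketch *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    atomic_init(&d->refcount, 1);
    d->type = type;
    live_domains++;
    return d;
}

/* Take an extra reference, e.g. for each device that joins the hive. */
static void domain_get(struct reset_domain_sketch *d)
{
    atomic_fetch_add(&d->refcount, 1);
}

/* Drop a reference; returns true if this was the last one and d was freed. */
static bool domain_put(struct reset_domain_sketch *d)
{
    if (atomic_fetch_sub(&d->refcount, 1) == 1) {
        free(d);
        live_domains--;
        return true;
    }
    return false;
}
```

With this shape the domain outlives whichever holder goes away first, which is the property the commit message asks for: the object is released only when the last holder calls the put wrapper.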
2022-02-09 | drm/amdgpu: Drop concurrent GPU reset protection for device | Andrey Grodzovsky | 1 | -82/+7
Since all GPU resets are now serialized there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74119.html
2022-02-09 | drm/amdgpu: Drop hive->in_reset | Andrey Grodzovsky | 3 | -20/+1
Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74115.html
2022-02-09 | drm/amd/virt: For SRIOV send GPU reset directly to TDR queue. | Andrey Grodzovsky | 3 | -9/+18
No need to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Prevent queuing flr_work from mailbox irq if the guest is already executing a reset. Suggested-by: Liu Shaoyun <[email protected]> Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Liu Shaoyun <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
2022-02-09 | drm/amdgpu: Serialize non TDR gpu recovery with TDRs | Andrey Grodzovsky | 3 | -2/+35
Use the reset domain wq also for non TDR gpu recovery triggers such as sysfs and RAS. We must serialize all possible GPU recoveries to guarantee no concurrency there. For TDR call the original recovery function directly since it's already executed from within the wq. For others just use a wrapper to queue work and wait on it to finish. v2: Rename to amdgpu_recover_work_struct Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74113.html
2022-02-09 | drm/amdgpu: Move scheduler init to after XGMI is ready | Andrey Grodzovsky | 4 | -43/+56
Before we initialize the schedulers we must know which reset domain we are in: for a single device there is a single domain per device and so a single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
2022-02-09 | drm/amdgpu: Introduce reset domain | Andrey Grodzovsky | 4 | -1/+35
Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Daniel Vetter <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74111.html
2022-02-08 | drm/amdgpu: use dma_fence_chain_contained | Christian König | 1 | -3/+2
Instead of manually extracting the fence. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-02-07 | drm/amdgpu: drop experimental flag on aldebaran | Alex Deucher | 1 | -4/+4
These have been at production level for a while. Drop the flag. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: reserve the pd while cleaning up PRTs | Christian König | 1 | -2/+4
We want to have lockdep annotation here, so make sure that we reserve the PD while removing PRTs even if it isn't strictly necessary since the VM object is about to be destroyed anyway. Signed-off-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: move lockdep assert to the right place. | Christian König | 1 | -2/+2
Since newly added BOs don't have any mappings it's ok to add them without holding the VM lock. Only when we add per VM BOs the lock is mandatory. Signed-off-by: Christian König <[email protected]> Reported-by: Bhardwaj, Rajneesh <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: check the GART table before invalidating TLB | Aaron Liu | 1 | -0/+3
Bypass group programming (utcl2_harvest) aims to forbid UTCL2 from sending invalidation commands to harvested SE/SA. Once an invalidation command reaches a harvested SE/SA, the SE/SA does not respond and the system hangs. This patch adds a check that the GART table is already allocated before invalidating the TLB. The new procedure is as follows:
1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT BOs can be allocated, but GART mappings are still ignored.
2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates the GART backing store.
3. Initializing the hardware, and programming the backing store into VMID0 for all VMHUBs.
4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with the GTT allocations done before it was allocated.
Signed-off-by: Christian König <[email protected]> Signed-off-by: Aaron Liu <[email protected]> Acked-by: Huang Rui <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: add utcl2_harvest to gc 10.3.1 | Aaron Liu | 1 | -1/+6
Confirmed with hardware team, there is harvesting for gc 10.3.1. Signed-off-by: Aaron Liu <[email protected]> Reviewed-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: fix list add issue in vram reserve | Tao Zhou | 1 | -1/+1
The parameter order in the list_add_tail call is incorrect, which causes a RAS reserved page to be reused. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
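To see why the argument order matters, here is a minimal user-space sketch of a circular doubly linked list in the style of the kernel's `struct list_head`. The helper names `init_list_head`, `list_add_tail_sketch` and `list_len` are local to this example, and the swapped call illustrates one way such a mistake can lose entries; it is not a reconstruction of the actual driver bug.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, modelled on the kernel's. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert `entry` immediately before `head`, i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_head *entry,
                                 struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Count the entries reachable from `head`, not counting `head` itself. */
static size_t list_len(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

The first argument is the new entry and the second is the list head. Calling `list_add_tail_sketch(&head, &node)` instead splices the head into the node's own one-element list, so entries that were previously on the head's list are no longer reachable from it.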
2022-02-07 | Revert "drm/amdgpu: Add judgement to avoid infinite loop" | yipechai | 1 | -4/+0
The commit d5e8ff5f7b2a ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop") had fixed this defect. Revert workaround commit a2170b4af62f ("drm/amdgpu: Add judgement to avoid infinite loop"). Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: Fixed the defect of soft lock caused by infinite loop | yipechai | 2 | -7/+31
1. The infinite loop case only occurs with multiple cards that support RAS functions.
2. For the explanation of the root cause, refer to commit 76641cbbf196 ("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create a new node to manage each unique RAS instance to guarantee that each device's .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block interface for each ras block").
5. The soft locked logs are as follows:
[ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu
[ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[ 262.166267] Call Trace:
[ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[ 262.166529] ? psi_task_switch+0xd2/0x250
[ 262.166537] ? __switch_to+0x11d/0x460
[ 262.166542] ? __switch_to_asm+0x36/0x70
[ 262.166549] process_one_work+0x220/0x3c0
[ 262.166556] worker_thread+0x4d/0x3f0
[ 262.166560] ? process_one_work+0x3c0/0x3c0
[ 262.166563] kthread+0x12b/0x150
[ 262.166568] ? set_kthread_struct+0x40/0x40
[ 262.166571] ret_from_fork+0x22/0x30
Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: Set FRU bus for Aldebaran and Vega 20 | Luben Tuikov | 1 | -1/+2
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20 ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU is sought through that bus "pointer" and not through the RAS bus "pointer". Cc: Roy Sun <[email protected]> Cc: Alex Deucher <[email protected]> Fixes: 2f60dd50769efc ("drm/amd: Expose the FRU SMU I2C bus") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: Fix recursive locking warning | Rajneesh Bhardwaj | 1 | -1/+2
Noticed the below warning while running a pytorch workload on vega10 GPUs. Change to trylock to avoid conflicts with already held reservation locks.
[ +0.000003] WARNING: possible recursive locking detected
[ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[ +0.000004] --------------------------------------------
[ +0.000002] python/4822 is trying to acquire lock:
[ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000203] but task is already holding lock:
[ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000017] other info that might help us debug this:
[ +0.000002] Possible unsafe locking scenario:
[ +0.000003] CPU0
[ +0.000002] ----
[ +0.000002] lock(reservation_ww_class_mutex);
[ +0.000004] lock(reservation_ww_class_mutex);
[ +0.000003] *** DEADLOCK ***
[ +0.000002] May be due to missing lock nesting notation
[ +0.000003] 7 locks held by python/4822:
[ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at: kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[ +0.000236] #3: ffffb2b35606fd28 (reservation_ww_class_acquire){+.+.}-{0:0}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[ +0.000235] #4: ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at: drm_dev_enter+0x5/0xa0 [drm]
[ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3}, at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[ +0.000195] stack backtrace:
[ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted 5.13.0-kfd-rajneesh #1030
[ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02 08/29/2018
[ +0.000003] Call Trace:
[ +0.000003] dump_stack+0x6d/0x89
[ +0.000010] __lock_acquire+0xb93/0x1a90
[ +0.000009] lock_acquire+0x25d/0x2d0
[ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
[ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] ? lock_release+0x13f/0x270
[ +0.000005] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
[ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
[ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60 [amdgpu]
[ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
[ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[ +0.000216] ? lock_release+0x13f/0x270
[ +0.000006] ? __fget_files+0x107/0x1e0
[ +0.000007] __x64_sys_ioctl+0x8b/0xd0
[ +0.000007] do_syscall_64+0x36/0x70
[ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ +0.000007] RIP: 0033:0x7fbff90a7317
[ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX: 00007fbff90a7317
[ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI: 0000000000000004
[ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09: 00007fbcc402d880
[ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12: 00000000c0184b18
[ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15: 00007fbcc402d820
Cc: Christian König <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
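The general pattern of switching a blocking acquire to a trylock can be sketched in user space as follows. This is an illustration under assumptions, not the driver change itself: `sketch_lock`, `sketch_trylock`, `sketch_unlock` and `release_notify_sketch` are hypothetical names, and a C11 atomic flag stands in for the reservation lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock: true while some caller holds it. */
struct sketch_lock {
    atomic_bool held;
};

/* Try to take the lock without blocking; returns true on success. */
static bool sketch_trylock(struct sketch_lock *l)
{
    return !atomic_exchange(&l->held, true);
}

static void sketch_unlock(struct sketch_lock *l)
{
    atomic_store(&l->held, false);
}

/*
 * Hypothetical release hook: perform the cleanup only if the lock can be
 * taken immediately, and skip it instead of blocking when it is contended.
 */
static bool release_notify_sketch(struct sketch_lock *resv, int *cleanups)
{
    if (!sketch_trylock(resv))
        return false; /* already held somewhere up the call chain */
    (*cleanups)++;
    sketch_unlock(resv);
    return true;
}
```

The trade-off is that the guarded work may be skipped when the lock is busy, so this pattern only fits cleanup that is safe to omit or defer in that case.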
2022-02-07 | drm/amdgpu: Prevent random memory access in FRU code | Luben Tuikov | 1 | -10/+12
Prevent random memory access in the FRU EEPROM code by passing the size of the destination buffer to the reading routine, and reading no more than the size of the buffer. Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
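Passing the destination size to the reading routine can be sketched like this. The function `read_field_bounded` and its parameters are hypothetical and only illustrate the technique of clamping a device-reported length to the caller's buffer; the actual FRU code reads over I2C and has a different signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical bounded reader: copy a field of `field_len` bytes from `src`
 * into `buf`, never writing more than `buf_size` bytes including the
 * terminating NUL. Returns the number of payload bytes copied.
 */
static size_t read_field_bounded(const unsigned char *src, size_t field_len,
                                 char *buf, size_t buf_size)
{
    size_t n;

    if (buf_size == 0)
        return 0;
    /* Clamp the length reported by the device to what the caller can hold. */
    n = field_len < buf_size - 1 ? field_len : buf_size - 1;
    memcpy(buf, src, n);
    buf[n] = '\0';
    return n;
}
```

Because the clamp happens inside the reader, a field that claims to be longer than the destination is truncated rather than written past the end of the buffer.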
2022-02-07 | drm/amdgpu: Don't offset by 2 in FRU EEPROM | Luben Tuikov | 1 | -9/+4
Read buffers no longer expose the I2C address, and so we don't need to offset by two when we get the read data. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3") Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: Nerf "buff" to "buf" | Luben Tuikov | 1 | -14/+14
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er). This is consistent with the rest of the kernel code. Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdkfd: CRIU Implement KFD resume ioctl | Rajneesh Bhardwaj | 2 | -4/+55
This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing a CRIU restore, MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process, i.e. the criu_resume ioctl op is received and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since it's called by the CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdkfd: CRIU Implement KFD checkpoint ioctl | Rajneesh Bhardwaj | 4 | -0/+34
This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: Print once if RAS unsupported | Luben Tuikov | 1 | -8/+8
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes the dmesg log to fill up with the same message, e.g., [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function. Make it dev_dbg_once(), as it isn't something correctible during boot or thereafter, so printing just once is sufficient. Also sanitize the message. Cc: Alex Deucher <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: John Clements <[email protected]> Cc: Tao Zhou <[email protected]> Cc: yipechai <[email protected]> Fixes: 8b0fb0e967c1 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
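The print-once behaviour can be approximated in user space with a static flag per call site. The macro `LOG_ONCE`, the counter `messages_emitted` and the function `poll_ras_sketch` are hypothetical names for this sketch; the kernel's dev_dbg_once() is the real facility the commit switches to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Demonstration-only counter of messages that were actually printed. */
static int messages_emitted;

/* Print the message the first time this call site is reached, then stay quiet. */
#define LOG_ONCE(...)                          \
    do {                                       \
        static bool already_logged_;           \
        if (!already_logged_) {                \
            already_logged_ = true;            \
            fprintf(stderr, __VA_ARGS__);      \
            messages_emitted++;                \
        }                                      \
    } while (0)

/* Hypothetical stand-in for a query that user space polls repeatedly. */
static void poll_ras_sketch(void)
{
    LOG_ONCE("RAS is not supported on this block\n");
}
```

Each expansion of the macro gets its own static flag, so repeated polling through the same call site produces a single log line. This sketch is not thread-safe; it only shows the once-per-call-site idea.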
2022-02-07 | drm/amdgpu: rename amdgpu_vm_bo_rmv to _del | Christian König | 6 | -9/+9
Some people complained about the name, and this matches the Linux naming conventions for object functions much better. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Daniel Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm/amdgpu: add some lockdep checks to the VM code | Christian König | 1 | -0/+6
Whenever a bo_va structure is added or removed the VM and eventually added BO should be locked. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Daniel Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07 | drm: Convert open-coded yes/no strings to yesno() | Lucas De Marchi | 1 | -1/+3
linux/string_helpers.h provides a helper to return "yes"/"no" strings. Replace the open coded versions with str_yes_no(). The places were identified with the following semantic patch:
@@
expression b;
@@
- b ? "yes" : "no"
+ str_yes_no(b)
Then the includes were added, so we include-what-we-use, and parenthesis adjusted in drivers/gpu/drm/v3d/v3d_debugfs.c. After the conversion we still see the same binary sizes:
   text    data     bss     dec     hex filename
  51149    3295     212   54656    d580 virtio/virtio-gpu.ko.old
  51149    3295     212   54656    d580 virtio/virtio-gpu.ko
1441491   60340     800 1502631  16eda7 radeon/radeon.ko.old
1441491   60340     800 1502631  16eda7 radeon/radeon.ko
6125369  328538   34000 6487907  62ff63 amd/amdgpu/amdgpu.ko.old
6125369  328538   34000 6487907  62ff63 amd/amdgpu/amdgpu.ko
 411986   10490    6176  428652   68a6c drm.ko.old
 411986   10490    6176  428652   68a6c drm.ko
  98129    1636     264  100029   186bd dp/drm_dp_helper.ko.old
  98129    1636     264  100029   186bd dp/drm_dp_helper.ko
1973432  109640    2352 2085424  1fd230 nouveau/nouveau.ko.old
1973432  109640    2352 2085424  1fd230 nouveau/nouveau.ko
Signed-off-by: Lucas De Marchi <[email protected]> Reviewed-by: Jani Nikula <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
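The conversion described in this commit is mechanical, as a user-space replica of the helper shows. The function name `str_yes_no_sketch` and the `format_flag` wrapper are assumptions made so the example compiles outside the kernel; the in-kernel helper lives in linux/string_helpers.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* User-space replica of the helper: map a boolean to "yes" or "no". */
static inline const char *str_yes_no_sketch(bool v)
{
    return v ? "yes" : "no";
}

/* Hypothetical caller: format a flag the way a debugfs line might. */
static int format_flag(char *buf, size_t size, const char *name, bool enabled)
{
    /* Before the conversion this would read: enabled ? "yes" : "no" */
    return snprintf(buf, size, "%s: %s", name, str_yes_no_sketch(enabled));
}
```

Since the helper is a trivial inline function, it is plausible that the generated code is unchanged, which matches the identical binary sizes reported in the commit message.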
2022-02-07 | Merge remote-tracking branch 'drm/drm-next' into drm-misc-next | Maarten Lankhorst | 79 | -809/+1339
First backmerge into drm-misc-next. Required for more helpers backmerged, and to pull in 5.17 (rc2). Signed-off-by: Maarten Lankhorst <[email protected]>
2022-02-02 | drm/amdgpu: fix logic inversion in check | Christian König | 1 | -1/+1
We probably never trigger this, but the logic inside the check is inverted. Signed-off-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02 | drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled | Mario Limonciello | 1 | -2/+1
dGPUs connected to Intel systems configured for suspend to idle will not have the power rails cut at suspend and resetting the GPU may lead to problematic behaviors. Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879 Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02 | drm/amdgpu: fix a potential GPU hang on cyan skillfish | Lang Yu | 1 | -0/+3
We observed a GPU hang when querying GMC CG state (i.e., cat amdgpu_pm_info) on cyan skillfish. Actually, cyan skillfish doesn't support any CG features. Just prevent it from accessing GMC CG registers. Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02 | drm/amd: Only run s3 or s0ix if system is configured properly | Mario Limonciello | 1 | -0/+8
This will cause misconfigured systems to not run the GPU suspend routines.
* In APUs that are properly configured, the system will go into s2idle.
* In APUs that are intended to be S3 but where the user selects s2idle, the GPU will stay fully powered for the suspend.
* In APUs that are intended to be s2idle but where the system is misconfigured, the GPU will stay fully powered for the suspend.
* In systems that are intended to be s2idle, but where an AMD dGPU is also present, the dGPU will go through S3.
Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02 | drm/amd: add support to check whether the system is set to s3 | Mario Limonciello | 2 | -0/+15
This will be used to help make decisions on what to do in misconfigured systems. v2: squash in semicolon fix from Stephen Rothwell Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02 | drm/amdgpu: limit the number of dst address in trace | Somalapuram Amaranath | 2 | -4/+3
The trace_amdgpu_vm_update_ptes trace is unable to log when nptes is too large. Signed-off-by: Somalapuram Amaranath <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>