aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2020-08-24drm/amdgpu: refine create and release logic of hive infoDennis Li5-106/+131
Change to dynamically create and release hive info object, which help driver support more hives in the future. v2: Change to save hive object pointer in adev, to avoid locking xgmi_mutex every time when calling amdgpu_get_xgmi_hive. v3: 1. Change type of hive object pointer in adev from void* to amdgpu_hive_info*. 2. remove unnecessary variable initialization. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amdgpu: refine message print for devices of hiveDennis Li5-21/+21
Using dev_xxx instead of DRM_xxx/pr_xxx to indicate which device of a hive is the message for. Reviewed-by: Christian König <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amdgpu: fix the nullptr issue when reenter GPU recoveryDennis Li1-2/+3
in single gpu system, if driver reenter gpu recovery, amdgpu_device_lock_adev will return false, but hive is nullptr now. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amdgpu: change reset lock from mutex to rw_semaphoreDennis Li5-35/+32
clients don't need reset-lock for synchronization when no GPU recovery. v2: change to return the return value of down_read_killable. v3: if GPU recovery begin, VF ignore FLR notification. Reviewed-by: Monk Liu <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amd/display: remove unintended executable modeLukas Bulwahn2-0/+0
Besides the intended change, commit 4cc1178e166a ("drm/amdgpu: replace DRM prefix with PCI device info for gfx/mmhub") also set the source files mmhub_v1_0.c and gfx_v9_4.c to be executable, i.e., changed fromold mode 644 to new mode 755. Commit 241b2ec9317e ("drm/amd/display: Add dcn30 Headers (v2)") added the four header files {dpcs,dcn}_3_0_0_{offset,sh_mask}.h as executable, i.e., mode 755. Set to the usual modes for source and headers files and clean up those mistakes. No functional change. Reviewed-by: Christian König <[email protected]> Signed-off-by: Lukas Bulwahn <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/amdgpu: refine codes to avoid reentering GPU recoveryDennis Li24-60/+65
if other threads have holden the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-24drm/ttm: drop bus.size from bus placement.Dave Airlie1-1/+2
This is always calculated the same, and only used in a couple of places. Signed-off-by: Dave Airlie <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-08-24drm/ttm: init mem->bus in common code.Dave Airlie1-6/+0
The drivers all do the same thing here. Reviewed-by: Christian König <[email protected]> for both. Signed-off-by: Dave Airlie <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva4-4/+4
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <[email protected]>
2020-08-21Merge tag 'amd-drm-fixes-5.9-2020-08-20' of ↵Dave Airlie3-13/+23
git://people.freedesktop.org/~agd5f/linux into drm-fixes amd-drm-fixes-5.9-2020-08-20: amdgpu: - Fixes for Navy Flounder - Misc display fixes - RAS fix amdkfd: - SDMA fix for renoir Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-08-19Revert "drm/amdgpu: disable gfxoff for navy_flounder"Jiansong Chen1-3/+0
This reverts commit 9c9b17a7d19a8e21db2e378784fff1128b46c9d3. Newly released sdma fw (51.52) provides a fix for the issue. Signed-off-by: Jiansong Chen <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-19Merge tag 'amd-drm-fixes-5.9-2020-08-12' of ↵Dave Airlie4-1/+40
git://people.freedesktop.org/~agd5f/linux into drm-fixes amd-drm-fixes-5.9-2020-08-12: amdgpu: - Fix allocation size - SR-IOV fixes - Vega20 SMU feature state caching fix - Fix custom pptable handling - Arcturus golden settings update - Several display fixes Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-08-18drm/amdgpu/jpeg: remove redundant check when it returnsLeo Liu1-6/+1
Fix warning from kernel test robot v2: remove the local variable as well Signed-off-by: Leo Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdgpu: Limit the error info print ratejqdeng1-1/+2
Use function printk_ratelimit to limit the print rate. Signed-off-by: jqdeng <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdgpu: Fix repeatly flr issuejqdeng4-2/+32
Only for no job running test case need to do recover in flr notification. For having job in mirror list, then let guest driver to hit job timeout, and then do recover. Signed-off-by: jqdeng <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18Revert "drm/amdgpu: disable gfxoff for navy_flounder"Jiansong Chen1-3/+0
This reverts commit ba4e049e63b607ac2e0c070b1406826390d5047e. Newly released sdma fw (51.52) provides a fix for the issue. Signed-off-by: Jiansong Chen <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/scheduler: Remove priority macro INVALID (v2)Luben Tuikov3-14/+34
Remove DRM_SCHED_PRIORITY_INVALID. We no longer carry around an invalid priority and cut it off at the source. Backwards compatibility behaviour of AMDGPU CTX IOCTL passing in garbage for context priority from user space and then mapping that to DRM_SCHED_PRIORITY_NORMAL is preserved. v2: Revert "res" --> "r" and "prio" --> "priority". Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/scheduler: Scheduler priority fixes (v2)Luben Tuikov6-9/+9
Remove DRM_SCHED_PRIORITY_LOW, as it was used in only one place. Rename and separate by a line DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT as it represents a (total) count of said priorities and it is used as such in loops throughout the code. (0-based indexing is the the count number.) Remove redundant word HIGH in priority names, and rename *KERNEL* to *HIGH*, as it really means that, high. v2: Add back KERNEL and remove SW and HW, in lieu of a single HIGH between NORMAL and KERNEL. Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdkfd: fix the wrong sdma instance query for renoirHuang Rui1-9/+22
Renoir only has one sdma instance, it will get failed once query the sdma1 registers. So use switch-case instead of static register array. Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdgpu: parse ta firmware for navy_flounderBhawanpreet Lakha1-2/+1
Use the same case as sienna_cichlid Signed-off-by: Bhawanpreet Lakha <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdgpu: fix NULL pointer access issue when unloading driverGuchun Chen1-2/+0
When unloading driver by "modprobe -r amdgpu", one NULL pointer dereference bug occurs in ras debugfs releasing. The cause is the duplicated debugfs_remove, as drm debugfs_root dir has been cleaned up already by drm_minor_unregister. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 11 PID: 1526 Comm: modprobe Tainted: G OE 5.6.0-guchchen #1 Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 RIP: 0010:down_write+0x15/0x40 Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92 d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3 RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0 RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0 R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40 R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000 FS: 00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: simple_recursive_removal+0x63/0x370 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 amdgpu_ras_fini+0x82/0x230 [amdgpu] ? __kernfs_remove.part.17+0x101/0x1f0 ? kernfs_name_hash+0x12/0x80 amdgpu_device_fini+0x1c0/0x580 [amdgpu] amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu] amdgpu_pci_remove+0x36/0x60 [amdgpu] pci_device_remove+0x3b/0xb0 device_release_driver_internal+0xe5/0x1c0 driver_detach+0x46/0x90 bus_remove_driver+0x58/0xd0 pci_unregister_driver+0x29/0x90 amdgpu_exit+0x11/0x25 [amdgpu] __x64_sys_delete_module+0x13d/0x210 do_syscall_64+0x5f/0x250 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18drm/amdgpu: disable gfxoff for navy_flounderJiansong Chen1-0/+3
gfxoff is temporarily disabled for navy_flounder, since at present the feature has broken some basic amdgpu test. Signed-off-by: Jiansong Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-18Merge v5.9-rc1 into drm-misc-nextMaxime Ripard1-2/+0
Sam needs 5.9-rc1 to have dev_err_probe in to merge some patches. Signed-off-by: Maxime Ripard <[email protected]>
2020-08-17drm/amdgpu: add condition check for trace_amdgpu_cs()Kevin Wang1-3/+13
v1: add trace event enabled check to avoid nop loop when submit multi ibs in amdgpu_cs_ioctl() function. v2: add a new wrapper function to trace all amdgpu cs ibs. Signed-off-by: Kevin Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-17drm/amdgpu: fix amdgpu_bo_release_notify() comment errorKevin Wang1-1/+1
fix amdgpu_bo_release_notify() comment error. Signed-off-by: Kevin Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Dennis Li <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-17drm/amdkfd: fix the wrong sdma instance query for renoirHuang Rui1-9/+22
Renoir only has one sdma instance, it will get failed once query the sdma1 registers. So use switch-case instead of static register array. Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: note what type of reset we are usingAlex Deucher5-2/+13
When we reset the GPU, note what type of reset will be used. This makes debugging different reset scenarios more clear as the driver may use different reset methods depending on conditions on the system. Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: print where we get the vbios image fromAlex Deucher1-7/+21
ACPI, ROM, PCI BAR, etc. Acked-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: parse ta firmware for navy_flounderBhawanpreet Lakha1-2/+1
Use the same case as sienna_cichlid Signed-off-by: Bhawanpreet Lakha <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu/vcn3.0: only SIENNA_CICHLID need specify instance for dec/encJames Zhu1-2/+2
Only SIENNA_CICHLID(VCN3) has 2 unsymmetrical instances, there're less codecs on instance 1, we use 0 for decode and 1 for encode. Signed-off-by: James Zhu <[email protected]> Reviewed-by: Leo Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/pm: optimize the power related source code layoutEvan Quan15-19940/+5
The target is to provide a clear entry point(for power routines). Also this can help to maintain a clear view about the frameworks used on different ASICs. Hopefully all these can make power part more friendly to play with. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: put those exposed power interfaces in amdgpu_dpm.cEvan Quan4-431/+439
As other power interfaces. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: optimize i2c bus access implementationEvan Quan3-13/+20
The caller needs not care about the internal details how the powerplay API implemented. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: drop unnecessary pp_funcs checkerEvan Quan3-13/+6
It's redundant. Also, the callers should not care about the implementation details. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: optimize amdgpu_dpm_set_clockgating_by_smu() implementationEvan Quan4-32/+32
Cover the implementation details from outside(of power part). Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FSGuchun Chen1-0/+4
It can avoid potential build warn/error when CONFIG_DEBUG_FS is not set. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: fix NULL pointer access issue when unloading driverGuchun Chen1-2/+0
When unloading driver by "modprobe -r amdgpu", one NULL pointer dereference bug occurs in ras debugfs releasing. The cause is the duplicated debugfs_remove, as drm debugfs_root dir has been cleaned up already by drm_minor_unregister. BUG: kernel NULL pointer dereference, address: 00000000000000a0 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 11 PID: 1526 Comm: modprobe Tainted: G OE 5.6.0-guchchen #1 Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 RIP: 0010:down_write+0x15/0x40 Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92 d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3 RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0 RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0 R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40 R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000 FS: 00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: simple_recursive_removal+0x63/0x370 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 amdgpu_ras_fini+0x82/0x230 [amdgpu] ? __kernfs_remove.part.17+0x101/0x1f0 ? kernfs_name_hash+0x12/0x80 amdgpu_device_fini+0x1c0/0x580 [amdgpu] amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu] amdgpu_pci_remove+0x36/0x60 [amdgpu] pci_device_remove+0x3b/0xb0 device_release_driver_internal+0xe5/0x1c0 driver_detach+0x46/0x90 bus_remove_driver+0x58/0xd0 pci_unregister_driver+0x29/0x90 amdgpu_exit+0x11/0x25 [amdgpu] __x64_sys_delete_module+0x13d/0x210 do_syscall_64+0x5f/0x250 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: revert "fix system hang issue during GPU reset"Christian König34-452/+173
The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929. Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: enable swSMU mgpu fan boost supportEvan Quan1-1/+4
Enable mgpu fan boost feature on swSMU routines. Signed-off-by: Evan Quan <[email protected]> Acked-by: Nirmoy Das <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amd/powerplay: optimize the interface for mgpu fan boost enablementEvan Quan3-7/+16
Cover the implementation details from outside(of power). Also preparing for expanding this to swSMU. Signed-off-by: Evan Quan <[email protected]> Acked-by: Nirmoy Das <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: disable gfxoff for navy_flounderJiansong Chen1-0/+3
gfxoff is temporarily disabled for navy_flounder, since at present the feature has broken some basic amdgpu test. Signed-off-by: Jiansong Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: Use function pointer for some mmhub functionsOak Zeng12-121/+100
Add more function pointers to amdgpu_mmhub_funcs. ASIC specific implementation of most mmhub functions are called from a general function pointer, instead of calling different function for different ASIC. Simplify the code by deleting duplicate functions Signed-off-by: Oak Zeng <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: pass NULL pointer instead of 0Nirmoy Das1-1/+1
Fixes: c030f2e4166c3f55 ("drm/amdgpu: add amdgpu_ras.c to support ras (v2)") Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: annotate a false positive recursive lockingDennis Li1-3/+7
[ 584.110304] ============================================ [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.111164] -------------------------------------------- [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350] CPU0 [ 584.114685] ---- [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev has owned lock_class_key to avoid false positive recursive locking. v2: 1. register adev->lock_key into lockdep, otherwise lockdep will report the below warning [ 1216.705820] BUG: key ffff890183b647d0 has not been registered! [ 1216.705924] ------------[ cut here ]------------ [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 v3: change to use down_write_nest_lock to annotate the false dead-lock warning. Reviewed-by: Daniel Vetter <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: add debugfs interface for RAP testWenhui Sheng4-1/+161
After amdgpu driver loading successfully, we can use RAP debugfs interface <debugfs_dir>/dri/xxx/rap_test to trigger RAP test. Currently only L0 validate test is supported. v2: refine amdgpu_rap.h Signed-off-by: Wenhui Sheng <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: enable RAP TA loadWenhui Sheng3-0/+201
Enable the RAP TA loading path and add RAP test trigger interface. v2: fix potential mem leak issue Signed-off-by: Wenhui Sheng <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: add RAP TA header fileWenhui Sheng1-0/+84
The RAP TA contains tests used to verify if RAP(Register Access Policy), or otherwise known as Security Policy is applied correctly by PSP BL&TOS. The RAP test is a measure to ensure that we reduce the avenue of complexity and mistakes when dealing with RAP in post-si execution, where debugging failures related to RAP is quite difficult and expensive. v2: add introduction for RAP TA Signed-off-by: Wenhui Sheng <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)Tianci.Yin1-1/+7
On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfigure the golden settings after GFXOFF exit. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Tianci.Yin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: add interface amdgpu_gfx_init_spm_golden for Navi1xTianci.Yin2-9/+27
On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so reconfiguration is needed. Make the configuration code as an interface for future use. Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Signed-off-by: Tianci.Yin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14drm/amdgpu: add debugfs node to toggle ras error cnt harvestGuchun Chen1-0/+7
Before ras recovery is issued, user could operate this debugfs node to enable/disable the harvest of all RAS IPs' ras error count registers, which will help keep hardware's registers' status instead of cleaning up them. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>