aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
AgeCommit message (Collapse)AuthorFilesLines
2020-05-14drm/amdgpu: Update RAS XGMI error inject sequenceJohn Clements1-1/+29
Disable XGMI link power down prior to issuing a XGMI RAS error Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-05-05drm/amdgpu: allocate large structures dynamicallyArnd Bergmann1-10/+21
After the structure was padded to 1024 bytes, it is no longer suitable for being a local variable, as the function surpasses the warning limit for 32-bit architectures: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:587:5: error: stack frame size of 1072 bytes in function 'amdgpu_ras_feature_enable' [-Werror,-Wframe-larger-than=] int amdgpu_ras_feature_enable(struct amdgpu_device *adev, ^ Use kzalloc() instead to get it from the heap. Fixes: a0d254820f43 ("drm/amdgpu: update RAS TA to Host interface") Acked-by: Christian König <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-30drm/amdgpu: update RAS error handlingJohn Clements1-9/+31
Parse return status from TA to determine error severity Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-22drm/amdgpu: set error query ready after all IPs late initDennis Li1-5/+1
If set error query ready in amdgpu_ras_late_init, which will cause some IP blocks aren't initialized, but their error query is ready. Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-22drm/amdgpu: fix kernel page fault issue by ras recovery on sGPUGuchun Chen1-2/+3
When running ras uncorrectable error injection and triggering GPU reset on sGPU, below issue is observed. It's caused by the list uninitialized when accessing. [ 80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750 [ 80.047300] #PF: supervisor write access in kernel mode [ 80.047351] #PF: error_code(0x0003) - permissions violation [ 80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061 [ 80.047477] Oops: 0003 [#1] SMP PTI [ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1 [ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018 [ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu] Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-13drm/amdgpu: refine ras related message printGuchun Chen1-20/+31
Prefix ras related kernel message logging with PCI device info by replacing DRM_INFO/WARN/ERROR with dev_info/warn/err. This can clearly tell user about GPU device information where ras is. And add some other ras message printing to make it more clear and friendly as well. Suggested-by: Hawking Zhang <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-09drm/amdgpu: resolve mGPU RAS query instabilityJohn Clements1-5/+15
upon receiving uncorrectable error, query every GPU node for ras errors Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-01drm/amdgpu: fix non-pointer dereference for non-RAS supportedEvan Quan1-2/+2
Backtrace on gpu recover test on Navi10. [ 1324.516681] RIP: 0010:amdgpu_ras_set_error_query_ready+0x15/0x20 [amdgpu] [ 1324.523778] Code: 4c 89 f7 e8 cd a2 a0 d8 e9 99 fe ff ff 45 31 ff e9 91 fe ff ff 0f 1f 44 00 00 55 48 85 ff 48 89 e5 74 0e 48 8b 87 d8 2b 01 00 <40> 88 b0 38 01 00 00 5d c3 66 90 0f 1f 44 00 00 55 31 c0 48 85 ff [ 1324.543452] RSP: 0018:ffffaa1040e4bd28 EFLAGS: 00010286 [ 1324.549025] RAX: 0000000000000000 RBX: ffff911198b20000 RCX: 0000000000000000 [ 1324.556217] RDX: 00000000000c0a01 RSI: 0000000000000000 RDI: ffff911198b20000 [ 1324.563514] RBP: ffffaa1040e4bd28 R08: 0000000000001000 R09: ffff91119d0028c0 [ 1324.570804] R10: ffffffff9a606b40 R11: 0000000000000000 R12: 0000000000000000 [ 1324.578413] R13: ffffaa1040e4bd70 R14: ffff911198b20000 R15: 0000000000000000 [ 1324.586464] FS: 00007f4441cbf540(0000) GS:ffff91119ed80000(0000) knlGS:0000000000000000 [ 1324.595434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1324.601345] CR2: 0000000000000138 CR3: 00000003fcdf8004 CR4: 00000000003606e0 [ 1324.608694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1324.616303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1324.623678] Call Trace: [ 1324.626270] amdgpu_device_gpu_recover+0x6e7/0xc50 [amdgpu] [ 1324.632018] ? seq_printf+0x4e/0x70 [ 1324.636652] amdgpu_debugfs_gpu_recover+0x50/0x80 [amdgpu] [ 1324.643371] seq_read+0xda/0x420 [ 1324.647601] full_proxy_read+0x5c/0x90 [ 1324.652426] __vfs_read+0x1b/0x40 [ 1324.656734] vfs_read+0x8e/0x130 [ 1324.660981] ksys_read+0xa7/0xe0 [ 1324.665201] __x64_sys_read+0x1a/0x20 [ 1324.669907] do_syscall_64+0x57/0x1c0 [ 1324.674517] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1324.680654] RIP: 0033:0x7f44417cf081 Signed-off-by: Evan Quan <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-01drm/amdgpu: disable ras query and iject during gpu resetJohn Clements1-3/+21
added flag to ras context to indicate if ras query functionality is ready Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-20drm/amdgpu: protect RAS sysfs during GPU resetJohn Clements1-0/+9
MMHub EDC becomes dirty after BACO reset EDC registers should be cleared early on in reset phase Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-13drm/amdgpu: fix warning in ras_debugfs_create_all()Stanley.Yang1-5/+2
Fix the warning "warn: variable dereferenced before check 'obj' (see line 1131)" by removing unnecessary checks as amdgpu_ras_debugfs_create_all() is only called from amdgpu_debugfs_init() where obj member in con->head list is not NULL. Use list_for_each_entry() instead list_for_each_entry_safe() as obj do not to be freeing or removing from list during this process. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-13drm/amdgpu: update ras capability's query based on mem ecc configurationGuchun Chen1-6/+18
RAS support capability needs to be updated on top of different memeory ECC enablement, and remove redundant memory ecc check in gmc module for vega20 and arcturus. v2: check HBM ECC enablement and set ras mask accordingly. v3: avoid to invoke atomfirmware interface to query twice. Suggested-by: Hawking Zhang <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-10drm/amdgpu: call ras_debugfs_create_all in debugfs_initTao Zhou1-5/+0
and remove each ras IP's own debugfs creation this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-10drm/amdgpu: add function to creat all ras debugfs nodeTao Zhou1-0/+29
centralize all debugfs creation in one place for ras this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-06drm/amdgpu: enable PCS error report on VG20Hawking Zhang1-0/+3
Now driver will report XGMI/WAFL PCS error through sysfs xgmi_wafl_err_count node on Vega20 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-02-26drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.cHawking Zhang1-16/+4
centralize all the xgmi related function to amdgpu_xgmi.c Signed-off-by: Hawking Zhang <[email protected]> Acked-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-02-19drm/amdgpu: log on non-zero error conter per IP before GPU resetGuchun Chen1-0/+33
Once sync flood interrupt is triggered by RAS error, before actual GPU recovery job, it's necessary to log on and print non-zero error counter, this will help user knows where the RAS error source is from quickly. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-22drm/amdgpu: added support to get mGPU DRAM baseJohn Clements1-0/+20
resolves issue with RAS error injection in mGPU configuration Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-16drm/amdgpu: check if driver should try recovery in ras recovery pathHawking Zhang1-1/+2
To allow the flexibilty for user to disable gpu recovery in RAS recovery path by module parameter amdgpu_gpu_recovery Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-14drm/amdgpu: support error reporting for sdma ip blockHawking Zhang1-0/+8
invoke sdma query_ras_error_count to get sdma single bit error count Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-23drm/amdgpu: add missed return value set for error caseGuchun Chen1-0/+1
Return value should be set when going to error handle tag for error case, this can avoid potential invalid array access by upper caller. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18drm/amdgpu: Remove unneeded semicolon in amdgpu_ras.czhengbin1-1/+1
Fixes coccicheck warning: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon Reported-by: Hulk Robot <[email protected]> Signed-off-by: zhengbin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpuGuchun Chen1-2/+2
BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05drm/amdgpu: export amdgpu_ras_find_obj to use externallyLe Ma1-4/+1
Change it to external interface. Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: enable ras capablity check on arcturusHawking Zhang1-1/+2
check hw ras capablity via atomfirmware Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-06drm/amdgpu: Improve RAS documentation (v2)Alex Deucher1-7/+33
Clarify some areas, clean up formatting, add section for unrecoverable error handling. v2: fix grammatical errors Reviewed-by: Yong Zhao <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-30drm/amdgpu: bypass some cleanup work after err_event_athub (v2)Le Ma1-9/+11
PSP lost connection when err_event_athub occurs. These cleanup work can be skipped in BACO reset. v2: squash in missing include (Alex) Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-25drm/amdgpu: define macros for retire page reservationGuchun Chen1-6/+11
Easy for maintainance. Signed-off-by: Guchun Chen <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-25drm/amdgpu: refine reboot debugfs operation in ras case (v3)Guchun Chen1-7/+12
Ras reboot debugfs node allows user one easy control to avoid gpu recovery hang problem and directly reboot system per card basis, after ras uncorrectable error happens. However, it is one common entry, which should get rid of ras_ctrl node and remove ip dependence when inputting by user. So add one new auto_reboot node in ras debugfs dir to achieve this. v2: in commit mssage, add justification why ras reboot debugfs node is needed. v3: use debugfs_create_bool to create debugfs file for boolean value Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-15dmr/amdgpu: Fix crash on SRIOV for ERREVENT_ATHUB_INTERRUPT interrupt.Andrey Grodzovsky1-0/+6
Ignre the ERREVENT_ATHUB_INTERRUPT for systems without RAS. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-and-tested-by: Jack Zhang <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-10drm/amdgpu: avoid ras error injection for retired pageTao Zhou1-0/+44
check whether a page is bad page before umc error injection, bad page should not be accessed again Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-10drm/amdgpu/ras: document the reboot ras optionAlex Deucher1-1/+2
We recently added it, but never documented it. Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-10drm/amdgpu/ras: fix typos in documentationAlex Deucher1-2/+2
Fix a couple of spelling typos. Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-07drm/amdgpu: Fix error handling in amdgpu_ras_recovery_initFelix Kuehling1-1/+1
Don't set a struct pointer to NULL before freeing its members. It's hard to see what's happening due to a local pointer-to-pointer data aliasing con->eh_data. Signed-off-by: Felix Kuehling <[email protected]> Tested-by: Philip Cox <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu: simplify the access to eeprom_control structTao Zhou1-3/+3
simplify the code of accessing to eeprom_control struct Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu: replace mmhub_funcs with mmhub.funcsTao Zhou1-2/+2
remove mmhub_funcs in adev Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu/ras: fix and update the documentation for RASAlex Deucher1-7/+46
Add new sections to amdgpu.rst, fix up formatting issues, add additional documentation to each section. Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu: avoid null pointer dereferenceGuchun Chen1-2/+2
null ptr should be checked first to avoid null ptr access Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pagesAlex Deucher1-1/+2
We are reserving vram pages so they should be aligned to the GPU page size. Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pagesTao Zhou1-1/+6
There are two cases of reserve error should be ignored: 1) a ras bad page has been allocated (used by someone); 2) a ras bad page has been reserved (duplicate error injection for one page); DRM_ERROR is unnecessary for the failure of bad page reserve Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03docs: drm/amdgpu: Resolve build warningsAdam Zerella1-18/+26
Some of the documentation formatting could be improved which will resolve some Sphinx amdgpu build warnings e.g WARNING: Unexpected indentation. WARNING: Block quote ends without a blank line; unexpected unindent. WARNING: Inline emphasis start-string without end-string. Signed-off-by: Adam Zerella <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-17drm/amdgpu: cleanup creating BOs at fixed location (v2)Christian König1-79/+6
The placement is something TTM/BO internal and the RAS code should avoid touching that directly. Add a helper to create a BO at a fixed location and use that instead. v2: squash in fixes (Alex) Signed-off-by: Christian König <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: fix ras ctrl debugfs node leakGuchun Chen1-7/+5
Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: support pcie bif ras query and injectGuchun Chen1-0/+5
Call pcie bif ras query/inject in amdgpu ras. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: enable error injection to XGMI block via debugfsHawking Zhang1-0/+1
allow inject error to XGMI block via debugfs node ras_ctrl Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: Allow to reset to EERPOM table.Andrey Grodzovsky1-0/+25
The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow to avoid this by setting table header back to empty if needed. v2: Switch to debugfs entry instead of load time parameter. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: move umc ras init to umc blockTao Zhou1-4/+0
move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: Avoid RAS recovery init when no RAS support.Andrey Grodzovsky1-1/+6
Fixes driver load regression on APUs. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵Tao Zhou1-13/+26
proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdgpu: Hook EEPROM table to RASTao Zhou1-28/+81
support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>