Age | Commit message (Collapse) | Author | Files | Lines |
|
Change it to external interface.
Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check hw ras capablity via atomfirmware
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Clarify some areas, clean up formatting, add section for
unrecoverable error handling.
v2: fix grammatical errors
Reviewed-by: Yong Zhao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
PSP lost connection when err_event_athub occurs. These cleanup work can be
skipped in BACO reset.
v2: squash in missing include (Alex)
Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Easy for maintainance.
Signed-off-by: Guchun Chen <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Ras reboot debugfs node allows user one easy control to avoid
gpu recovery hang problem and directly reboot system per card
basis, after ras uncorrectable error happens. However, it is
one common entry, which should get rid of ras_ctrl node and
remove ip dependence when inputting by user. So add one new
auto_reboot node in ras debugfs dir to achieve this.
v2: in commit mssage, add justification why ras reboot debugfs
node is needed.
v3: use debugfs_create_bool to create debugfs file for boolean value
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Ignre the ERREVENT_ATHUB_INTERRUPT for systems without RAS.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-and-tested-by: Jack Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check whether a page is bad page before umc error injection, bad page
should not be accessed again
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
We recently added it, but never documented it.
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fix a couple of spelling typos.
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Don't set a struct pointer to NULL before freeing its members. It's
hard to see what's happening due to a local pointer-to-pointer data
aliasing con->eh_data.
Signed-off-by: Felix Kuehling <[email protected]>
Tested-by: Philip Cox <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
simplify the code of accessing to eeprom_control struct
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
remove mmhub_funcs in adev
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
null ptr should be checked first to avoid null ptr access
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
We are reserving vram pages so they should be aligned to the
GPU page size.
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
There are two cases of reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page);
DRM_ERROR is unnecessary for the failure of bad page reserve
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Some of the documentation formatting could be improved
which will resolve some Sphinx amdgpu build warnings e.g
WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.
Add a helper to create a BO at a fixed location and use that instead.
v2: squash in fixes (Alex)
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the node one by one.
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Call pcie bif ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
allow inject error to XGMI block via debugfs node ras_ctrl
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow to avoid this by setting
table header back to empty if needed.
v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
move umc ras init from ras module to umc block, generic ras module
should pay less attention to specific ras block.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fixes driver load regression on APUs.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
proper place
ras recovery_init should be called after ttm init,
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini
v2: add more comment and print.
remove cancel_work_sync in recovery_init.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
support eeprom records load and save for ras,
move EEPROM records storing to bad page reserving
v2: remove redundant check for con->eh_data
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
change bps type from retired page to eeprom table record, prepare for
saving umc error records to eeprom
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In case of RAS error allow user configure auto system
reboot through ras_ctrl.
This is also part of the temproray work around for the RAS
hang problem.
v4: Use latest kernel API for disk sync.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Problem:
Under certain conditions, when some IP bocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.
Temporary fix until proper solution in PSP/SMU is ready:
When uncorrectable error happens the DF will unconditionally
broadcast error event packets to all its clients/slave upon
receiving fatal error event and freeze all its outbound queues,
err_event_athub interrupt will be triggered.
In such case and we use this interrupt
to issue GPU reset. THe GPU reset code is modified for such case to avoid HW
reset, only stops schedulers, deatches all in progress and not yet scheduled
job's fences, set error code on them and signals.
Also reject any new incoming job submissions from user space.
All this is done to notify the applications of the problem.
v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count
v3:
Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive memebers.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In late_init for ras, the helper function will be used to
1). disable ras feature if the IP block is masked as disabled
2). send enable feature command if the ip block was masked as enabled
3). create debugfs/sysfs node per IP block
4). register interrupt handler
v2: check ih_info.cb to decide add interrupt handler or not
v3: add ras_late_fini for cleanup all the ras fs node and remove
interrupt handler
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Ras controller interrupt and Ras err event athub interrupt are two dedicated
interrupts for RAS support.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use unsigned long type for the same ras count variable.
This will avoid overflow on 64 bit system.
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
feature mask info is enough for rocm tool,
"cat /sys/class/drm/card0/device/ras/features" will get the
info like this:
feature mask: 0x3ffb
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
call mmhub ras query/inject in amdgpu ras
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
ras sub block index could be passed from shell command
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Guchun Chen <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
remove confused ras error type info
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
ce can also trigger interrupt, and even both ce and ue error can be
found in one ras query, distinguishing between ce and ue in interrupt
handler is uncessary.
Signed-off-by: Tao Zhou <[email protected]>
Suggested-by: Guchun Chen <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
correctable error can also trigger interrupt in some ras blocks
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
umc error address query can get ce/ue error address and clear error
status
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check gfx error count in both ras querry function and
ras interrupt handler.
gfx ras is still disabled by default due to known stability
issue found in gpu reset.
Signed-off-by: Dennis Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
error injection address is not in gpu address space
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
only ue and ce errors are supported
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add error data as parameter for ras interrupt cb and process it
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
more than one error address may be recorded in one query
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
create new amdgpu_umc structure to for more umc
settings in future and switch to the new structure
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
v1: increase ras ce/ue error count
v2: log the number of correctable and uncorrectable errors
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check umc error count in both ras querry function and
ras interrupt handler
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
These are common structures that can be included by IP specific
source files
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
this function is not needed any more. error injection is
the only way to validate ras but it can't be executed in
amdgpu_ras_init, where gpu is even not initialized
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|