diff options
author | YiPeng Chai <YiPeng.Chai@amd.com> | 2024-07-02 17:53:02 +0800 |
---|---|---|
committer | Alex Deucher <alexander.deucher@amd.com> | 2024-07-10 10:13:41 -0400 |
commit | e23300dfffa178b19abc1b1b94ed7de74b0e0930 (patch) | |
tree | 04539b69ce53af53923d6e4c4edb822179699c14 /drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | |
parent | c04706914ddeb9098a509a5647c0b46c7e07cf11 (diff) |
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives
GPU B to also perform a gpu ras reset.
2. After gpu B ras reset started, gpu B queried a DE
data. Since the DE data was queried in the ras reset
thread instead of the page retirement thread, bad
page retirement work would not be triggered. Then
even if all gpu resets are completed, the bad pages
will be cached in RAM until GPU B's bad page retirement
work is triggered again and then saved to eeprom.
This patch can save the bad pages to eeprom in time after gpu
ras reset is completed.
v2:
1. Add the above description to code comments.
2. Reuse existing function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c')
-rw-r--r-- | drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 |
1 files changed, 5 insertions, 1 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 64bee125f17a..d0307c55da50 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -2934,8 +2934,12 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work) struct ras_err_data err_data; unsigned long err_cnt; - if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) + /* If gpu reset is ongoing, delay retiring the bad pages */ + if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) { + amdgpu_ras_schedule_retirement_dwork(con, + AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3); return; + } amdgpu_ras_error_data_init(&err_data); |