aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
diff options
context:
space:
mode:
authorYiPeng Chai <YiPeng.Chai@amd.com>2024-07-02 17:53:02 +0800
committerAlex Deucher <alexander.deucher@amd.com>2024-07-10 10:13:41 -0400
commite23300dfffa178b19abc1b1b94ed7de74b0e0930 (patch)
tree04539b69ce53af53923d6e4c4edb822179699c14 /drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
parentc04706914ddeb9098a509a5647c0b46c7e07cf11 (diff)
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
The problem case is as follows: 1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset. 2. After gpu B ras reset started, gpu B queried a DE data. Since the DE data was queried in the ras reset thread instead of the page retirement thread, bad page retirement work would not be triggered. Then even if all gpu resets are completed, the bad pages will be cached in RAM until GPU B's bad page retirement work is triggered again and then saved to eeprom. This patch can save the bad pages to eeprom in time after gpu ras reset is completed. v2: 1. Add the above description to code comments. 2. Reuse existing function. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c')
-rw-r--r--drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c6
1 files changed, 5 insertions, 1 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 64bee125f17a..d0307c55da50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2934,8 +2934,12 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work)
struct ras_err_data err_data;
unsigned long err_cnt;
- if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
+ /* If gpu reset is ongoing, delay retiring the bad pages */
+ if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
+ amdgpu_ras_schedule_retirement_dwork(con,
+ AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
return;
+ }
amdgpu_ras_error_data_init(&err_data);