drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed

The problem case is as follows:
1. GPU A triggers a GPU RAS reset and drives GPU B to perform a GPU RAS reset as well.
2. After GPU B's RAS reset has started, GPU B queries DE data. Since the DE data is queried in the RAS reset thread instead of the page retirement thread, bad page retirement work is not triggered. Even after all GPU resets are completed, the bad pages stay cached in RAM until GPU B's bad page retirement work is triggered again, and only then are they saved to EEPROM.

This patch saves the bad pages to EEPROM promptly after the GPU RAS reset is completed.

v2:
1. Add the above description to code comments.
2. Reuse existing function.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
author: YiPeng Chai <[email protected]> 2024-07-02 17:53:02 +0800
committer: Alex Deucher <[email protected]> 2024-07-10 10:13:41 -0400
commit: e23300dfffa178b19abc1b1b94ed7de74b0e0930 (patch)
tree: 04539b69ce53af53923d6e4c4edb822179699c14 /drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
parent: c04706914ddeb9098a509a5647c0b46c7e07cf11 (diff)
1 files changed, 5 insertions, 1 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 64bee125f17a..d0307c55da50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2934,8 +2934,12 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work)
 	struct ras_err_data err_data;
 	unsigned long err_cnt;
 
-	if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
+	/* If gpu reset is ongoing, delay retiring the bad pages */
+	if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
+		amdgpu_ras_schedule_retirement_dwork(con,
+				AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
 		return;
+	}
 
 	amdgpu_ras_error_data_init(&err_data);
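
The hunk above turns an early return into a self-rearming delayed work item: while a reset is in progress the handler queues itself again, so the first run after the reset finishes performs the save. The following is a minimal user-space sketch of that pattern, not the driver code. `struct fake_gpu`, `schedule_retirement()`, `do_page_retirement()` and `simulate()` are hypothetical stand-ins, and the 100 ms interval is an assumed value used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed interval in milliseconds; illustrative only. */
#define RETIRE_PAGE_INTERVAL_MS 100

/* Hypothetical stand-in for the per-device RAS state. */
struct fake_gpu {
	bool in_reset;        /* models amdgpu_in_reset() || amdgpu_ras_in_recovery() */
	int cached_bad_pages; /* bad pages still held in RAM */
	int saved_bad_pages;  /* bad pages written to the simulated EEPROM */
	long next_run_ms;     /* time the delayed work fires next, -1 if not queued */
};

/* Queue the retirement work to run delay_ms after now_ms. */
void schedule_retirement(struct fake_gpu *gpu, long now_ms, long delay_ms)
{
	gpu->next_run_ms = now_ms + delay_ms;
}

/* Work handler: re-arm while a reset is ongoing, otherwise save the pages. */
void do_page_retirement(struct fake_gpu *gpu, long now_ms)
{
	gpu->next_run_ms = -1;

	/* If a reset is ongoing, delay retiring the bad pages. */
	if (gpu->in_reset) {
		schedule_retirement(gpu, now_ms, RETIRE_PAGE_INTERVAL_MS * 3);
		return;
	}

	gpu->saved_bad_pages += gpu->cached_bad_pages;
	gpu->cached_bad_pages = 0;
}

/*
 * Run the work starting at t = 0 with a reset that ends at reset_end_ms.
 * Returns the time at which the cached pages were saved, or -1 if they
 * were never saved within the simulated window.
 */
long simulate(long reset_end_ms)
{
	struct fake_gpu gpu = {
		.in_reset = true,
		.cached_bad_pages = 2,
		.saved_bad_pages = 0,
		.next_run_ms = 0,
	};

	for (long now = 0; now <= 10000; now++) {
		if (now >= reset_end_ms)
			gpu.in_reset = false;

		if (gpu.next_run_ms == now) {
			do_page_retirement(&gpu, now);
			if (gpu.next_run_ms == -1) {
				assert(gpu.saved_bad_pages == 2);
				return now;
			}
		}
	}
	return -1;
}
```

With these assumed numbers, a reset that ends at 700 ms is followed by a save at 900 ms, the first retry after the reset is over. Without the re-arm in the reset branch, the handler would simply return and the cached pages would wait for some later, unrelated trigger.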