diff options
author | Vignesh Chander <Vignesh.Chander@amd.com> | 2024-06-24 16:44:26 -0500 |
---|---|---|
committer | Alex Deucher <alexander.deucher@amd.com> | 2024-06-27 17:31:37 -0400 |
commit | cbda2758d8bfae323b846210a3e52f0ad5fe7164 (patch) | |
tree | 034e4d34e668d0dd014fe139efa6e9736b7b28e4 /drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | |
parent | 78146c1dcd220ae98fd5f4114f992299fc5ee161 (diff) |
drm/amdgpu: process RAS fatal error MB notification
For RAS error scenario, VF guest driver will check mailbox
and set fed flag to avoid unnecessary HW accesses.
additionally, poll for reset completion message first
to avoid accidentally spamming multiple reset requests to host.
v2: add another mailbox check for handling case where kfd detects
timeout first
v3: set host_flr bit and use wait_for_reset
Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_device.c')
-rw-r--r-- | drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 |
1 files changed, 8 insertions, 1 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 183e219b6a85..b27336a05aae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5069,7 +5069,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, struct amdgpu_hive_info *hive = NULL; if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) { - amdgpu_virt_ready_to_reset(adev); + if (!amdgpu_ras_get_fed_status(adev)) + amdgpu_virt_ready_to_reset(adev); amdgpu_virt_wait_reset(adev); clear_bit(AMDGPU_HOST_FLR, &reset_context->flags); r = amdgpu_virt_request_full_gpu(adev, true); @@ -5837,6 +5838,12 @@ retry: /* Rest of adevs pre asic reset from XGMI hive. */ /* Actual ASIC resets if needed.*/ /* Host driver will handle XGMI hive reset for SRIOV */ if (amdgpu_sriov_vf(adev)) { + if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) { + dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n"); + amdgpu_ras_set_fed(adev, true); + set_bit(AMDGPU_HOST_FLR, &reset_context->flags); + } + r = amdgpu_device_reset_sriov(adev, reset_context); if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) { amdgpu_virt_release_full_gpu(adev, true); |