aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
AgeCommit message (Collapse)AuthorFilesLines
2024-05-23drm/amdgpu: correct hbm field in boot statusHawking Zhang1-1/+1
hbm filed takes bit 13 and bit 14 in boot status. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-04-26drm/amdgpu: add condition check for amdgpu_umc_fill_error_recordYiPeng Chai1-0/+1
Add condition check for amdgpu_umc_fill_error_record. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-04-26drm/amdgpu: Add delay work to retire bad pagesYiPeng Chai1-0/+1
Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-04-26drm/amdgpu: prepare for logging ecc errorsYiPeng Chai1-0/+23
Prepare for logging ecc errors. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-04-26drm/amdgpu: add message fifo to handle RAS poison eventsYiPeng Chai1-0/+18
Add message fifo to handle RAS poison events. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-04-26drm/amdgpu: Add interface to reserve bad pageYiPeng Chai1-0/+4
Add interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-03-20drm/amdgpu: add ras event id supportYang Wang1-0/+30
add amdgpu ras event id support to better distinguish different error information sources in dmesg logs. the following log will be identify by event id: {event_id} interrupt to inform RAS event {event_id} ACA logs {event_id} errors statistic since from current injection/error query {event_id} errors statistic since from gpu load Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-02-26drm/amdgpu: Add fatal error detected flagLijo Lazar1-0/+6
For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-31drm/amdgpu: Update boot time errors polling sequenceHawking Zhang1-0/+5
Update boot time errors polling sequence to align with the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Frank Min <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-22drm/amdgpu:Support retiring multiple MCA error address pagesYiPeng Chai1-1/+7
Support retiring multiple MCA error address pages in one in-band query for umc v12_0. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-22drm/amdgpu: Fix ras features value calltraceStanley.Yang1-0/+6
The high three bits of ras features mask indicate socket id, it should skip to check high three bits of ras features mask before disable all ras features. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-22drm/amdgpu: Prepare for asynchronous processing of umc page retirementYiPeng Chai1-0/+5
Preparing for asynchronous processing of umc page retirement. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: Align ras block enum with firmwareHawking Zhang1-0/+2
Driver and firmware share the same ras block enum. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: Log deferred error separatelyCandice Li1-0/+6
Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: add aca sysfs supportYang Wang1-0/+3
add aca sysfs node support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: add amdgpu ras aca query interfaceYang Wang1-3/+9
v1: add ACA error query interface v2: Add a new helper function to determine whether to use ACA or MCA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: add ACA bank dump debugfs supportYang Wang1-0/+1
add ACA bank dump debugfs support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: Add ras helper to query boot errors v2Hawking Zhang1-1/+14
Add ras helper function to query boot time gpu errors. v2: use aqua_vanjaram smn addressing pattern Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-15drm/amdgpu: implement RAS ACA driver frameworkYang Wang1-0/+1
v1: implement new RAS ACA driver code framework. v2: - rename aca_bank_set to aca_banks. - rename aca_source_xxx to aca_handle_xxx. v3: Optimize some function implementation details. (from Hawking's suggestion) Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-12-19drm/amdgpu: MCA supports recording umc address informationYiPeng Chai1-2/+11
MCA supports recording umc address information. V2: Move err_addr variable from struct ras_err_node to struct ras_err_info. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-29drm/amdgpu: Move mca debug mode decision to rasLijo Lazar1-1/+1
Refactor code such that ras block decides the default mca debug mode, and not swsmu block. By default mca debug mode is set to false. v2: squash in uninitialized value fix (Alex) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-09drm/amdgpu: Support multiple error query modesHawking Zhang1-0/+8
Direct error query mode and firmware error query mode are supported for now. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-26drm/amdgpu: refine ras error kernel log printYang Wang1-4/+1
refine ras error kernel log to avoid user-ridden ambiguity. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: add set/get mca debug mode operationsTao Zhou1-0/+5
Record the debug mode status in RAS. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: define ras_reset_error_count functionTao Zhou1-0/+2
Make the code architecture more simple. v2: reuse ras_reset_error_count in ras_reset_error_status. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-13drm/amdgpu: add ras_err_info to identify RAS error sourceYang Wang1-0/+25
introduced "ras_err_info" to better identify a RAS ERROR source. NOTE: For legacy chips, keep the original RAS error print format. v1: RAS errors may come from different dies during a RAS error query, therefore, need a new data structure to identify the source of RAS ERROR. v2: - use new data structure 'amdgpu_smuio_mcm_config_info' instead of ras_err_id (in v1 patch) - refine ras error dump function name - refine ras error dump log format Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-13drm/amdgpu: make err_data structure built-in for ras_managerYang Wang1-1/+4
(No effect outside the ras_mgr data structure) Since a new member was added to the ras_err_data data structure, it becomes unreasonable for the ras_mgr instance to contain this data, because ras mgr only uses the 2 member information of ue_count/ce_count in err_data. This patch changes the code err_data into built-in structure members, making the code directly compatible. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-13drm/amdgpu: Expose ras version & schema infoAsad Kamal1-0/+3
Expose ras table version & schema info to sysfs v2: Updated schema to get poison support info from ras context, removed asic specific checks Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-30drm/amd/amdgpu/amdgpu_ras: Increase buffer size to account for all possible ↵Lee Jones1-1/+1
values Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_sysfs_create’: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1406:20: warning: ‘_err_count’ directive output may be truncated writing 10 bytes into a region of size between 1 and 32 [-Wformat-truncation=] drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1405:9: note: ‘snprintf’ output between 11 and 42 bytes into a destination of size 32 Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-30drm/amdgpu: gpu recovers from fatal error in poison modeYiPeng Chai1-0/+1
Fatal error occurs in ras poison mode, mode1 reset is used to recover gpu. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-09drm/amdgpu: perform mode2 reset for sdma fed error on gfx v11_0_3YiPeng Chai1-0/+5
perform mode2 reset for sdma fed error on gfx v11_0_3. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-09drm/amdgpu: add instance mask for RAS injectTao Zhou1-1/+8
User can specify injected instances by the mask. For backward compatibility, the mask value is incorporated into sub block index without interface change of RAS TA. User uses logical mask and driver should convert it to physical value before sending it to RAS TA. v2: update parameter name. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-09drm/amdgpu: Add common helper to reset ras errorHawking Zhang1-0/+4
Add common helper to reset ras error status. It applies to IP blocks that follow the new ras error logging register design, and need to write 0 to reset the error status. For IP blocks that don't support the new design, please still implement ip specific helper. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-09drm/amdgpu: Add common helper to query ras error (v2)Hawking Zhang1-0/+54
Add common helper to query ras error status and log error information, including memory block id and erorr count. The helpers are applicable to IP blocks that follow the new ras error logging design. For IP blocks that don't support the new design, please still implement ip specific helper to query ras error. v2: optimize struct amdgpu_ras_err_status_reg_entry and the implementaion in helper (Lijo/Tao) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-04-11drm/amdgpu: fix unexpected block idStanley.Yang1-0/+4
Aldebaran supports VCN and JPEG RAS, it reports unexpected block id message during VCN and JPEG RAS initialization if VCN and JPEG block id not defined. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-02-23drm/amdgpu: exclude duplicate pages from UMC RAS UE countTao Zhou1-1/+2
If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page. v2: add specific function to do the check. v3: remove duplicate pages, calculate new added bad page number. v4: reuse save_bad_pages to calculate new added bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-01-05drm/amdgpu: allow query error counters for specific IP blockHawking Zhang1-1/+2
amdgpu_ras_block_late_init will be invoked in IP specific ras_late_init call as a common helper for all the IP blocks. However, when amdgpu_ras_block_late_init call amdgpu_ras_query_error_count to query ras error counters, amdgpu_ras_query_error_count queries all the IP blocks that support ras query interface. This results to wrong error counters cached in software copies when there are ras errors detected at time zero or warm reset procedure. i.e., in sdma_ras_late_init phase, it counts on sdma/mmhub errors, while, in mmhub_ras_late_init phase, it still counts on sdma/mmhub errors. The change updates amdgpu_ras_query_error_count interface to allow query specific ip error counter. It introduces a new input parameter: query_info. if query_info is NULL, it means query all the IP blocks, otherwise, only query the ip block specified by query_info. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-06-03drm/amdgpu: print umc correctable error addressStanley.Yang1-0/+5
Signed-off-by: Stanley.Yang <[email protected]> Acked-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-06-03drm/amdgpu/pm: support mca_ceumc_addr in ecctableStanley.Yang1-0/+1
SMU add a new variable mca_ceumc_addr to record umc correctable error address in EccInfo table, driver side add EccInfo_V2_t to support this feature Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-04-22drm/amdgpu: add RAS fatal error interrupt handlerTao Zhou1-0/+1
The fatal error handler is independent from general ras interrupt handler since there is no related IH ring. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-04-22drm/amdgpu: add RAS poison consumption handler (v2)Tao Zhou1-0/+1
Add support for general RAS poison consumption handler. v2: remove callback function for poison consumption. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-28drm/amdgpu/vcn: Add VCN ras error query supportMohammad Zafar Ziya1-0/+1
RAS error query support addition for VCN 2.6 V2: removed unused option and corrected comment format Moved the register definition under header file V3: poison query status check added. Removed error query interface V4: MMSCH poison check option removed, return true/false refactored. Signed-off-by: Mohammad Zafar Ziya <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-28drm/amdgpu: Add vcn and jpeg ras support flagMohammad Zafar Ziya1-0/+2
Add vcn and jpeg ras support options V2: vcn and jpeg ras flag enabled for aldebaran asic only V3: vcn and jpeg ras flag disabled for error counter query Generic poison query interface added VCN and JPEG ras enabled based on IP version check V4: vcn and jpeg ras flag moved under ecc flag for dGPU Signed-off-by: Mohammad Zafar Ziya <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-15drm/amdgpu: message smu to update bad channel infoStanley.Yang1-0/+3
It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-02drm/amdgpu: Modify .ras_fini function pointer parameteryipechai1-1/+1
Modify .ras_fini function pointer parameter so that we can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-17drm/amdgpu: define amdgpu_ras_late_init to call all ras blocks' .ras_late_inityipechai1-0/+1
Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-17drm/amdgpu: Modify .ras_late_init function pointer parameteryipechai1-1/+1
Modify .ras_late_init function pointer parameter so that it can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-14drm/amdgpu: Merge amdgpu_ras_late_init/amdgpu_ras_late_fini to ↵yipechai1-8/+0
amdgpu_ras_block_late_init/amdgpu_ras_block_late_fini 1. Merge amdgpu_ras_late_init to amdgpu_ras_block_late_init. 2. Remove amdgpu_ras_late_init since no ras block calls amdgpu_ras_late_init. 3. Merge amdgpu_ras_late_fini to amdgpu_ras_block_late_fini. 4. Remove amdgpu_ras_late_fini since no ras block calls amdgpu_ras_late_fini. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-14drm/amdgpu: Optimize operating sysfs and interrupt function interface in ↵yipechai1-3/+3
amdgpu_ras.c In order to reduce redundant struct conversion, modify operating sysfs and interrupt function interface parameters. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-14drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras blockyipechai1-6/+9
1. Define amdgpu_ras_block_late_init to create sysfs nodes and interrupt handles. 2. Define amdgpu_ras_block_late_fini to remove sysfs nodes and interrupt handles. 3. Replace ras block variable members in struct amdgpu_ras_block_object with struct ras_common_if, which can make it easy to associate each ras block instance with each ras block functional interface. 4. Add .ras_cb to struct amdgpu_ras_block_object. 5. Change each ras block to fit for the changement of struct amdgpu_ras_block_object. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>