Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2024-09-18 | drm/amdgpu: fix typo in the comment | Yan Zhen | 1 | -1/+1 | |
Correctly spelled comments make it easier for the reader to understand the code. Replace 'udpate' with 'update' in the comment & replace 'recieved' with 'received' in the comment & replace 'dsiable' with 'disable' in the comment & replace 'Initiailize' with 'Initialize' in the comment & replace 'disble' with 'disable' in the comment & replace 'Disbale' with 'Disable' in the comment & replace 'enogh' with 'enough' in the comment & replace 'availabe' with 'available' in the comment. Acked-by: Christian König <[email protected]> Signed-off-by: Yan Zhen <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-09-17 | drm/amdgpu: disable GPU RAS bad page feature for specific ASIC | Tao Zhou | 1 | -0/+5 | |
The feature is not applicable to specific app platform. v2: update the disablement condition and commit description v3: move the setting to amdgpu_ras_check_supported Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-08-06 | drm/amdgpu: remove RAS unused paramter 'err_addr' | Yang Wang | 1 | -9/+9 | |
Affected functions: amdgpu_ras_error_statistic_ue_count(), amdgpu_ras_error_statistic_ce_count(), amdgpu_ras_error_statistic_de_count(). The parameter 'err_addr' is no longer used since the following patch. Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-08-06 | drm/amdgpu: create function to check RAS RMA status | Tao Zhou | 1 | -6/+16 | |
For the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-08-06 | drm/amdgpu: Add more types for boot time error reporting | Hawking Zhang | 1 | -0/+10 | |
Data abort exception and unknown errors are supported. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-23 | drm/amdgpu: Remove unused code | YiPeng Chai | 1 | -23/+0 | |
Remove unused code. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-10 | drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed | YiPeng Chai | 1 | -1/+5 | |
The problem case is as follows: 1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset. 2. After gpu B ras reset started, gpu B queried a DE data. Since the DE data was queried in the ras reset thread instead of the page retirement thread, bad page retirement work would not be triggered. Then even if all gpu resets are completed, the bad pages will be cached in RAM until GPU B's bad page retirement work is triggered again and then saved to eeprom. This patch can save the bad pages to eeprom in time after gpu ras reset is completed. v2: 1. Add the above description to code comments. 2. Reuse existing function. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-10 | drm/amdgpu: flush all cached ras bad pages to eeprom | YiPeng Chai | 1 | -6/+29 | |
Before uninstalling gpu driver, flush all cached ras bad pages to eeprom. v2: Put the same code into a function and reuse the function. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-08 | drm/amdgpu: add ras event state device attribute support | Yang Wang | 1 | -4/+52 | |
add amdgpu ras 'event_state' sysfs device attribute support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-08 | drm/amdgpu: add ras POSION_CONSUMPTION event id support | Yang Wang | 1 | -3/+13 | |
add amdgpu ras POSION_CONSUMPTION event id support. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-08 | drm/amdgpu: add ras POSION_CREATION event id support | Yang Wang | 1 | -3/+14 | |
add amdgpu ras POSION_CREATION event id support. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-08 | drm/amdgpu: refine amdgpu ras event id core code | Yang Wang | 1 | -18/+84 | |
v1: - use unified event id to manage ras events - add a new function amdgpu_ras_query_error_status_with_event() to accept event type as parameter. v2: add a warn log to show the location of function failure when calling amdgpu_ras_mark_event(). (Tao Zhou) v3: change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL. v4: rename amdgpu_ras_get_recovery_event() to amdgpu_ras_get_fatal_error_event(). Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-08 | drm/amdgpu: sysfs node disable query error count during gpu reset | YiPeng Chai | 1 | -0/+3 | |
Sysfs node disable query error count during gpu reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-07-01 | drm/amdgpu: Fix hbm stack id in boot error report | Hawking Zhang | 1 | -1/+1 | |
To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to hbm stack 1. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-27 | drm/amdgpu: add gpu reset check and exception handling | YiPeng Chai | 1 | -0/+53 | |
Add gpu reset check and exception handling for page retirement. v2: Clear poison consumption messages cached in fifo after non mode-1 reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-27 | drm/amdgpu: refine poison consumption interrupt handler | YiPeng Chai | 1 | -18/+37 | |
1. The poison fifo is only used for poison consumption requests. 2. Merge reset requests when poison fifo caches multiple poison consumption messages Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-27 | drm/amdgpu: refine poison creation interrupt handler | YiPeng Chai | 1 | -22/+17 | |
To handle the case where a large number of ras poison interrupts arrive: 1. Change to use a variable to record poison creation requests to avoid a full fifo. 2. Prioritize handling poison creation requests instead of following the order of requests received by the driver. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-27 | drm/amdgpu: add variable to record the deferred error number read by driver | YiPeng Chai | 1 | -18/+44 | |
Add variable to record the deferred error number read by driver. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: set RAS fed status for more cases | Tao Zhou | 1 | -0/+1 | |
Indicate fatal error for each RAS block and NBIO. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: create amdgpu_ras_in_recovery to simplify code | Tao Zhou | 1 | -12/+19 | |
Reduce redundant code so that the user doesn't need to pay attention to RAS details. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: trigger mode1 reset for RAS RMA status | Tao Zhou | 1 | -6/+22 | |
Check RMA status in bad page retirement flow. v2: fix coding bugs in v1. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: move aca/mca init functions into ras_init() stage | Yang Wang | 1 | -23/+50 | |
adjust the function position to better match aca/mca fini code in ras_fini(). Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: add reset source in various cases | Eric Huang | 1 | -0/+1 | |
To fulfill the reset event description. Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: add RAS is_rma flag | Tao Zhou | 1 | -5/+4 | |
Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: Update programming for boot error reporting | Hawking Zhang | 1 | -54/+45 | |
AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplified according to the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: Estimate RAS reservation when report capacity v2 | Hawking Zhang | 1 | -0/+20 | |
Add an estimate of how much vram we need to reserve for RAS when calculating the total available vram. v2: apply the change to MP0 v13_0_2 and v13_0_14 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-29 | drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function | Yang Wang | 1 | -1/+1 | |
fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function. Fixes: 865d3397630b ("drm/amdgpu: add aca deferred error type support") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-23 | drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled | Yang Wang | 1 | -0/+6 | |
skip to create 'xxx_err_count' node when ACA is enabled. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: fix ACA no query result after gpu reset | Yang Wang | 1 | -5/+4 | |
fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG() | Yang Wang | 1 | -0/+18 | |
create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: Fix the null pointer dereference to ras_manager | Ma Jun | 1 | -2/+5 | |
Check ras_manager before using it Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr | Ma Jun | 1 | -13/+0 | |
Remove dead code in amdgpu_ras_add_mca_err_addr Signed-off-by: Ma Jun <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-08 | drm/amdgpu: change log level | YiPeng Chai | 1 | -1/+1 | |
Change log level. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-08 | drm/amdgpu: fix RAS unload driver issue in SRIOV | Yang Wang | 1 | -6/+8 | |
Fix null pointer issue when unload driver in SRIOV mode. Adjust the function position to ensure that the amdgpu_mca/aca_xxx_init() related functions can be initialized properly. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-02 | drm/amdgpu: Add psp v13_0_14 ip block | Hawking Zhang | 1 | -0/+2 | |
Add psp v13_0_14 ip block support. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-30 | drm/amdgpu: Remove redundant function call | YiPeng Chai | 1 | -16/+6 | |
Remove redundant function call. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-30 | drm/amdgpu: add MCA smu cache support | Yang Wang | 1 | -0/+9 | |
v1: because SMU CE valid mca bank will be cleared after reading, this patch adds mca cache at the driver level to ensure that the mca bank is not lost. v2: refine amdgpu_mca_init/fini/reset() function name. v3: add mca_cache.lock support; only add CE bank to mca bank cache. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Fix ras mode2 reset failure in ras aca mode | YiPeng Chai | 1 | -0/+4 | |
Fix ras mode2 reset failure in ras aca mode. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Use new interface to reserve bad page | YiPeng Chai | 1 | -3/+1 | |
Use new interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Fix address translation defect | YiPeng Chai | 1 | -1/+1 | |
retired_page is page frame and should be expanded to the full address when querying status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add poison consumption handler | YiPeng Chai | 1 | -4/+39 | |
Add poison consumption handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Add delay work to retire bad pages | YiPeng Chai | 1 | -1/+35 | |
Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add interface to update umc v12_0 ecc status | YiPeng Chai | 1 | -0/+2 | |
Add interface to update umc v12_0 ecc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add poison creation handler | YiPeng Chai | 1 | -7/+69 | |
Add poison creation handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: prepare for logging ecc errors | YiPeng Chai | 1 | -0/+32 | |
Prepare for logging ecc errors. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add message fifo to handle RAS poison events | YiPeng Chai | 1 | -0/+35 | |
Add message fifo to handle RAS poison events. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Add interface to reserve bad page | YiPeng Chai | 1 | -0/+19 | |
Add interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-09 | drm/amdgpu: Set fatal errror detected flag earlier | Lijo Lazar | 1 | -13/+28 | |
In case of fatal errors, set FED status when interrupt is received. Set the flag on other devices in the hive before RAS recovery work. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-22 | drm/amdgpu: add ras event id support for ACA | Yang Wang | 1 | -5/+6 | |
add ras event id support for ACA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-20 | drm/amdgpu: add aca deferred error type support | Yang Wang | 1 | -2/+6 | |
add aca deferred error type support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> |