Age | Commit message | Author | Files | Lines | |
---|---|---|---|---|---|
2024-06-14 | drm/amdgpu: set RAS fed status for more cases | Tao Zhou | 1 | -0/+1 | |
Indicate fatal error for each RAS block and NBIO. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: create amdgpu_ras_in_recovery to simplify code | Tao Zhou | 1 | -12/+19 | |
Reduce redundant code and user doesn't need to pay attention to RAS details. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: trigger mode1 reset for RAS RMA status | Tao Zhou | 1 | -6/+22 | |
Check RMA status in bad page retirement flow. v2: fix coding bugs in v1. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: move aca/mca init functions into ras_init() stage | Yang Wang | 1 | -23/+50 | |
adjust the function position to better match aca/mca fini code in ras_fini(). Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-14 | drm/amdgpu: add reset source in various cases | Eric Huang | 1 | -0/+1 | |
To fulfill the reset event description. Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: add RAS is_rma flag | Tao Zhou | 1 | -5/+4 | |
Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: Update programming for boot error reporting | Hawking Zhang | 1 | -54/+45 | |
AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplified according to the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-06-05 | drm/amdgpu: Estimate RAS reservation when report capacity v2 | Hawking Zhang | 1 | -0/+20 | |
Add estimate of how much vram we need to reserve for RAS when calculating the total available vram. v2: apply the change to MP0 v13_0_2 and v13_0_14 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-29 | drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function | Yang Wang | 1 | -1/+1 | |
fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function. Fixes: 865d3397630b ("drm/amdgpu: add aca deferred error type support") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-23 | drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled | Yang Wang | 1 | -0/+6 | |
skip to create 'xxx_err_count' node when ACA is enabled. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: fix ACA no query result after gpu reset | Yang Wang | 1 | -5/+4 | |
fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG() | Yang Wang | 1 | -0/+18 | |
create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: Fix the null pointer dereference to ras_manager | Ma Jun | 1 | -2/+5 | |
Check ras_manager before using it Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-17 | drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr | Ma Jun | 1 | -13/+0 | |
Remove dead code in amdgpu_ras_add_mca_err_addr Signed-off-by: Ma Jun <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-08 | drm/amdgpu: change log level | YiPeng Chai | 1 | -1/+1 | |
Change log level. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-08 | drm/amdgpu: fix RAS unload driver issue in SRIOV | Yang Wang | 1 | -6/+8 | |
Fix null pointer issue when unload driver in SRIOV mode. Adjust the function position to ensure that the amdgpu_mca/aca_xxx_init() related functions can be initialized properly. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-05-02 | drm/amdgpu: Add psp v13_0_14 ip block | Hawking Zhang | 1 | -0/+2 | |
Add psp v13_0_14 ip block support. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-30 | drm/amdgpu: Remove redundant function call | YiPeng Chai | 1 | -16/+6 | |
Remove redundant function call. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-30 | drm/amdgpu: add MCA smu cache support | Yang Wang | 1 | -0/+9 | |
v1: because SMU CE valid mca bank will be cleared after reading, this patch adds mca cache at the driver level to ensure that the mca bank is not lost. v2: refine amdgpu_mca_init/fini/reset() function name. v3: add mca_cache.lock support; only add CE bank to mca bank cache. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Fix ras mode2 reset failure in ras aca mode | YiPeng Chai | 1 | -0/+4 | |
Fix ras mode2 reset failure in ras aca mode. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Use new interface to reserve bad page | YiPeng Chai | 1 | -3/+1 | |
Use new interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Fix address translation defect | YiPeng Chai | 1 | -1/+1 | |
retired_page is page frame and should be expanded to the full address when querying status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add poison consumption handler | YiPeng Chai | 1 | -4/+39 | |
Add poison consumption handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Add delay work to retire bad pages | YiPeng Chai | 1 | -1/+35 | |
Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add interface to update umc v12_0 ecc status | YiPeng Chai | 1 | -0/+2 | |
Add interface to update umc v12_0 ecc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add poison creation handler | YiPeng Chai | 1 | -7/+69 | |
Add poison creation handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: prepare for logging ecc errors | YiPeng Chai | 1 | -0/+32 | |
Prepare for logging ecc errors. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: add message fifo to handle RAS poison events | YiPeng Chai | 1 | -0/+35 | |
Add message fifo to handle RAS poison events. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-26 | drm/amdgpu: Add interface to reserve bad page | YiPeng Chai | 1 | -0/+19 | |
Add interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-04-09 | drm/amdgpu: Set fatal errror detected flag earlier | Lijo Lazar | 1 | -13/+28 | |
In case of fatal errors, set FED status when interrupt is received. Set the flag on other devices in the hive before RAS recovery work. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-22 | drm/amdgpu: add ras event id support for ACA | Yang Wang | 1 | -5/+6 | |
add ras event id support for ACA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-20 | drm/amdgpu: add aca deferred error type support | Yang Wang | 1 | -2/+6 | |
add aca deferred error type support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-20 | drm/amdgpu: make reset method configurable for RAS poison | Tao Zhou | 1 | -2/+2 | |
Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-03-20 | drm/amdgpu: add ras event id support | Yang Wang | 1 | -71/+136 | |
add amdgpu ras event id support to better distinguish different error information sources in dmesg logs. The following logs will be identified by event id: {event_id} interrupt to inform RAS event; {event_id} ACA logs; {event_id} errors statistic since from current injection/error query; {event_id} errors statistic since from gpu load. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-02-26 | drm/amdgpu: Fix ineffective ras_mask settings | Stanley.Yang | 1 | -0/+1 | |
Check amdgpu_ras_mask to fix ineffective ras_mask setting due to special asic without sram ecc enable but with poison supported. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-02-26 | drm/amdgpu: Add fatal error detected flag | Lijo Lazar | 1 | -0/+32 | |
For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-31 | drm/amdgpu: disable RAS feature when fini | Tao Zhou | 1 | -1/+1 | |
Send RAS disable feature command in fini. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-31 | drm/amdgpu: Update boot time errors polling sequence | Hawking Zhang | 1 | -1/+13 | |
Update boot time errors polling sequence to align with the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Frank Min <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-25 | drm/amdgpu: Support passing poison consumption ras block to SRIOV | YiPeng Chai | 1 | -1/+1 | |
Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-25 | drm/amdgpu: adjust aca init/fini sequence to match gpu reset | Yang Wang | 1 | -2/+13 | |
- move aca init/fini function into ras init/fini to adapt gpu reset sequence. - add new function amdgpu_aca_reset() Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-25 | drm/amdgpu: Fix module unload hang with RAS enabled | Mukul Joshi | 1 | -0/+4 | |
The driver unload hangs because the page retirement kthread cannot be stopped as it is sleeping and waiting on page retirement event to occur. Add kthread_should_stop() to the event condition to wake up the kthread when kthread stop is called during driver unload. Fixes: 3fdcd0a31d7a ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement") Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-22 | drm/amdgpu: skip call ras_late_init if ras block is not supported | Yang Wang | 1 | -2/+5 | |
skip call ras_late_init callback if ras block is not supported. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-22 | drm/amdgpu:Support retiring multiple MCA error address pages | YiPeng Chai | 1 | -8/+35 | |
Support retiring multiple MCA error address pages in one in-band query for umc v12_0. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-22 | drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning | YiPeng Chai | 1 | -0/+5 | |
Use asynchronous polling to handle umc_v12_0 poisoning. v2: 1. Change function name. 2. Change the debugging information content. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-22 | drm/amdgpu: Fix ras features value calltrace | Stanley.Yang | 1 | -5/+6 | |
The high three bits of the ras features mask indicate the socket id, so the check should skip those three bits before disabling all ras features. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-22 | drm/amdgpu: Prepare for asynchronous processing of umc page retirement | YiPeng Chai | 1 | -0/+34 | |
Preparing for asynchronous processing of umc page retirement. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-18 | drm/amdgpu: Show deferred error count for UMC | Stanley.Yang | 1 | -2/+6 | |
Show deferred error count for UMC sysfs node Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-18 | drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[] | Yang Wang | 1 | -1/+4 | |
fix array index out of bounds issue for ras_block_string[] array. Fixes: 30df05fb74f6 ("drm/amdgpu: Align ras block enum with firmware") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-15 | drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported | Hawking Zhang | 1 | -77/+93 | |
Move ras capability check to amdgpu_ras_check_supported. Driver will query ras capability through psp interface, or vbios interface, or specific ip callbacks. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | |||||
2024-01-15 | drm/amdgpu: Log deferred error separately | Candice Li | 1 | -20/+96 | |
Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> |
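
Two of the fixes listed above ("Fix address translation defect" and "Fix ras features value calltrace") come down to small bit manipulations. The following is a minimal userspace sketch of those two ideas, not the driver's actual code. It assumes 4 KiB GPU pages and a 32-bit features mask, and every `EXAMPLE_`/`example_` name is hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB GPU pages, so a page frame number is (address >> 12). */
#define EXAMPLE_GPU_PAGE_SHIFT 12

/* Assumption: a 32-bit features mask whose top three bits carry a socket id. */
#define EXAMPLE_SOCKETID_SHIFT 29
#define EXAMPLE_SOCKETID_MASK  0xe0000000u

/* Expand a retired page frame number to the full byte address of the page. */
static inline uint64_t example_pfn_to_addr(uint64_t pfn)
{
    return pfn << EXAMPLE_GPU_PAGE_SHIFT;
}

/* Keep only the per-block feature bits, dropping the socket id bits. */
static inline uint32_t example_feature_bits(uint32_t features)
{
    return features & ~EXAMPLE_SOCKETID_MASK;
}

/* Extract the socket id stored in the top three bits of the mask. */
static inline uint32_t example_socket_id(uint32_t features)
{
    return (features & EXAMPLE_SOCKETID_MASK) >> EXAMPLE_SOCKETID_SHIFT;
}
```

Under these assumptions, a check such as "are any RAS features still enabled?" would test `example_feature_bits(mask) != 0` rather than `mask != 0`, so a non-zero socket id alone does not count as an enabled feature.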