path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
Age  Commit message  (Author, Files, Lines)
2023-06-15  drm/amdgpu: Optimize checking ras supported  (Stanley.Yang, 1 file, -15/+19)
Using "is_app_apu" to identify device in the native APU mode or carveout mode. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-04-11  drm/amdgpu: optimize redundant code in umc_v8_10  (YiPeng Chai, 1 file, -0/+31)
Optimize redundant code in umc_v8_10 Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-03-22  drm/amdgpu: Initialize umc ras callback  (Hawking Zhang, 1 file, -1/+1)
Fix a coding error that results in a null interrupt handler for umc ras. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-03-13  drm/amdgpu: Move umc ras block init to gmc ras sw_init  (Hawking Zhang, 1 file, -0/+30)
Initialize umc ras block only when umc ip block supports ras. Driver queries ras capabilities after early_init, ras block init needs to be moved to sw_init. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-02-23  drm/amdgpu: exclude duplicate pages from UMC RAS UE count  (Tao Zhou, 1 file, -2/+2)
If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable errors repeatedly by accessing the page. v2: add specific function to do the check. v3: remove duplicate pages, calculate new added bad page number. v4: reuse save_bad_pages to calculate new added bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-12-15  drm/amdgpu: add RAS poison consumption handler for SRIOV  (Tao Zhou, 1 file, -18/+26)
Send message to PF if VF receives RAS poison consumption interrupt. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-10-27  drm/amdgpu: remove ras_error_status parameter for UMC poison handler  (Tao Zhou, 1 file, -8/+5)
Make the code simpler. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-10-27  drm/amdgpu: add RAS poison handling for MCA  (Tao Zhou, 1 file, -11/+20)
For MCA poison, if unmap queue fails, only gpu reset should be triggered without page retirement handling, MCA notifier will do it. v2: handle MCA poison consumption in umc_poison_handler directly. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-10-27  drm/amdgpu: add RAS page retirement functions for MCA  (Tao Zhou, 1 file, -0/+53)
Define page retirement functions for MCA platform. v2: remove page retirement handling from MCA poison handler, let MCA notifier do page retirement. v3: remove specific poison handler for MCA to simplify code. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-15  drm/amdgpu: message smu to update bad channel info  (Stanley.Yang, 1 file, -0/+5)
The driver should notify SMU to update bad channel info when an uncorrectable error is detected in the UMC block. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-02  drm/amdgpu: Remove redundant calls of amdgpu_ras_block_late_fini in umc ras block  (yipechai, 1 file, -7/+0)
Remove redundant calls of amdgpu_ras_block_late_fini in umc ras block. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-02  drm/amdgpu: Optimize xxx_ras_fini function of each ras block  (yipechai, 1 file, -2/+2)
1. Move the variables of ras block instance members from specific xxx_ras_fini to general ras_fini call. 2. Function calls inside the modules only use parameters passed from xxx_ras_fini instead of ras block instance members. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-03-02  drm/amdgpu: Modify .ras_fini function pointer parameter  (yipechai, 1 file, -1/+1)
Modify .ras_fini function pointer parameter so that we can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-17  drm/amdgpu: Optimize xxx_ras_late_init function of each ras block  (yipechai, 1 file, -3/+3)
1. Move calling ras block instance members from module internal function to the top calling xxx_ras_late_init. 2. Module internal function calls can only use parameter variables of xxx_ras_late_init instead of ras block instance members. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-17  drm/amdgpu: Modify .ras_late_init function pointer parameter  (yipechai, 1 file, -1/+1)
Modify .ras_late_init function pointer parameter so that it can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-14  drm/amdgpu: Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function code  (yipechai, 1 file, -38/+6)
Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function code. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-01-27  drm/amdgpu: add umc_fill_error_record to make code more simple  (Tao Zhou, 1 file, -0/+21)
Create common amdgpu_umc_fill_error_record function for all versions of UMC and clean up related codes. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-01-14  drm/amdgpu: Modify umc block to fit for the unified ras block data and ops  (yipechai, 1 file, -16/+16)
1. Modify umc block to fit the unified ras block data and ops. 2. Rename amdgpu_umc_ras_funcs to amdgpu_umc_ras, and remove the _funcs suffix from the corresponding variable names. 3. Remove the const flag of the umc ras variable so that the umc ras block can be inserted into the amdgpu device ras block link list. 4. Invoke the amdgpu_ras_register_ras_block function to register the umc ras block into the amdgpu device ras block link list. 5. Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block. 6. Fill unified ras block .name, .block, .ras_late_init and .ras_fini for all umc versions. If .ras_late_init and .ras_fini are defined by the selected umc version, the defined functions take effect; if not, they default to amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini. Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: John Clements <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-01-14  drm/amdgpu: Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h  (yipechai, 1 file, -1/+1)
Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h. v2: squash in forward declaration warning fix (Alex) Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: John Clements <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-01-14  drm/amd/pm: do not expose implementation details to other blocks out of power  (Evan Quan, 1 file, -3/+2)
Those implementation details (whether swsmu is supported, whether some ppt_funcs are supported, accessing internal statistics ...) should be kept internal. Exposing implementation details is not good practice and is error prone. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30  drm/amdgpu: save error count in RAS poison handler  (Tao Zhou, 1 file, -73/+95)
Otherwise the RAS error count couldn't be queried from sysfs. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28  drm/amdgpu: add gpu reset control for umc page retirement  (Tao Zhou, 1 file, -3/+12)
Add a reset parameter for umc page retirement so the user can decide whether to trigger a gpu reset during umc page retirement. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-11-22  drm/amdgpu: query umc error info from ecc_table v2  (Stanley.Yang, 1 file, -22/+50)
If SMU supports ECCTABLE, the driver can message SMU to get ecc_table and then query umc error info from ECCTABLE. v2: optimize source code to make the logic more reasonable. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-08-16  drm/amd/amdgpu: remove unnecessary RAS context field  (Candice Li, 1 file, -1/+0)
Delete ras_if->name in the RAS ctx structure and remove related lines. Signed-off-by: Candice Li <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-07-01  drm/amdgpu: Some renames  (Luben Tuikov, 1 file, -1/+1)
Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-06-18  drm/amdgpu: message smu to update hbm bad page number  (Stanley.Yang, 1 file, -0/+4)
Use SMU to update the bad pages rather than directly accessing the EEPROM from the driver. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-09  drm/amdgpu: split umc callbacks to ras and non-ras ones  (Hawking Zhang, 1 file, -8/+9)
umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Dennis Li <[email protected]> Reviewed-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-10-30  drm/amdgpu: change to save bad pages in UMC error interrupt callback  (Dennis Li, 1 file, -3/+4)
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-04  drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0  (Guchun Chen, 1 file, -2/+3)
When amdgpu_bad_page_threshold = 0, bad page reservation stuffs are skipped in either UMC ECC irq or page retirement calling of sync flood isr. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-13  drm/amdgpu: refine ras related message print  (Guchun Chen, 1 file, -4/+6)
Prefix ras related kernel message logging with PCI device info by replacing DRM_INFO/WARN/ERROR with dev_info/warn/err. This can clearly tell user about GPU device information where ras is. And add some other ras message printing to make it more clear and friendly as well. Suggested-by: Hawking Zhang <[email protected]> Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-13  drm/amdgpu: add uncorrectable error count print in UMC ecc irq cb  (Guchun Chen, 1 file, -0/+3)
Uncorrectable error count printing is missed when issuing UMC UE injection. When going to the error count log function in GPU recover work thread, there is no chance to get correct error count value by last error injection and print, because the error status register is automatically cleared after reading in UMC ecc irq callback. So add such message printing in UMC ecc irq cb to be consistent with other RAS error interrupt cases. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-10  drm/amdgpu: call ras_debugfs_create_all in debugfs_init  (Tao Zhou, 1 file, -1/+0)
Also remove each ras IP's own debugfs creation. This is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-01-07  drm/amdgpu: removed GFX RAS support check in UMC ECC callback  (John Clements, 1 file, -7/+1)
enable GPU recovery in event of uncorrectable UMC error Signed-off-by: John Clements <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18  drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu  (Guchun Chen, 1 file, -1/+1)
BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03  drm/amdgpu: add comments in ras interrupt callback  (Tao Zhou, 1 file, -0/+4)
add comments to clarify why checking GFX IP BLOCK for each ras interrupt callback Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03  drm/amdgpu: move umc ras fini to umc block  (Tao Zhou, 1 file, -0/+15)
it's more suitable to put umc ras fini in umc block Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03  drm/amdgpu: remove ih_info parameter of umc_ras_late_init  (Tao Zhou, 1 file, -8/+7)
umc_ras_late_init can get the info by itself Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03  drm/amdgpu: move umc_ras_if from gmc to umc block  (Tao Zhou, 1 file, -14/+14)
umc_ras_if is relevant to umc Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03  drm/amdgpu: move umc ras irq functions to umc block  (Tao Zhou, 1 file, -1/+64)
move umc ras irq functions from gmc v9 to generic umc block, these functions are relevant to umc and they can be shared among all generations of umc Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16  drm/amdgpu: rename umc ras_init to err_cnt_init  (Tao Zhou, 1 file, -2/+2)
this interface is related to specific version of umc, distinguish it from ras_late_init Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16  drm/amdgpu: move umc ras init to umc block  (Tao Zhou, 1 file, -0/+4)
move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16  drm/amdgpu: move umc late init from gmc to umc block  (Tao Zhou, 1 file, -0/+73)
umc late init is umc specific, it's more suitable to be put in umc block Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>