aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
AgeCommit message (Collapse)AuthorFilesLines
2024-01-09drm/amdgpu: Packed socket_id to ras feature maskHawking Zhang1-0/+5
Initialize RAS feature mask bit[31:29] with socket_id. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-09drm/amdgpu: Support poison error injection via ras_ctrl debugfsCandice Li1-2/+5
Support poison error injection. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-09drm/amdgpu: Drop unnecessary sentences about CE and deferred error.Candice Li1-9/+5
Remove "no user action is needed" for correctable and deferred error to avoid confusion. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-05Revert "drm/amdgpu: enable mca debug mode on APU by default"Hawking Zhang1-2/+1
Not needed any more with firmware fixes Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-03drm/amdgpu: Fix possible NULL dereference in ↵Srinivasan Shanmugam1-0/+3
amdgpu_ras_query_error_status_helper() Return invalid error code -EINVAL for invalid block id. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176) Suggested-by: Hawking Zhang <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2024-01-03drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.cSrinivasan Shanmugam1-3/+3
Fixes the below smatch warnings: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-12-19drm/amdgpu: MCA supports recording umc address informationYiPeng Chai1-7/+15
MCA supports recording umc address information. V2: Move err_addr variable from struct ras_err_node to struct ras_err_info. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-12-12Backmerge tag 'v6.7-rc5' into drm-nextDave Airlie1-0/+17
Linux 6.7-rc5 Alex requested this for some amdkfd work relying on the symbols exports. Signed-off-by: Dave Airlie <[email protected]>
2023-12-06drm/amdgpu: optimize the printing order of error dataYang Wang1-0/+17
sort error data list to optimize the printing order. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-30drm/amdgpu: enable mca debug mode on APU by defaultYang Wang1-1/+2
enable MCA debug mode on APU device by default. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-29drm/amdgpu: Move mca debug mode decision to rasLijo Lazar1-3/+11
Refactor code such that ras block decides the default mca debug mode, and not swsmu block. By default mca debug mode is set to false. v2: squash in uninitialized value fix (Alex) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-17drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.cYang Wang1-1/+1
fix ras err_data null pointer issue in amdgpu_ras.c Fixes: 8cc0f5669eb6 ("drm/amdgpu: Support multiple error query modes") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-09drm/amdgpu: fix software pci_unplug on some chipsVitaly Prosyak1-3/+6
When software 'pci unplug' using IGT is executed we got a sysfs directory entry is NULL for differant ras blocks like hdp, umc, etc. Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group' check that 'sd' is not NULL. [ +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90 [ +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 [ +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246 [ +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000 [ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000 [ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000 [ +0.000001] FS: 00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000 [ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0 [ +0.000001] Call Trace: [ +0.000001] <TASK> [ +0.000001] ? show_regs+0x72/0x90 [ +0.000002] ? sysfs_remove_group+0x83/0x90 [ +0.000002] ? __warn+0x8d/0x160 [ +0.000001] ? sysfs_remove_group+0x83/0x90 [ +0.000001] ? report_bug+0x1bb/0x1d0 [ +0.000003] ? handle_bug+0x46/0x90 [ +0.000001] ? exc_invalid_op+0x19/0x80 [ +0.000002] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000003] ? sysfs_remove_group+0x83/0x90 [ +0.000001] dpm_sysfs_remove+0x61/0x70 [ +0.000002] device_del+0xa3/0x3d0 [ +0.000002] ? ktime_get_mono_fast_ns+0x46/0xb0 [ +0.000002] device_unregister+0x18/0x70 [ +0.000001] i2c_del_adapter+0x26d/0x330 [ +0.000002] arcturus_i2c_control_fini+0x25/0x50 [amdgpu] [ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu] [ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu] [ +0.000186] ? mutex_lock+0x13/0x50 [ +0.000003] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ +0.000192] drm_minor_release+0x4f/0x80 [drm] [ +0.000025] drm_release+0xfe/0x150 [drm] [ +0.000027] __fput+0x9f/0x290 [ +0.000002] ____fput+0xe/0x20 [ +0.000002] task_work_run+0x61/0xa0 [ +0.000002] exit_to_user_mode_prepare+0x150/0x170 [ +0.000002] syscall_exit_to_user_mode+0x2a/0x50 Cc: Hawking Zhang <[email protected]> Cc: Luben Tuikov <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Christian Koenig <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-09drm/amdgpu: Support multiple error query modesHawking Zhang1-23/+70
Direct error query mode and firmware error query mode are supported for now. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-11-03drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_countTao Zhou1-1/+10
Handle xgmi hive case. Suggested-by: Hawking Zhang <[email protected]> Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-31drm/amdgpu: check RAS supported first in ras_reset_error_countTao Zhou1-4/+4
Not all platforms support RAS. Fixes: 73582be11ac8 ("drm/amdgpu: bypass RAS error reset in some conditions") Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-26drm/amdgpu: bypass RAS error reset in some conditionsTao Zhou1-1/+9
PMFW is responsible for RAS error reset in some conditions, driver can skip the operation. v2: add check for ras->in_recovery, it's set earlier than amdgpu_in_reset. v3: fix error in gpu reset check. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-26drm/amdgpu: enable RAS poison mode for APUTao Zhou1-1/+2
Enable it by default on APU platform. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-26drm/amdgpu: refine ras error kernel log printYang Wang1-35/+81
refine ras error kernel log to avoid user-ridden ambiguity. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-26drm/amdgpu: fix find ras error node errorYang Wang1-4/+3
the origin function might return the wrong node. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: add set/get mca debug mode operationsTao Zhou1-0/+21
Record the debug mode status in RAS. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: Enable RAS feature by default for APUStanley.Yang1-12/+2
Enable RAS feature by default for aqua vanjaram on apu platform. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: fix typo for amdgpu ras error data printYang Wang1-2/+2
typo fix. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Candice Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: Fix delete nodes that have been relesedStanley.Yang1-3/+1
Fix delete nodes that it has been freed. Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source") Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: Enable software RAS in vcn v4_0_3Hawking Zhang1-1/+3
Set VCN/JPEG RAS masks to enable software RAS for VCN and JPEG. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-20drm/amdgpu: define ras_reset_error_count functionTao Zhou1-4/+15
Make the code architecture more simple. v2: reuse ras_reset_error_count in ras_reset_error_status. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-19drm/amdgpu : Add hive ras recovery checkAsad Kamal1-2/+7
If one of the devices in the hive detects a fatal error, need to send ras recovery reset message to PMFW of all devices in the hive. For that add a flag in hive to indicate that it's undergoing ras recovery Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-13drm/amdgpu: add ras_err_info to identify RAS error sourceYang Wang1-45/+249
introduced "ras_err_info" to better identify a RAS ERROR source. NOTE: For legacy chips, keep the original RAS error print format. v1: RAS errors may come from different dies during a RAS error query, therefore, need a new data structure to identify the source of RAS ERROR. v2: - use new data structure 'amdgpu_smuio_mcm_config_info' instead of ras_err_id (in v1 patch) - refine ras error dump function name - refine ras error dump log format Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-10-13drm/amdgpu: Expose ras version & schema infoAsad Kamal1-3/+48
Expose ras table version & schema info to sysfs v2: Updated schema to get poison support info from ras context, removed asic specific checks Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-09-20drm/amdgpu: Fix false positive error logStanley.Yang1-5/+5
It should first check block ras obj whether be set, it should return 0 directly if block ras obj or hw_ops is not set. If block doesn't support RAS just return 0 is fine. Changed from V1: return 0 directly if block ras obj or hw ops is not set Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-09-20drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enableCong Liu1-0/+1
This patch fixes a memory leak in the amdgpu_ras_feature_enable() function. The leak occurs when the function sends a command to the firmware to enable or disable a RAS feature for a GFX block. If the command fails, the kfree() function is not called to free the info memory. Fixes: 9f051d6ff13f ("drm/amdgpu: Free ras cmd input buffer properly") Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Cong Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-09-20drm/amdgpu: add amdgpu mca debug sysfs supportYang Wang1-0/+2
add amdgpu mca debug sysfs support. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-09-20drm/amdgpu: Use function for IP version checkLijo Lazar1-16/+22
Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-09-11drm/amdgpu: fallback to old RAS error message for aqua_vanjaramHawking Zhang1-2/+4
So driver doesn't generate incorrect message until the new format is settled down for aqua_vanjaram Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-30drm/amdgpu: Free ras cmd input buffer properlyHawking Zhang1-7/+7
Do not access the pointer for ras input cmd buffer if it is even not allocated. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-30drm/amdgpu: Allow issue disable gfx ras cmd to firmwareHawking Zhang1-3/+4
Disable gfx ras command is needed in some use cases Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-30drm/amdgpu: Enable ras for mp0 v13_0_6 sriovYiPeng Chai1-0/+1
Enable ras for mp0 v13_0_6 sriov Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-15drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10YiPeng Chai1-0/+2
Mode1 reset needs to recover mp1 in fatal error case for mp0 v13_0_10. v2: Define a macro to wrap psp function calls. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-15drm/amdgpu: Remove unnecessary ras cap checkHawking Zhang1-4/+0
RAS global isr will only be invoked by hardware interrupt. Don't need to query ras capability in isr In addition, amdgpu_ras_interrupt_fatal_error_handler ensures the isr won't be called from guest linux side by accident. The RAS cap check in isr that introduced to fix sriov crash is not needed any more Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-09drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEGCandice Li1-1/+4
Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-09drm/amdgpu: add RAS fatal error handler for NBIO v7.9Tao Zhou1-0/+5
Register RAS fatal error interrupt and add handler. v2: only register NBIO RAS for dGPU platform. change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state to dummy functions. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-07drm/amdgpu: Issue ras enable_feature for gfx ip onlyHawking Zhang1-20/+10
For non-GFX IP blocks, set up ras obj if ras feature is allowed. For GFX IP blocks, force issue ras enable_feature command to firmware and only set up ras obj if ras feature is allowed Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-07drm/amdgpu: Apply poison mode check to GFX IP onlyHawking Zhang1-0/+1
For GFX IP that only supports poison consumption, GFX RAS won't be marked as enabled. i.e., hardware doesn't support gfx sram ecc. But driver still needs to issue firmware to enable poison consumption mode for GFX IP. In such case, check poison mode and treat GFX IP as RAS capable IP block. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-08-07drm/amdgpu: Only create err_count sysfs when hw_op is supportedHawking Zhang1-13/+18
Some IP blocks only support partial ras feature and don't have ras counter and/or ras error status register at all. Driver should not create err_count sysfs node for those IP blocks. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-07-25drm/amdgpu: Check APU flag to disable RASStanley.Yang1-1/+2
Only disable RAS by default for aqua vanjaram on APU platform. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-07-21drm/amd: Avoid reading the VBIOS part number twiceMario Limonciello1-4/+4
The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second. v2: squash in unused variable removal Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-07-18drm/amdgpu: Disable RAS by default on APU flatformStanley.Yang1-2/+11
Disable RAS feature by default for aqua vanjaram on APU platform. Changed from V1: Splite Disable RAS by default on APU platform into a separated patch. Changed from V2: Avoid to modify global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-07-18drm/amdgpu: Enable aqua vanjaram RASStanley.Yang1-0/+1
Enable RAS for aqua vanjaram. Changed from V1: Split the change in amdgpu_ras_asic_supported into a separated patch. Changed from V2: Avoid to modify global variable amdgpu_ras_enable. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-07-07drm/amdgpu: skip address adjustment for GFX RAS injectionTao Zhou1-1/+2
The address parameter of GFX RAS injection isn't related to XGMI node number, keep it unchanged. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Candice Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2023-06-30drm/amdgpu: gpu recovers from fatal error in poison modeYiPeng Chai1-0/+11
Fatal error occurs in ras poison mode, mode1 reset is used to recover gpu. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>