Age | Commit message | Author | Files | Lines |
|
ce can also trigger an interrupt, and both ce and ue errors can even be
found in one ras query, so distinguishing between ce and ue in the
interrupt handler is unnecessary.
Signed-off-by: Tao Zhou <[email protected]>
Suggested-by: Guchun Chen <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
we only read error information for a correctable error in the interrupt
handler; gpu reset is unnecessary since no data is lost on a
correctable error
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
the initial value of ecc error count can be adjusted
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
enable umc ce interrupt and initialize ecc error count
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
correctable error can also trigger interrupt in some ras blocks
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
umc error address query can get ce/ue error address and clear error
status
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
use umc_for_each_channel to make code simpler
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
common function for all umc versions; looping over each umc channel is
a frequently used operation in the umc block, so define it as a macro to
simplify code
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
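
As an illustration of the idea in the message above, a channel loop can be wrapped in a macro roughly as follows. This is a minimal sketch, not the driver's actual macro: the names `umc_sketch_for_each_channel`, `UMC_SKETCH_INST_NUM` and `UMC_SKETCH_CHANNELS_PER_INST`, and the instance and channel counts, are assumptions made up for the example.

```c
#include <stdint.h>

/* Hypothetical sizes; the real values depend on the umc version. */
#define UMC_SKETCH_INST_NUM          2u
#define UMC_SKETCH_CHANNELS_PER_INST 4u

/* Hypothetical helper macro: visit every (instance, channel) pair. */
#define umc_sketch_for_each_channel(inst, chan)                         \
	for ((inst) = 0; (inst) < UMC_SKETCH_INST_NUM; (inst)++)        \
		for ((chan) = 0; (chan) < UMC_SKETCH_CHANNELS_PER_INST; (chan)++)

/* Sum per-channel error counts without spelling out the nested loops. */
static uint32_t umc_sketch_total_errors(
	uint32_t counts[UMC_SKETCH_INST_NUM][UMC_SKETCH_CHANNELS_PER_INST])
{
	uint32_t inst, chan, total = 0;

	umc_sketch_for_each_channel(inst, chan)
		total += counts[inst][chan];

	return total;
}
```

One caveat of this shape: a `break` inside the body only leaves the inner channel loop, so callers that need an early exit would have to handle that themselves.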
|
|
add initialization for new members of amdgpu_umc structure
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
expose more parameters and functions of specific umc version to common
umc layer, so amdgpu_umc layer and other blocks could access them
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
clearing MCA_STATUS is enough to reset the whole MCA, writing zero to
MCA_ADDR is unnecessary
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Currently the error check on variable instance is always false because
it is a uint32_t type and this is never less than zero. Fix this by
making it an int type.
Addresses-Coverity: ("Unsigned compared against 0")
Fixes: 7d0e6329dfdc ("drm/amdgpu: update more sdma instances irq support")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
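
The class of bug described above can be reproduced in a few lines. This is a sketch only: `find_instance` and both wrappers are hypothetical names invented here, not code from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup: returns an instance index, or -1 if none matches. */
static int find_instance(uint32_t client_id)
{
	return client_id < 8 ? (int)client_id : -1;
}

/* Buggy pattern: -1 is stored in a uint32_t and becomes a large positive
 * value, so the "< 0" check can never be true. */
static bool lookup_failed_unsigned(uint32_t client_id)
{
	uint32_t instance = (uint32_t)find_instance(client_id);

	return instance < 0; /* always false */
}

/* Fixed pattern: keep the result in a signed int so -1 is preserved. */
static bool lookup_failed_signed(uint32_t client_id)
{
	int instance = find_instance(client_id);

	return instance < 0;
}
```

With an out-of-range id, the unsigned variant silently reports success while the signed variant reports the failure, which is what the Coverity note points at.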
|
|
Following bitmap layout logic introduced by:
"drm/amdgpu: support get_cu_info for Arcturus".
v2: squash in fixup for gfx_v9_0.c (Alex)
v3: squash in debug print output fix
Signed-off-by: Jay Cornwall <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This change is needed because the SE/SH layout on Arcturus is 8*1, different
from the 4*2 (or 4*1) layout on Vega ASICs.
Currently the cu bitmap array is 4x4 in size, and the bitmap is used widely
across the SW stack. To limit the scale of impact, we make the cu bitmap
array compatible with the SE/SH layout on Arcturus. The cu bits of
each shader array for Arcturus will then be stored as follows:
SE0,SH0 --> bitmap[0][0]
SE1,SH0 --> bitmap[1][0]
SE2,SH0 --> bitmap[2][0]
SE3,SH0 --> bitmap[3][0]
SE4,SH0 --> bitmap[0][1]
SE5,SH0 --> bitmap[1][1]
SE6,SH0 --> bitmap[2][1]
SE7,SH0 --> bitmap[3][1]
Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
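
The SE-to-bitmap mapping listed above follows a simple pattern: the row is the SE index modulo 4 and the column is the SE index divided by 4. A minimal sketch, assuming SH is always 0 on Arcturus as in the table; `struct cu_slot`, `arcturus_cu_slot` and `store_cu_bits` are hypothetical names, not driver functions.

```c
#include <assert.h>
#include <stdint.h>

#define CU_BITMAP_ROWS 4u
#define CU_BITMAP_COLS 4u

/* Position of one shader array inside the 4x4 cu bitmap. */
struct cu_slot {
	unsigned int row;
	unsigned int col;
};

/* Map an Arcturus shader engine (0..7, SH always 0) to its bitmap slot:
 * SE0..SE3 fill column 0, SE4..SE7 fill column 1. */
static struct cu_slot arcturus_cu_slot(unsigned int se)
{
	struct cu_slot slot;

	assert(se < 8);
	slot.row = se % CU_BITMAP_ROWS;
	slot.col = se / CU_BITMAP_ROWS;
	return slot;
}

/* Store the cu bits of one shader engine in the shared 4x4 array. */
static void store_cu_bits(uint32_t bitmap[CU_BITMAP_ROWS][CU_BITMAP_COLS],
			  unsigned int se, uint32_t bits)
{
	struct cu_slot slot = arcturus_cu_slot(se);

	bitmap[slot.row][slot.col] = bits;
}
```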
|
|
The registers used for VG20 are different in that certain performance
counters were split off to TXCLK3/4. Vega10/12 doesn't have this, so add
a new vg20_get_pcie_usage to reflect this change.
Signed-off-by: Kent Russell <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fixes GPU reset crash.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Memory used by KFD applications can contain sensitive information that
should not be leaked to other processes. The current approach to prevent
leaks is to clear VRAM at allocation time. This is not effective because
memory can be reused in other ways without being cleared. Synchronously
clearing memory on the allocation path also carries a significant
performance penalty.
Stop clearing memory at allocation time. Instead mark the memory for
wipe on release.
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Wipe VRAM memory containing sensitive data when moving or releasing
BOs. Clearing the memory is pipelined to minimize any impact on
subsequent memory allocation latency. Use of a poison value should
help debug future use-after-free bugs.
When moving BOs, the existing ttm_bo_pipelined_move ensures that the
memory won't be reused before being wiped.
When releasing BOs, the BO is fenced with the memory fill operation,
which results in queuing the BO for a delayed delete.
v2: Move amdgpu_amdkfd_unreserve_memory_limit into
amdgpu_bo_release_notify so that KFD can use memory that's still
being cleared in the background
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
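
The wipe-on-release idea from the two messages above can be modelled in user space as follows. This is a simplified sketch under stated assumptions: the fill here is a synchronous `memset`, whereas the commit describes a pipelined, fenced fill; `struct sketch_bo`, `SKETCH_POISON` and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_POISON 0xA5 /* hypothetical poison byte */

/* A buffer object backed by caller-provided memory standing in for VRAM. */
struct sketch_bo {
	uint8_t *mem;
	size_t size;
	bool wipe_on_release;
};

/* Allocation path: no clearing, only mark sensitive buffers for wipe. */
static void sketch_bo_init(struct sketch_bo *bo, uint8_t *mem, size_t size,
			   bool sensitive)
{
	bo->mem = mem;
	bo->size = size;
	bo->wipe_on_release = sensitive;
}

/* Release path: overwrite marked buffers before the memory is reused. */
static void sketch_bo_release(struct sketch_bo *bo)
{
	if (bo->wipe_on_release)
		memset(bo->mem, SKETCH_POISON, bo->size);
	bo->mem = NULL;
	bo->size = 0;
}
```

A recognizable poison value, rather than zero, makes stale reads stand out when debugging use-after-free problems.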
|
|
previously the ucode loading of PSP was repeated, once in
phase_1 init/re-init/resume and once in the fw_loading routine
Avoid this double loading by clearing ip_blocks.status.hw in suspend or reset
prior to the FW loading and any block's hw_init/resume
v2:
still do the smu fw loading since it is needed by bare-metal
v3:
drop the change in reinit_early_sriov, just clear all block's status.hw
in the head place and set the status.hw after hw_init done is enough
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
for SRIOV the SOS fw of PSP is loaded in the hypervisor, so the
guest cannot tell its version, and judging a feature by
reading the sos fw version on the guest side is completely wrong
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
we can simplify all those unnecessary function under
SRIOV for vega10 since:
1) PSP L1 policy is by force enabled in SRIOV
2) original logic always set all flags which make itself
a dummy step
besides,
1) the ih_doorbell_range set should also be skipped
for VEGA10 SRIOV.
2) the gfx_common registers should also be skipped
for VEGA10 SRIOV.
Signed-off-by: Monk Liu <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
git://people.freedesktop.org/~agd5f/linux into drm-fixes
drm-fixes-5.3-2019-07-31:
amdgpu:
- Fix temperature granularity for navi
- Fix stable pstate setting for navi
- Fix VCN DPM enablement on navi
- Fix error handling on CS ioctl when processing dependencies
- Fix possible information leak in debugfs
amdkfd:
- fix memory alignment for VegaM
Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Same as navi10.
Reviewed-by: Xiaojie Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
some subblocks of gfx fail in inject test, disable them
Signed-off-by: Dennis Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check gfx error count in both ras query function and
ras interrupt handler.
gfx ras is still disabled by default due to known stability
issue found in gpu reset.
Signed-off-by: Dennis Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add functions for RAS error inject and query error counter
Signed-off-by: Dennis Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: Dennis Li <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
error injection address is not in gpu address space
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
only ue and ce errors are supported
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add err_data parameter in interrupt cb for ras clients
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add error data as parameter for ras interrupt cb and process it
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
query umc ras error address, translate it to gpu 4k page view
and save it.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
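
The last step described above, turning an error address into a 4k page and saving it, might look roughly like this. It is a sketch under assumptions: the hardware-specific translation from a umc error address to a gpu address is not shown, and `struct sketch_err_pages`, `sketch_addr_to_page` and `sketch_save_page` are hypothetical names.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* 4k pages: 2^12 bytes */
#define SKETCH_MAX_PAGES  8u /* hypothetical record capacity */

/* Small record of pages on which errors were reported. */
struct sketch_err_pages {
	uint64_t page[SKETCH_MAX_PAGES];
	unsigned int count;
};

/* Convert an already-translated byte address to a 4k page number. */
static uint64_t sketch_addr_to_page(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/* Save the page of one error address; returns -1 if the record is full. */
static int sketch_save_page(struct sketch_err_pages *rec, uint64_t addr)
{
	if (rec->count >= SKETCH_MAX_PAGES)
		return -1;

	rec->page[rec->count++] = sketch_addr_to_page(addr);
	return 0;
}
```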
|
|
add related registers, callback function and channel index table
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
more than one error address may be recorded in one query
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
remove the check of ErrorCodeExt
v2: refine the if condition for ue counting
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
create a new amdgpu_umc structure for more umc
settings in the future and switch to the new structure
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
replace some 32bit macros with 64bit operations to simplify code
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add 64 bits register access functions
v2: implement 64 bit functions in low level
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
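
For readers unfamiliar with what a 64 bit register accessor provides, here is one way to build it from two 32 bit accesses. This is only a sketch: the v2 note above says the commit implements the 64 bit functions at a low level, which this example does not reproduce, and the register file and `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_REGS 16u

/* Hypothetical register file standing in for MMIO space. */
static uint32_t sketch_regs[SKETCH_NUM_REGS];

static uint32_t sketch_rreg32(unsigned int idx)
{
	assert(idx < SKETCH_NUM_REGS);
	return sketch_regs[idx];
}

static void sketch_wreg32(unsigned int idx, uint32_t val)
{
	assert(idx < SKETCH_NUM_REGS);
	sketch_regs[idx] = val;
}

/* Read a 64 bit value: low dword at idx, high dword at idx + 1. */
static uint64_t sketch_rreg64(unsigned int idx)
{
	uint64_t lo = sketch_rreg32(idx);
	uint64_t hi = sketch_rreg32(idx + 1);

	return (hi << 32) | lo;
}

/* Write a 64 bit value as two consecutive dwords. */
static void sketch_wreg64(unsigned int idx, uint64_t val)
{
	sketch_wreg32(idx, (uint32_t)(val & 0xffffffffu));
	sketch_wreg32(idx + 1, (uint32_t)(val >> 32));
}
```

Callers then work with a single `uint64_t` instead of juggling separate low and high halves.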
|
|
v1: increase ras ce/ue error count
v2: log the number of correctable and uncorrectable errors
Signed-off-by: Tao Zhou <[email protected]>
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
check umc error count in both ras query function and
ras interrupt handler
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
init umc callback function for vega20 in sw early init phase
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Implement umc query_ras_error_count function to support querying
both correctable and uncorrectable errors
Signed-off-by: Hawking Zhang <[email protected]>
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This is common structure as UMC callback function
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
the driver needs to program RSMU and UMC registers to
support vega20 RAS feature
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
These are common structures that can be included by IP specific
source files
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Unused.
Acked-by: Sam Ravnborg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
And fix the fallout.
Acked-by: Sam Ravnborg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
And fix the fallout.
Acked-by: Sam Ravnborg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
And fix up the fallout.
Acked-by: Sam Ravnborg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
And fix the fallout.
Acked-by: Sam Ravnborg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|