| Age | Commit message (Collapse) | Author | Files | Lines |
|
gfx ras is only available in cerntain ip generations.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
mmhub ras is only avaiable in cerntain mmhub ip
generation.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
df/mca ras is not managed by gpu driver when gpu
is connected to cpu through xgmi. gpu driver should
register x86 mca notifier for umc ras error
notification
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
umc ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split umc callbacks
into ras and non-ras ones so gpu driver only
initializes umc ras callbacks when it manages
umc ras.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
xgmi ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. move all xgmi ras
functions to xgmi_ras_funcs so gpu driver only
initializes xgmi ras functions when it manages
xgmi ras.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
nbio ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split nbio callbacks
into ras and non-ras ones so gpu driver only
initializes nbio ras callbacks when it manages
nbio ras.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Dennis Li <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
query_ras_error_address will be invoked to query bad
page address when there is poison data in HBM consumed
by GPU engines.
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
umc query_ras_error_count will be invoked to query
umc correctable and uncorrectable error. It will
reset the umc ras error counter after the query.
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add helper functions to query correctable and
uncorrectable umc ras error.
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
umc_v6_7_funcs are callbacks to support umc ras
functionalities in aldebaran
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Driver only manages GFX/SDMA/MMHUB RAS in platforms
that gpu node is connected to cpu through XGMI, other
than that, it queries VBIOS for RAS capabilities.
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
These were leftover from the old CI dpm code which was
retired a while ago.
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fix patch check warning:
WARNING: suspect code indent for conditional statements (8, 17)
+ if (obj && obj->use < 0) {
+ DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
WARNING: braces {} are not necessary for single statement blocks
+ if (obj && obj->use < 0) {
+ DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
+ }
Signed-off-by: Bernard Zhao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: Stanley.Yang <[email protected]>
Reivewed-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Forgot to reserve a fence slot to use sdma to update page table, cause
below kernel BUG backtrace to handle vm retry fault while application is
exiting.
[ 133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281!
[ 133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
[ 133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280
[ 133.048672] amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu]
[ 133.048788] amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu]
[ 133.048905] amdgpu_vm_handle_fault+0x202/0x370 [amdgpu]
[ 133.049031] gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu]
[ 133.049165] ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu]
[ 133.049289] ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[ 133.049408] amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[ 133.049534] amdgpu_ih_process+0x9b/0x1c0 [amdgpu]
[ 133.049657] amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu]
[ 133.049669] process_one_work+0x29f/0x640
[ 133.049678] worker_thread+0x39/0x3f0
[ 133.049685] ? process_one_work+0x640/0x640
Signed-off-by: Philip Yang <[email protected]>
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
1. expand rlcg interface for gc & mmhub indirect access
2. add rlcg interface for no kiq
v2: squash in fix for gfx9 (Changfeng)
Signed-off-by: Peng Ju Zhou <[email protected]>
Reviewed-by: Emily.Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
using the control bits got from host to control registers access.
Signed-off-by: Peng Ju Zhou <[email protected]>
Reviewed-by: Emily.Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
get pf2vf msg info at it's earliest time so that
guest driver can use these info to decide whether
register indirect access enabled.
Signed-off-by: Peng Ju Zhou <[email protected]>
Reviewed-by: Emily.Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
unify host driver and guest driver indirect access
control bits names
Signed-off-by: Peng Ju Zhou <[email protected]>
Reviewed-by: Emily.Deng <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The page table of AMDGPU requires an alignment to CPU page so we should
check ioctl parameters for it. Return -EINVAL if some parameter is
unaligned to CPU page, instead of corrupt the page table sliently.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Xi Ruoyao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In Mesa, dev_info.gart_page_size is used for alignment and it was
set to AMDGPU_GPU_PAGE_SIZE(4KB). However, the page table of AMDGPU
driver requires an alignment on CPU pages. So, for non-4KB page system,
gart_page_size should be max_t(u32, PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE).
Signed-off-by: Rui Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
Link: https://github.com/loongson-community/linux-stable/commit/caa9c0a1
[Xi: rebased for drm-next, use max_t for checkpatch,
and reworded commit message.]
Signed-off-by: Xi Ruoyao <[email protected]>
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Tested-by: Dan Horák <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
int write = !(gtt->userflags & AMDGPU_GEM_USERPTR_READONLY);
v2: put short variable declaration last
Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
ttm->sg needs to be checked before accessing its child member.
Call Trace:
amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
ttm_bo_cleanup_memtype_use+0x3a/0x60 [ttm]
ttm_bo_release+0x17d/0x300 [ttm]
amdgpu_bo_unref+0x1a/0x30 [amdgpu]
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x78b/0x8b0 [amdgpu]
kfd_ioctl_alloc_memory_of_gpu+0x118/0x220 [amdgpu]
kfd_ioctl+0x222/0x400 [amdgpu]
? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
__x64_sys_ioctl+0x8e/0xd0
? __context_tracking_exit+0x52/0x90
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f97f264d317
Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdb402c338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f97f3cc63a0 RCX: 00007f97f264d317
RDX: 00007ffdb402c380 RSI: 00000000c0284b16 RDI: 0000000000000003
RBP: 00007ffdb402c380 R08: 00007ffdb402c428 R09: 00000000c4000004
R10: 00000000c4000004 R11: 0000000000000246 R12: 00000000c0284b16
R13: 0000000000000003 R14: 00007f97f3cc63a0 R15: 00007f8836200000
Signed-off-by: Guchun Chen <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add 3 sub flags to notify guest for indirect reg access of
gc, mmhub and ih
The host sets these flags depending on L1 RAP version,
asic and other scenarios. These flags ensure that
there is compatibility between different guest/host/vbios versions.
Signed-off-by: Rohit Khaire <[email protected]>
Reviewed-by: Monk Liu <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Acked-by: Luben Tuikov <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This message is not needed on Aldebaran.
Signed-off-by: Feifei Xu <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
vega20/ALDEBARAN
When resume from gpu reset, need set MP1 state to UNLOAD before reload SMU
FW otherwise will cause following errors:
[ 121.642772] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR [ 123.801051] [drm] failed to load ucode id (24) [ 123.801055] [drm] psp command (0x6) failed and response status is (0x0) [ 123.801214] [drm:psp_load_smu_fw [amdgpu]] *ERROR* PSP load smu failed!
[ 123.801398] [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [ 123.801536] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22 [ 123.801632] amdgpu 0000:04:00.0: amdgpu: GPU reset(9) failed [ 123.801691] amdgpu 0000:07:00.0: amdgpu: GPU reset(9) failed [ 123.802899] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -22
v2: add error info and including ALDEBARAN also
Signed-off-by: Chengming Gui <[email protected]>
Reviewed-and-tested-by: Guchun Chen <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
If reset handler is not implemented, reset error before proceeding.
Fixes issue with the following trace -
[ 106.508592] amdgpu 0000:b1:00.0: amdgpu: ASIC reset failed with error, -38 for drm dev, 0000:b1:00.0
[ 106.508972] amdgpu 0000:b1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 106.509116] [drm] PCIE GART of 512M enabled.
[ 106.509120] [drm] PTB located at 0x0000008000000000
[ 106.509136] [drm] VRAM is lost due to GPU reset!
[ 106.509332] [drm] PSP is resuming...
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-and-tested-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Starting Arcturus, it supports ih reroute through mmio directly
in bare metal environment. This is also valid for newer asics
such as Aldebaran.
Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Offset calculation wasn't correct as start addresses are in pfn
not in bytes.
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
No need to have special handling for swSMU supported ASICs.
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
No need to have special handling for swSMU supported ASICs.
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fix header guard and make internal functions static. Fixes the below warnings:
drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu_reset.h:24:9: warning: '__AMDUGPU_RESET_H__' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
drivers/gpu/drm/amd/amdgpu/aldebaran.c:110:6: warning: no previous prototype for function 'aldebaran_async_reset' [-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/aldebaran_ppt.c:1435:5: warning: no previous prototype for function 'aldebaran_mode2_reset' [-Wmissing-prototypes]
Signed-off-by: Lijo Lazar <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add aldebaran to devices which support recovery
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
v1: Aldebaran uses reset control to support mode2 reset. The sequences to
reset and restore hardware context are specific to a particular
configuration.
v2: Clear bus mastering before reset.
Fix coding style issues, drop unwanted variables and info log.
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Expose PG/CG set states functions for other clients
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
v1: Adds a function to load a list of FWs as passed by the caller. This is
needed as only a select need to loaded for some use cases.
v2: Omit unrelated change, remove info log, fix return value when count is 0
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This prefers reset control based handling if it's implemented
for a particular ASIC. If not, it takes the legacy path. It uses
the legacy method of preparing environment (job, scheduler tasks)
and restoring environment.
v2: remove unused variable (Alex)
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
v1: Add generic amdgpu_reset_control to handle different types of resets. It
may be added at device, hive or ip level. Each reset control has a list
of handlers associated with it to handle different types of reset. Reset
control is responsible for choosing the right handler given a particular
reset context.
Handler objects may implement a set of functions on how to handle a
particular type of reset.
prepare_env = Prepare environment/software context (not used currently).
prepare_hwcontext = Prepare hardware context for the reset.
perform_reset = Perform the type of reset.
restore_hwcontext = Restore the hw context after reset.
restore_env = Restore the environment after reset (not used currently).
Reset context carries the context of reset, as of now this is based on
the parameters used for current set of resets.
v2: Fix coding style
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring share
internal GC HW mutually.
[How]
This patch implements an advanced tdr mode.It involves an additinal
synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
step in order to find the real bad job.
1. At Step0 Resubmit stage, it synchronously submits and pends for the
first job being signaled. If it gets timeout, we identify it as guilty
and do hw reset. After that, we would do the normal resubmit step to
resubmit left jobs.
2. For whole gpu reset(vram lost), do resubmit as the old way.
v2: squash in build fix (Alex)
Signed-off-by: Jack Zhang <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
BO with ttm_bo_type_sg type can also have tiling_flag and metadata.
So so BO type check for only ttm_bo_type_kernel.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Nirmoy Das <[email protected]>
Reported-by: Tom StDenis <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Tiling flag and metadata are only needed for BOs created by
amdgpu_gem_object_create(), so we can remove those from the
base class.
v2: * squash tiling_flags and metadata relared patches into one
* use BUG_ON for non ttm_bo_type_device type when accessing
tiling_flags and metadata._
v3: *include to_amdgpu_bo_user
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use amdgpu_bo_create_user() for all the BO allocations for
ttm_bo_type_device type.
v2: include amdgpu_amdkfd_alloc_gws() as well it calls amdgpu_bo_create()
for ttm_bo_type_device
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Implement a new struct amdgpu_bo_user as subclass of
struct amdgpu_bo and a function to created amdgpu_bo_user
bo with a flag to identify the owner.
v2: amdgpu_bo_to_amdgpu_bo_user -> to_amdgpu_bo_user()
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allow allocating BO structures with different structure size
than struct amdgpu_bo.
v2: Check bo_ptr_size in all amdgpu_bo_create() caller.
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add VCN3 IB parsing to figure out to which instance we can send the
stream for decode.
v2: remove VCN instance limit as well, fix amdgpu_cs_find_mapping,
check supported formats instead of unsupported.
v3: fix typo and error handling
v4: make sure the message BO is CPU accessible
v5: fix addr calculation once more
v6: only check message buffers
v7: fix constant and use defines
v8: fix create msg calculation
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Sonny Jiang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The VCN3 instances can do both decode as well as encode.
Share the scheduler load balancing score and remove fixing encode to
only the second instance.
Signed-off-by: Christian König <[email protected]>
Reviewed-and-Tested-by: Leo Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allow separate ring to share the same scheduler score.
No functional change.
Signed-off-by: Christian König <[email protected]>
Reviewed-and-Tested-by: Leo Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
s/acccess/access/
s/inferface/interface/
s/sequnce/sequence/ .....two different places.
s/retrive/retrieve/
s/sheduling/scheduling/
s/independant/independent/
s/wether/whether/ ......two different places.
s/emmit/emit/
s/synce/sync/
Reviewed-by: Nirmoy Das<[email protected]>
Signed-off-by: Bhaskar Chowdhury <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
A previous fix I did left a rather complicated loop in
amdgpu_securedisplay_debugfs_write() for what could be expressed in a
simple sprintf, as Rasmus pointed out.
This drops the leading 0x for each byte, but is otherwise
much nicer.
Suggested-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
On initializing the framebuffer, call drm_any_plane_has_format to do a
check if the modifier is supported. drm_any_plane_has_format calls
dm_plane_format_mod_supported which is extended to validate that the
modifier is on the list of the plane's supported modifiers.
The bug was caught using igt-gpu-tools test: kms_addfb_basic.addfb25-bad-modifier
Tested on ChromeOS Zork by turning on the display, running an overlay
test, and running a YT video.
=== Changes from v1 ===
Explicitly handle DRM_FORMAT_MOD_INVALID modifier.
Cc: Alex Deucher <[email protected]>
Cc: Bas Nieuwenhuizen <[email protected]>
Signed-off-by: Mark Yacoub <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|