| Age | Commit message (Collapse) | Author | Files | Lines |
|
[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The additional call is caused by merge conflict
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: shaoyunl <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive
which will leak the hive reference.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.
Fixes: 9e08564727fc (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
mes self test rely on vm mapping, move it after
drm sched re-started so that vm mapping can work
during gpu reset.
Signed-off-by: Jack Xiao <[email protected]>
Acked-and-tested-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This extra call trace dump comes out in every gpu reset.
And it gives people a wrong impression that something
went wrong. Although actually there was not.
Signed-off-by: Evan Quan <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This is a follow-up cleanup to [1]. See bellow refcount balancing
for calling amdgpu_job_submit_direct after this cleanup as far
as I calculated.
amdgpu_fence_emit
dma_fence_init 1
dma_fence_get(fence) 2
rcu_assign_pointer(*ptr, dma_fence_get(fence) 3
---> amdgpu_job_submit_direct completes before fence signaled
amdgpu_sa_bo_free
(*sa_bo)->fence = dma_fence_get(fence) 4
amdgpu_job_free
dma_fence_put 3
amdgpu_vcn_enc_get_destroy_msg
*fence = dma_fence_get(f) 4
dma_fence_put(f); 3
amdgpu_vcn_enc_ring_test_ib
dma_fence_put(fence) 2
amdgpu_fence_process
dma_fence_put 1
amdgpu_sa_bo_remove_locked
dma_fence_put 0
---> amdgpu_job_submit_direct completes after fence signaled
amdgpu_fence_process
dma_fence_put 2
amdgpu_job_free
dma_fence_put 1
amdgpu_vcn_enc_get_destroy_msg
*fence = dma_fence_get(f) 2
dma_fence_put(f); 1
amdgpu_vcn_enc_ring_test_ib
dma_fence_put(fence) 0
[1] - https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
Signed-off-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.
Signed-off-by: Likun Gao <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead
Fixes: cf727044144d ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Change amdggpu to amdgpu and pedning to pending
Signed-off-by: Kent Russell <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.
Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yiqing Yao <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.
Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.
v2: Switch from irq_get/put to full enable/disable_irq for amdgpu
Signed-off-by: Andrey Grodzovsky <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use the correct adev variable for the drm_fb_helper in
amdgpu_device_gpu_recover(). Noticed by inspection.
Fixes: 087451f372bf ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
enable_mes and enable_mes_kiq are set in both device init and
MES IP init. Leave the ones in MES IP init, since it is
a more accurate way to judge from GC IP version.
Signed-off-by: Yifan Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Jack Xiao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
We skip rest requests if another one is already in progress.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
We need to have a work_struct to cancel this reset if another
already in progress.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Will be read by executors of async reset like debugfs.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add support for peer-to-peer communication among AMD GPUs over PCIe
bus. Support REQUIRES enablement of config HSA_AMD_P2P.
Signed-off-by: Ramesh Errabolu <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Added device coredump information:
- Kernel version
- Module
- Time
- VRAM status
- Guilty process name and PID
- GPU register dumps
v1 -> v2: Variable name change
v1 -> v2: NULL check
v1 -> v2: Code alignment
v1 -> v2: Adding dummy amdgpu_devcoredump_free
v1 -> v2: memset reset_task_info to zero
v2 -> v3: add CONFIG_DEV_COREDUMP for variables
v2 -> v3: remove NULL check on amdgpu_devcoredump_read
Signed-off-by: Somalapuram Amaranath <[email protected]>
Reviewed-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allocate memory for register value and use the same values for devcoredump.
v1 -> v2: Change krealloc_array() to kmalloc_array()
v2 -> v3: Fix alignment
Signed-off-by: Somalapuram Amaranath <[email protected]>
Reviewed-by: Shashank Sharma <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
LVDS support was implemented in DC a while ago. Just DAC
support is left to do.
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Drop all of the extra cases in the default case.
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Removed trailing whitespace from end of line in amdgpu_device.c
Signed-off-by: Mitchell Augustin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Drop extra cases in the default case.
Reviewed-by: Guchun Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
To enable TMZ feature based on IP version needs adev->ip_version
populated but its empty. Move amdgpu_gmc_tmz_set to a place where
ip_version is populated.
Signed-off-by: Sunil Khatri <[email protected]>
Reviewed-by: Alexander Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
support umc/gfx/sdma ras on guest side
Changed from V1:
move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler
Signed-off-by: Stanley.Yang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add sysfs interface to copy VBIOS.
v2: squash in fix for proper vmalloc API (Alex)
Signed-off-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Likun Gao <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
[why]
lru_list not empty warning in sw fini during repeated device bind unbind.
There should be a amdgpu_fence_wait_empty() before the flush_delayed_work()
call as Christian suggested.
[how]
Move to do flush_delayed_work for ttm bo delayed delete wq after fence_driver_hw_fini.
Tested by: Yiqing Yao <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Yiqing Yao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Enable KFD initialization with MES enabled.
Signed-off-by: Mukul Joshi <[email protected]>
Acked-by: Oak Zeng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
For kfd hasn't supported mes, skip kfd routines.
Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
mes_kiq parameter is used to enable mes kiq pipe.
This module parameter is unneccessary or enabled by default
in final version.
v2: reword commit message.
Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use the whole doorbell space for mes. Each queue in one process occupies
one doorbell slot to ring the queue submitting.
Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Some initial settings now are not available from
the atom data table. The assumption that !ps[0]
|| !ps[1] in amdgpu_atom_asic_init is not valid.
In addition, driver needs to strictly follow
atomfirmware structure (asic_init_parameters) to
initialize parameters used to execute asic_init
function, otherwise, the execution of asic_init
would fail.
This shall be applicable to all soc15 adapters,but
let make the transition on soc21 first.
Signed-off-by: Hawking Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This data has no dependencies, so encapsulate it all within
amdgpu_discovery.c.
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
- In the latest version of the header, there is a variable name change.
This should not cause any backward compatibility since the variable is
at the same offset in the struct.
Signed-off-by: Bokun Zhang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
With this, we can support more CG flags.
Signed-off-by: Evan Quan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
If the GPU is passed through to a guest VM, use the PCI
BAR for CPU FB access rather than the physical address of
carve out. The physical address is not valid in a guest.
v2: Fix HDP handing as suggested by Michel
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
amdgpu_detect_virtualization reads register, amdgpu_device_rreg access
adev->reset_domain->sem if kernel defined CONFIG_LOCKDEP, below is the
random boot hang backtrace on Vega10. It may get random NULL pointer
access backtrace if amdgpu_sriov_runtime is true too.
Move amdgpu_reset_create_reset_domain before calling to RREG32.
BUG: kernel NULL pointer dereference, address:
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
Workqueue: events work_for_cpu_fn
RIP: 0010:down_read_trylock+0x13/0xf0
Call Trace:
<TASK>
amdgpu_device_skip_hw_access+0x38/0x80 [amdgpu]
amdgpu_device_rreg+0x1b/0x170 [amdgpu]
amdgpu_detect_virtualization+0x73/0x100 [amdgpu]
amdgpu_device_init.cold.60+0xbe/0x16b1 [amdgpu]
? pci_bus_read_config_word+0x43/0x70
amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
amdgpu_pci_probe+0x1a1/0x3a0 [amdgpu]
Fixes: d0fb18b535679a ("drm/amdgpu: Move reset sem into reset_domain")
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
We don't support runtime pm on APUs. They support more
dynamic power savings using clock and powergating.
Reviewed-by: Mario Limonciello <[email protected]>
Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
These were leftover from bring up and are no longer
necessary. The information is available via
the IP discovery table.
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
otherwise adev->ip_versions is still empty when amdgpu_gmc_noretry_set
is called.
Reviewed-by: Huang Rui <[email protected]>
Signed-off-by: Yifan Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.18-2022-02-25:
amdgpu:
- Raven2 suspend/resume fix
- SDMA 5.2.6 updates
- VCN 3.1.2 updates
- SMU 13.0.5 updates
- DCN 3.1.5 updates
- Virtual display fixes
- SMU code cleanup
- Harvest fixes
- Expose benchmark tests via debugfs
- Drop no longer relevant gart aperture tests
- More RAS restructuring
- W=1 fixes
- PSR rework
- DP/VGA adapter fixes
- DP MST fixes
- GPUVM eviction fix
- GPU reset debugfs register dumping support
- Misc display fixes
- SR-IOV fix
- Aldebaran mGPU fix
- Add module parameter to disable XGMI for testing
amdkfd:
- IH ring overflow logging fixes
- CRIU fixes
- Misc fixes
Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
According to my investigation of the state of PCI
reset recently it's not working. The reason is
due to the fact the kernel PCI code rejects SBR
when there are more then one PF under same bridge
which we always have (at least AUDIO PF but usually
more) and that because SBR will reset all the PFS
and devices under the same bridge as you and you
cannot assume they support SBR.
Once we anble FLR support we can reenable this option as
FLR is doable on single PF and doens't have this
restriction.
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.18:
UAPI Changes:
Cross-subsystem Changes:
- Split out panel-lvds and lvds dt bindings .
- Put yes/no on/off disabled/enabled strings in linux/string_helpers.h
and use it in drivers and tomoyo.
- Clarify dma_fence_chain and dma_fence_array should never include eachother.
- Flatten chains in syncobj's.
- Don't double add in fbdev/defio when page is already enlisted.
- Don't sort deferred-I/O pages by default in fbdev.
Core Changes:
- Fix missing pm_runtime_put_sync in bridge.
- Set modifier support to only linear fb modifier if drivers don't
advertise support.
- As a result, we remove allow_fb_modifiers.
- Add missing clear for EDID Deep Color Modes in drm_reset_display_info.
- Assorted documentation updates.
- Warn once in drm_clflush if there is no arch support.
- Add missing select for dp helper in drm_panel_edp.
- Assorted small fixes.
- Improve fb-helper's clipping handling.
- Don't dump shmem mmaps in a core dump.
- Add accounting to ttm resource manager, and use it in amdgpu.
- Allow querying the detected eDP panel through debugfs.
- Add helpers for xrgb8888 to 8 and 1 bits gray.
- Improve drm's buddy allocator.
- Add selftests for the buddy allocator.
Driver Changes:
- Add support for nomodeset to a lot of drm drivers.
- Use drm_module_*_driver in a lot of drm drivers.
- Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau,
bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86.
- Add bridge/it6505.
- Create DP and DVI-I connectors in ast.
- Assorted nouveau backlight fixes.
- Rework amdgpu reset handling.
- Add dt bindings for ingenic,jz4780-dw-hdmi.
- Support reading edid through aux channel in ingenic.
- Add a drm driver for Solomon SSD130x OLED displays.
- Add simple support for sharp LQ140M1JW46.
- Add more panels to nt35560.
Signed-off-by: Dave Airlie <[email protected]>
From: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Dump the list of register values to trace event on GPU reset.
Signed-off-by: Somalapuram Amaranath <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This test is not particularly useful now that GTT and GART
are decoupled in the driver.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Now that we expose the benchmarks via debugfs, there is no
longer a need for the module parameter.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
To avoid multiple runs in parallel to avoid mixing results.
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add device pointer so scheduler's printing can use
DRM_DEV_ERROR() instead, which makes life easier under multiple GPU
scenario.
v2: amend all calls of drm_sched_init()
v3: fill dev pointer for all drm_sched_init() calls
Signed-off-by: Jiawei Gu <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|