Age | Commit message (Collapse) | Author | Files | Lines |
|
Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.
How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling
Signed-off-by: Jingwen Chen <[email protected]>
Signed-off-by: Jack Zhang <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Reviewed by: Monk Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Similar to xGMI reporting the min/max bandwidth between direct peers, PCIe
will report the min/max bandwidth to the KFD.
Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Report the min/max bandwidth in megabytes to the kfd for direct
xgmi connections only. Indirect peers will report 0 since
indirect route is unknown.
Signed-off-by: Jonathan Kim <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Backmerging from drm/drm-next to the patches for AMD devices
for v5.14.
Signed-off-by: Thomas Zimmermann <[email protected]>
|
|
Use it to call disply code dependent on device->drv_data
before it's set to NULL on device unplug
v5:
Move HW finilization into this callback to prevent MMIO accesses
post cpi remove.
v7:
Split kfd suspend from device exit to expdite HW related
stuff to amdgpu_pci_remove
v8:
Squash previous KFD commit into this commit to avoid compile break.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Need do a heavy-weight TLB flush to make sure we have no more dirty data
in the cache for the unmapped pages.
Define enum TLB_FLUSH_TYPE, add flush_type parameter to
amdgpu_amdkfd_flush_gpu_tlb_pasid.
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Tiling flag and metadata are only needed for BOs created by
amdgpu_gem_object_create(), so we can remove those from the
base class.
v2: * squash tiling_flags and metadata relared patches into one
* use BUG_ON for non ttm_bo_type_device type when accessing
tiling_flags and metadata._
v3: *include to_amdgpu_bo_user
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Use amdgpu_bo_create_user() for all the BO allocations for
ttm_bo_type_device type.
v2: include amdgpu_amdkfd_alloc_gws() as well it calls amdgpu_bo_create()
for ttm_bo_type_device
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allow allocating BO structures with different structure size
than struct amdgpu_bo.
v2: Check bo_ptr_size in all amdgpu_bo_create() caller.
Signed-off-by: Nirmoy Das <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This reverts commit 73bf5cad2696fe3a21f70101821405db839ea18e.
Fixed in newer firmware
Signed-off-by: Harish Kasiviswanathan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
amdgpu driver may be in reset state during init which will not initialize the kfd,
driver need to initialize the KFD after reset by check the flag
Signed-off-by: shaoyunl <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
With the current kfd memory accounting scheme, kfd applications
can use up to 15/16 of total system memory. For system which
has small total system memory size it leaves small system memory
for OS. For example, if the system has totally 16GB of system
memory, this scheme leave OS and non-kfd applications only 1GB
of system memory. In many cases, this leads to OOM killer.
This patch changed the KFD system memory accounting scheme.
15/16 of free system memory when kfd driver load. This deduct
the system memory that OS already use.
Signed-off-by: Oak Zeng <[email protected]>
Suggested-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Move all the dummy functions in amdgpu_amdkfd.c to
amdgpu_amdkfd.h as inline functions.
Signed-off-by: Lang Yu <[email protected]>
Suggested-by: Felix Kuehling <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Huang Rui <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Determine FRAMEBUFFER_PUBLIC/PRIVATE only based host-accessibility,
not peer-accesssibility
Signed-off-by: Gang Ba <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Workaround to fix the soft hang observed in certain compute
applications.
Acked-by: Evan Quan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Fixes: c7651b73586600 ("drm/amdgpu: Fix handling of KFD initialization failures")
Signed-off-by: kernel test robot <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This will allow us to have different defaults per asic
in a future patch.
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Remember KFD module initializaton status in a global variable. Skip KFD
device probing when the module was not initialized. Other amdgpu_amdkfd
calls are then protected by the adev->kfd.dev check.
Also print a clear error message when KFD disables itself. Amdgpu
continues its initialization even when KFD failed.
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Kent Russell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
git://people.freedesktop.org/~agd5f/linux into drm-next
amd-drm-next-5.10-2020-09-03:
amdgpu:
- RAS fixes
- Sienna Cichlid updates
- Navy Flounder updates
- DCE6 (SI) support in DC
- Enable plane rotation
- Rework pre-OS vram reservation handling during driver init
- Add standard interface to dump GPU metrics table from SMU
- Rework tiling and tmz state handling in atomic commits
- Pstate fixes
- Add voltage and power hwmon interfaces for renoir
- SW CTF fixes
- S/G display fix for Raven
- Print client strings for vmfaults for vega and newer
- Manual fan control fixes
- Display updates
- Reorg power management directory structure
- Misc bug fixes
- Misc code cleanups
amdkfd:
- Topology fixes
- Add SMI events for thermal throttling and GPU resets
radeon:
- switch from pci_* to dma_* for dma allocations
- PLL fix
Scheduler:
- Clean up priority levels
UAPI:
- amdgpu INFO IOCTL query update for TMZ state
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6049
- amdkfd SMI event interface updates
https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/therm_thrott
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Add a static inline adev_to_drm() to obtain
the DRM device pointer from an amdgpu_device pointer.
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Get the amdgpu_device from the DRM device by use
of an inline function, drm_to_adev(). The inline
function resolves a pointer to struct drm_device
to a pointer to struct amdgpu_device.
v2: Use a typed visible static inline function
instead of an invisible macro.
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The whole approach wasn't thought through till the end.
We already had a reset lock like this in the past and it caused the same problems like this one.
Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.
This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929.
Signed-off-by: Christian König <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Backmerging drm-next into drm-misc-next for nouveau and panel updates.
Resolves a conflict between ttm and nouveau, where struct ttm_mem_res got
renamed to struct ttm_resource.
Signed-off-by: Thomas Zimmermann <[email protected]>
|
|
Make sure to unlock the mutex when error happen
v2:
1. correct syntax error in the commit comments
2. remove change-Id
Acked-by: Nirmoy Das <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This name makes a lot more sense, since these are about managing
driver resources rather than just memory ranges.
Acked-by: Christian König <[email protected]>
Reviewed-by: Ben Skeggs <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Add support for reporting thermal throttling events through SMI.
Also, add a counter to count the number of throttling interrupts
observed and report the count in the SMI event message.
Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.
During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.
v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.
v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;
[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249] dump_stack+0x98/0xd5
[ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833] free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831] ksys_ioctl+0x98/0xb0
[ 1230.204004] __x64_sys_ioctl+0x1a/0x20
[ 1230.205174] do_syscall_64+0x5f/0x250
[ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe
2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.
v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset
v5:
1. Fix some style issues.
Reviewed-by: Hawking Zhang <[email protected]>
Suggested-by: Andrey Grodzovsky <[email protected]>
Suggested-by: Christian König <[email protected]>
Suggested-by: Felix Kuehling <[email protected]>
Suggested-by: Lijo Lazar <[email protected]>
Suggested-by: Luben Tukov <[email protected]>
Signed-off-by: Dennis Li <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The KFD VMID assignment was hard-coded in a few places. Consolidate that in
a single variable adev->vm_manager.first_kfd_vmid. The value is still
assigned in gmc-ip-version-specific code.
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In order to surface the ASIC revision to user level, we want
to put it into the HSA topology. This can be because different
ASIC revisions may require user-level software to do different
things (e.g. patch code for things that are changed in later
hardware revisions).
The ASIC revision from the hardware is maximum of 4 bits at this
time, so put it into 4 of the open bits in the HSA capability.
Then user-level software can use this capability information to
know -- for each ASIC -- what revision-based things must be done.
Signed-off-by: Joseph Greathouse <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
ALLOC_MEM_FLAGS_* used are the same as the KFD_IOC_ALLOC_MEM_FLAGS_*,
but they are interweavedly used in kernel driver, resulting in bad
readability. For example, KFD_IOC_ALLOC_MEM_FLAGS_COHERENT is not
referenced in kernel, and it functions implicitly in kernel through
ALLOC_MEM_FLAGS_COHERENT, causing unnecessary confusion.
Replace all occurrences of ALLOC_MEM_FLAGS_* with
KFD_IOC_ALLOC_MEM_FLAGS_* to solve the problem.
Signed-off-by: Yong Zhao <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add "CP" to AMDGPU_GEM_CREATE_MQD_GFX9 to indicate it is only for CP MQD
buffer.
Signed-off-by: Yong Zhao <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The queues represented in queue_bitmap are only CP queues.
Signed-off-by: Yong Zhao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Devices from Arcturus onwards will have their UUID exposed to Thunk.
Adding neccessary functions to the kernel to propagate the uuid.
Signed-off-by: Divya Shikre <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
pp_funcs may not exist, while dpm may be enabled. This change ensures
that KFD topology will report the same as pp_dpm_sclk, as the conditions
for reporting them will be the same.
Otherwise, we may see the issue where KFD reports "100MHz" in topology
as the max speed, while DPM is working correctly.
Signed-off-by: Kent Russell <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
No need to trigger eviction as the memory mapping will not be used
anymore.
All pt/pd bos share same resv, hence the same shared eviction fence.
Everytime page table is freed, the fence will be signled and that cuases
kfd unexcepted evictions.
v2: squash in 32 bit fix
CC: Christian König <[email protected]>
CC: Felix Kuehling <[email protected]>
CC: Alex Deucher <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: xinhui pan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Using a heavy-weight TLB flush once is not sufficient. Concurrent
memory accesses in the same TLB cache line can re-populate TLB entries
from stale texture cache (TC) entries while the heavy-weight TLB
flush is in progress. To fix this race condition, perform another TLB
flush after the heavy-weight one, when TC is known to be clean.
Move the workaround into the low-level TLB flushing functions. This way
they apply to amdgpu as well, and KIQ-based TLB flush only needs to
synchronize once.
Signed-off-by: Felix Kuehling <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: shaoyun liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
So far the kfd driver implemented same routines for runtime and system
wide suspend and resume (s2idle or mem). During system wide suspend the
kfd aquires an atomic lock that prevents any more user processes to
create queues and interact with kfd driver and amd gpu. This mechanism
created problem when amdgpu device is runtime suspended with BACO
enabled. Any application that relies on kfd driver fails to load because
the driver reports a locked kfd device since gpu is runtime suspended.
However, in an ideal case, when gpu is runtime suspended the kfd driver
should be able to:
- auto resume amdgpu driver whenever a client requests compute service
- prevent runtime suspend for amdgpu while kfd is in use
This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime power management.
Reviewed-by: Oak Zeng <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
[Why]
TLB flush method has been deprecated using kfd2kgd interface.
This implementation is now on the amdgpu_amdkfd API.
[How]
TLB flush functions now implemented in amdgpu_amdkfd.
Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This can save users much troubles. As they do not
actually need to care whether swSMU or traditional
powerplay routine should be used.
V2: apply the fixes to vi.c and cik.c also
V3: squash in oops fix
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The KIQ is on the second MEC and its reservation is covered in the
latter logic, so no need to reserve its bit twice.
Signed-off-by: Yong Zhao <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This is the same idea as the kfd device info probe and move all the
probe control together for easy maintenance.
Signed-off-by: Yong Zhao <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The patch c670707 drm/amd: Pass drm_device to kfd introduced this issue and
fix the following compiler error.
CC [M] drivers/gpu/drm/amd/amdgpu//../powerplay/smumgr/fiji_smumgr.o
drivers/gpu/drm/amd/amdgpu//amdgpu_amdkfd.c:746:6: error: conflicting types for ‘kgd2kfd_device_init’
bool kgd2kfd_device_init(struct kfd_dev *kfd,
^
In file included from drivers/gpu/drm/amd/amdgpu//amdgpu_amdkfd.c:23:0:
drivers/gpu/drm/amd/amdgpu//amdgpu_amdkfd.h:253:6: note: previous declaration of ‘kgd2kfd_device_init’ was here
bool kgd2kfd_device_init(struct kfd_dev *kfd,
^
scripts/Makefile.build:273: recipe for target 'drivers/gpu/drm/amd/amdgpu//amdgpu_amdkfd.o' failed
make[1]: *** [drivers/gpu/drm/amd/amdgpu//amdgpu_amdkfd.o] Error 1
Signed-off-by: Prike Liang <[email protected]>
Acked-by: Evan Quan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
kfd needs drm_device to call into drm_cgroup functions
Signed-off-by: Harish Kasiviswanathan <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This patch is to add asic flag to enable device probe during kfd init.
Signed-off-by: Huang Rui <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This optimizes out the pci device id usage in KFD and makes the code
more maintainable.
Signed-off-by: Yong Zhao <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Navi12 has the same interface as Navi10
Signed-off-by: shaoyunl <[email protected]>
Reviewed-by: Jack Xiao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Hook up the SW SMU power profile switch in KFD routine.
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Kevin Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Same as navi10.
Reviewed-by: Xiaojie Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add amdgpu_amdkfd_arcturus_get_functions stub when
CONFIG_HSA_AMD is undefinded.
Signed-off-by: James Zhu <[email protected]>
Reviewed-by: Yong Zhao <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|