Age | Commit message (Collapse) | Author | Files | Lines |
|
Confirmed with hardware team, there is harvesting for gc 10.3.1.
Signed-off-by: Aaron Liu <[email protected]>
Reviewed-by: Huang Rui <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The parameter order in the list_add_tail is incorrect, it causes the
reuse of ras reserved page.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The commit d5e8ff5f7b2a ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.
Revert workaround
commit a2170b4af62f ("drm/amdgpu: Add judgement to avoid infinite loop").
Signed-off-by: yipechai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
1. The infinite loop case only occurs on multiple cards support
ras functions.
2. The explanation of root cause refer to commit 76641cbbf196
("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create new node to manage each unique ras instance to guarantee
each device .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
interface for each ras block").
5. The soft locked logs are as follows:
[ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu
[ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[ 262.166267] Call Trace:
[ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[ 262.166529] ? psi_task_switch+0xd2/0x250
[ 262.166537] ? __switch_to+0x11d/0x460
[ 262.166542] ? __switch_to_asm+0x36/0x70
[ 262.166549] process_one_work+0x220/0x3c0
[ 262.166556] worker_thread+0x4d/0x3f0
[ 262.166560] ? process_one_work+0x3c0/0x3c0
[ 262.166563] kthread+0x12b/0x150
[ 262.166568] ? set_kthread_struct+0x40/0x40
[ 262.166571] ret_from_fork+0x22/0x30
Signed-off-by: yipechai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
is sought through that bus "pointer" and not through the RAS bus "pointer".
Cc: Roy Sun <[email protected]>
Cc: Alex Deucher <[email protected]>
Fixes: 2f60dd50769efc ("drm/amd: Expose the FRU SMU I2C bus")
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Noticed the below warning while running a pytorch workload on vega10
GPUs. Change to trylock to avoid conflicts with already held reservation
locks.
[ +0.000003] WARNING: possible recursive locking detected
[ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[ +0.000004] --------------------------------------------
[ +0.000002] python/4822 is trying to acquire lock:
[ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000203]
but task is already holding lock:
[ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000017]
other info that might help us debug this:
[ +0.000002] Possible unsafe locking scenario:
[ +0.000003] CPU0
[ +0.000002] ----
[ +0.000002] lock(reservation_ww_class_mutex);
[ +0.000004] lock(reservation_ww_class_mutex);
[ +0.000003]
*** DEADLOCK ***
[ +0.000002] May be due to missing lock nesting notation
[ +0.000003] 7 locks held by python/4822:
[ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[ +0.000236] #3: ffffb2b35606fd28
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[ +0.000235] #4: ffff932cbb7181f8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
drm_dev_enter+0x5/0xa0 [drm]
[ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[ +0.000195]
stack backtrace:
[ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
5.13.0-kfd-rajneesh #1030
[ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
08/29/2018
[ +0.000003] Call Trace:
[ +0.000003] dump_stack+0x6d/0x89
[ +0.000010] __lock_acquire+0xb93/0x1a90
[ +0.000009] lock_acquire+0x25d/0x2d0
[ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060
[ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] ? lock_release+0x13f/0x270
[ +0.000005] ? lock_is_held_type+0xa2/0x110
[ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm]
[ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu]
[ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
[amdgpu]
[ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu]
[ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[ +0.000216] ? lock_release+0x13f/0x270
[ +0.000006] ? __fget_files+0x107/0x1e0
[ +0.000007] __x64_sys_ioctl+0x8b/0xd0
[ +0.000007] do_syscall_64+0x36/0x70
[ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ +0.000007] RIP: 0033:0x7fbff90a7317
[ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
00007fbff90a7317
[ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
0000000000000004
[ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
00007fbcc402d880
[ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
00000000c0184b18
[ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
00007fbcc402d820
Cc: Christian König <[email protected]>
Cc: Felix Kuehling <[email protected]>
Cc: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.
Cc: Kent Russell <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Harish Kasiviswanathan <[email protected]>
Reviewed-by: Kent Russell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.
Cc: Alex Deucher <[email protected]>
Cc: Kent Russell <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Harish Kasiviswanathan <[email protected]>
Reviewed-by: Kent Russell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
This is consistent with the rest of the kernel code.
Cc: Kent Russell <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Harish Kasiviswanathan <[email protected]>
Reviewed-by: Kent Russell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also setup xnack mode and free up the resources.
KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr
interface but we must clear the flags during restore as there might be
some default flags set when the prange is created. Also handle the
invalid PREFETCH atribute values saved during checkpoint by replacing
them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute.
(rajneesh: Fixed the checkpatch reported problems)
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
During checkpoint stage, save the shared virtual memory ranges and
attributes for the target process. A process may contain a number of svm
ranges and each range might contain a number of attributes. While not
all attributes may be applicable for a given prange but during
checkpoint we store all possible values for the max possible attribute
types.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the the
CRIU IOCTL to discover the svm ranges in the target process and a future
patches brings in the required support for checkpoint and restore for
SVM ranges.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore
support its possible that the SVM ranges can be resumed on another node
where the actual_gpu_id may not be same as the original (user_gpu_id)
gpu id. So modify svm code to use user_gpu_id.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct
from current but for a Checkpoint or Restore operation, the current->mm
will fetch the mm for the CRIU master process. So modify these helpers to
accept the task mm for a target kfd process to support Checkpoint
Restore.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
KFD buffer objects do not associate a GEM handle with them so cannot
directly be used with libdrm to initiate a system dma (sDMA) operation
to speedup the checkpoint and restore operation so export them as dmabuf
objects and use with libdrm helper (amdgpu_bo_import) to further process
the sdma command submissions.
With sDMA, we see huge improvement in checkpoint and restore operations
compared to the generic pci based access via host data path.
Suggested-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still refer
use the original gpu_id's in the ioctl calls. Adding code to create a
gpu id mapping so that kfd can determine actual gpu_id during the user
ioctl's.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When re-creating queues during CRIU restore, restore the queue with the
same doorbell id value used during CRIU dump.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When re-creating queues during CRIU restore, restore the queue with the
same sdma id value used during CRIU dump.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO
op the queues will be stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since its called by CRIU master restore process for all the target
processes being resumed by CRIU.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This implements the KFD CRIU Restore ioctl that lays the basic
foundation for the CRIU restore operation. It provides support to
create the buffer objects corresponding to the checkpointed image.
This ioctl creates various types of buffer objects such as VRAM,
MMIO, Doorbell, GTT based on the date sent from the userspace plugin.
The data mostly contains the previously checkpointed KFD images from
some KFD processs.
While restoring a criu process, attach old IDR values to newly
created BOs. This also adds the minimal gpu mapping support for a single
gpu checkpoint restore use case.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This IOCTL op is expected to be called as a precursor to the actual
Checkpoint operation. This does the basic discovery into the target
process seized by CRIU and relays the information to the userspace that
utilizes it to start the Checkpoint operation via another dedicated
IOCTL op.
The process_info IOCTL op determines the number of GPUs, buffer objects
that are associated with the target process, its process id in
caller's namespace since /proc/pid/mem interface maybe used to drain
the contents of the discovered buffer objects in userspace and getpid
returns the pid of CRIU dumper process. Also the pid of a process
inside a container might be different than its global pid so return
the ns pid.
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can
snapshot a running process and later restore it on same or a remote
machine but expects the processes that have a device file (e.g. GPU)
associated with them, provide necessary driver support to assist CRIU
and its extensible plugin interface. Thus, In order to support the
Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute
Kernel driver, needs to provide a set of new APIs that provide
necessary VRAM metadata and its contents to a userspace component
(CRIU plugin) that can store it in form of image files.
This introduces some new ioctls which will be used to checkpoint-Restore
any KFD bound user process. KFD only allows ioctl calls from the same
process that opened the KFD file descriptor. Since these ioctls are
expected to be called from a KFD criu plugin which has elevated ptrace
attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached with
the file descriptors so modify KFD to allow such calls.
(API redesigned by David Yat Sin)
Suggested-by: Felix Kuehling <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: David Yat Sin <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g,
[18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.
Make it dev_dbg_once(), as it isn't something correctible during boot or
thereafter, so printing just once is sufficient. Also sanitize the message.
Cc: Alex Deucher <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: John Clements <[email protected]>
Cc: Tao Zhou <[email protected]>
Cc: yipechai <[email protected]>
Fixes: 8b0fb0e967c1 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Some people complained about the name and this matches much
more Linux naming conventions for object functions.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Whenever a bo_va structure is added or removed the VM and eventually
added BO should be locked.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
linux/string_helpers.h provides a helper to return "yes"/"no" strings.
Replace the open coded versions with str_yes_no(). The places were
identified with the following semantic patch:
@@
expression b;
@@
- b ? "yes" : "no"
+ str_yes_no(b)
Then the includes were added, so we include-what-we-use, and parenthesis
adjusted in drivers/gpu/drm/v3d/v3d_debugfs.c. After the conversion we
still see the same binary sizes:
text data bss dec hex filename
51149 3295 212 54656 d580 virtio/virtio-gpu.ko.old
51149 3295 212 54656 d580 virtio/virtio-gpu.ko
1441491 60340 800 1502631 16eda7 radeon/radeon.ko.old
1441491 60340 800 1502631 16eda7 radeon/radeon.ko
6125369 328538 34000 6487907 62ff63 amd/amdgpu/amdgpu.ko.old
6125369 328538 34000 6487907 62ff63 amd/amdgpu/amdgpu.ko
411986 10490 6176 428652 68a6c drm.ko.old
411986 10490 6176 428652 68a6c drm.ko
98129 1636 264 100029 186bd dp/drm_dp_helper.ko.old
98129 1636 264 100029 186bd dp/drm_dp_helper.ko
1973432 109640 2352 2085424 1fd230 nouveau/nouveau.ko.old
1973432 109640 2352 2085424 1fd230 nouveau/nouveau.ko
Signed-off-by: Lucas De Marchi <[email protected]>
Reviewed-by: Jani Nikula <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Remove the local yesno() implementation and adopt the str_yes_no() from
linux/string_helpers.h.
Signed-off-by: Lucas De Marchi <[email protected]>
Reviewed-by: Harry Wentland <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
First backmerge into drm-misc-next. Required for more helpers backmerged,
and to pull in 5.17 (rc2).
Signed-off-by: Maarten Lankhorst <[email protected]>
|
|
We probably never trigger this, but the logic inside the check is
inverted.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
dGPUs connected to Intel systems configured for suspend to idle
will not have the power rails cut at suspend and resetting the GPU
may lead to problematic behaviors.
Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
panels
The eDP link rate reported by the DP_MAX_LINK_RATE dpcd register (0xa) is
contradictory to the highest rate supported reported by
EDID (0xc = LINK_RATE_RBR2). The effects of this compounded with commit
'4a8ca46bae8a ("drm/amd/display: Default max bpc to 16 for eDP")' results
in no display modes being found and a dark panel.
For now, simply force the maximum supported link rate for the eDP attached
2018 15" Apple Retina panels.
Additionally, we must also check the firmware revision since the device ID
reported by the DPCD is identical to that of the more capable 16,1,
incorrectly quirking it. We also use said firmware check to quirk the
refreshed 15,1 models with Vega graphics as they use a slightly newer
firmware version.
Tested-by: Aun-Ali Zaidi <[email protected]>
Reviewed-by: Harry Wentland <[email protected]>
Signed-off-by: Aun-Ali Zaidi <[email protected]>
Signed-off-by: Aditya Garg <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
[Why]
This change causes regression, that prevents some systems
from lighting up internal displays.
[How]
Revert this patch until a new solution is ready.
Tested-by: Daniel Wheeler <[email protected]>
Reviewed-by: Charlene Liu <[email protected]>
Acked-by: Stylon Wang <[email protected]>
Signed-off-by: Zhan Liu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
[Why]
The original latencies were causing underflow in some modes.
Resolution: 2880x1620@60p when HDR enable
[How]
1. Replace with the up-to-date watermark values based on new measurments
2. Correct the ddr_wm_table name to DDR5 on DCN31
Tested-by: Daniel Wheeler <[email protected]>
Reviewed-by: Aric Cyr <[email protected]>
Acked-by: Stylon Wang <[email protected]>
Signed-off-by: Paul Hsieh <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
[Why]
There is underflow / visual corruption DCN301, for high
bandwidth MST DSC configurations such as 2x1440p144 or 2x4k60.
[How]
Use up-to-date watermark values for DCN301.
Reviewed-by: Zhan Liu <[email protected]>
Signed-off-by: Agustin Gutierrez <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
We observed a GPU hang when querying GMC CG state(i.e.,
cat amdgpu_pm_info) on cyan skillfish. Acctually, cyan
skillfish doesn't support any CG features.
Just prevent it from accessing GMC CG registers.
Signed-off-by: Lang Yu <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
This will cause misconfigured systems to not run the GPU suspend
routines.
* In APUs that are properly configured system will go into s2idle.
* In APUs that are intended to be S3 but user selects
s2idle the GPU will stay fully powered for the suspend.
* In APUs that are intended to be s2idle and system misconfigured
the GPU will stay fully powered for the suspend.
* In systems that are intended to be s2idle, but AMD dGPU is also
present, the dGPU will go through S3
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This will be used to help make decisions on what to do in
misconfigured systems.
v2: squash in semicolon fix from Stephen Rothwell
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Assigning 0L to a pointer variable caused the following warning:
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dsc/rc_calc_fpu.c:71:40:
warning: Using plain integer as NULL pointer
In order to remove this warning, this commit assigns a NULL pointer to
the pointer variable that caused this issue.
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Magali Lemes <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
write offset
(v3)
Rewrote patchset to order patches as (API, hw impl, usecase)
- added API for new power management function emit_clk_levels
This function should duplicate the functionality of print_clk_levels,
but this solution passes the buffer base and write offset down the stack.
- new powerplay function emit_clock_levels, implemented by smu_emit_ppclk_levels()
This function parallels the implementation of smu_print_ppclk_levels and
calls emit_clk_levels, and allows the returns of errors
- new helper function smu_convert_to_smuclk called by smu_print_ppclk_levels and
smu_emit_ppclk_levels
Signed-off-by: Darren Powell <[email protected]>
Reviewed-By: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
trace_amdgpu_vm_update_ptes trace unable to log when nptes too large
Signed-off-by: Somalapuram Amaranath <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
dGPUs connected to Intel systems configured for suspend to idle
will not have the power rails cut at suspend and resetting the GPU
may lead to problematic behaviors.
Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|