aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd
AgeCommit message (Collapse)AuthorFilesLines
2022-02-07drm/amdgpu: add utcl2_harvest to gc 10.3.1Aaron Liu1-1/+6
Confirmed with hardware team, there is harvesting for gc 10.3.1. Signed-off-by: Aaron Liu <[email protected]> Reviewed-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: fix list add issue in vram reserveTao Zhou1-1/+1
The parameter order in the list_add_tail is incorrect, it causes the reuse of ras reserved page. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07Revert "drm/amdgpu: Add judgement to avoid infinite loop"yipechai1-4/+0
The commit d5e8ff5f7b2a ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop") had fixed this defect. Revert workaround commit a2170b4af62f ("drm/amdgpu: Add judgement to avoid infinite loop"). Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Fixed the defect of soft lock caused by infinite loopyipechai2-7/+31
1. The infinite loop case only occurs on multiple cards support ras functions. 2. The explanation of root cause refer to commit 76641cbbf196 ("drm/amdgpu: Add judgement to avoid infinite loop"). 3. Create new node to manage each unique ras instance to guarantee each device .ras_list is completely independent. 4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block interface for each ras block"). 5. The soft locked logs are as follows: [ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu [ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020 [ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu] [ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45 [ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202 [ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248 [ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000 [ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001 [ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e [ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003 [ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000 [ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0 [ 262.166267] Call Trace: [ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu] [ 262.166529] ? psi_task_switch+0xd2/0x250 [ 262.166537] ? __switch_to+0x11d/0x460 [ 262.166542] ? __switch_to_asm+0x36/0x70 [ 262.166549] process_one_work+0x220/0x3c0 [ 262.166556] worker_thread+0x4d/0x3f0 [ 262.166560] ? process_one_work+0x3c0/0x3c0 [ 262.166563] kthread+0x12b/0x150 [ 262.166568] ? set_kthread_struct+0x40/0x40 [ 262.166571] ret_from_fork+0x22/0x30 Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Set FRU bus for Aldebaran and Vega 20Luben Tuikov2-1/+3
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20 ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU is sought through that bus "pointer" and not through the RAS bus "pointer". Cc: Roy Sun <[email protected]> Cc: Alex Deucher <[email protected]> Fixes: 2f60dd50769efc ("drm/amd: Expose the FRU SMU I2C bus") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Fix recursive locking warningRajneesh Bhardwaj1-1/+2
Noticed the below warning while running a pytorch workload on vega10 GPUs. Change to trylock to avoid conflicts with already held reservation locks. [ +0.000003] WARNING: possible recursive locking detected [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted [ +0.000004] -------------------------------------------- [ +0.000002] python/4822 is trying to acquire lock: [ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000203] but task is already holding lock: [ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm] [ +0.000017] other info that might help us debug this: [ +0.000002] Possible unsafe locking scenario: [ +0.000003] CPU0 [ +0.000002] ---- [ +0.000002] lock(reservation_ww_class_mutex); [ +0.000004] lock(reservation_ww_class_mutex); [ +0.000003] *** DEADLOCK *** [ +0.000002] May be due to missing lock nesting notation [ +0.000003] 7 locks held by python/4822: [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at: kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu] [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu] [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu] [ +0.000236] #3: ffffb2b35606fd28 (reservation_ww_class_acquire){+.+.}-{0:0}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu] [ +0.000235] #4: ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm] [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at: drm_dev_enter+0x5/0xa0 [drm] [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3}, at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu] [ +0.000195] stack backtrace: [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted 5.13.0-kfd-rajneesh #1030 [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02 08/29/2018 [ +0.000003] Call Trace: [ +0.000003] dump_stack+0x6d/0x89 [ +0.000010] __lock_acquire+0xb93/0x1a90 [ +0.000009] lock_acquire+0x25d/0x2d0 [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000184] ? lock_is_held_type+0xa2/0x110 [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060 [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000183] ? lock_release+0x13f/0x270 [ +0.000005] ? lock_is_held_type+0xa2/0x110 [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm] [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu] [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu] [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu] [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu] [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu] [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu] [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu] [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu] [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60 [amdgpu] [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu] [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu] [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu] [ +0.000216] ? lock_release+0x13f/0x270 [ +0.000006] ? __fget_files+0x107/0x1e0 [ +0.000007] __x64_sys_ioctl+0x8b/0xd0 [ +0.000007] do_syscall_64+0x36/0x70 [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae [ +0.000007] RIP: 0033:0x7fbff90a7317 [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48 [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX: 00007fbff90a7317 [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI: 0000000000000004 [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09: 00007fbcc402d880 [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12: 00000000c0184b18 [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15: 00007fbcc402d820 Cc: Christian König <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Prevent random memory access in FRU codeLuben Tuikov1-10/+12
Prevent random memory access in the FRU EEPROM code by passing the size of the destination buffer to the reading routine, and reading no more than the size of the buffer. Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Don't offset by 2 in FRU EEPROMLuben Tuikov1-9/+4
Read buffers no longer expose the I2C address, and so we don't need to offset by two when we get the read data. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3") Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Nerf "buff" to "buf"Luben Tuikov1-14/+14
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er). This is consistent with the rest of the kernel code. Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU resume shared virtual memory rangesRajneesh Bhardwaj3-0/+119
In CRIU resume stage, resume all the shared virtual memory ranges from the data stored inside the resuming kfd process during CRIU restore phase. Also setup xnack mode and free up the resources. KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr interface but we must clear the flags during restore as there might be some default flags set when the prange is created. Also handle the invalid PREFETCH atribute values saved during checkpoint by replacing them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute. (rajneesh: Fixed the checkpatch reported problems) Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU prepare for svm resumeRajneesh Bhardwaj4-2/+73
During CRIU restore phase, the VMAs for the virtual address ranges are not at their final location yet so in this stage, only cache the data required to successfully resume the svm ranges during an imminent CRIU resume phase. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Save Shared Virtual Memory rangesRajneesh Bhardwaj3-1/+108
During checkpoint stage, save the shared virtual memory ranges and attributes for the target process. A process may contain a number of svm ranges and each range might contain a number of attributes. While not all attributes may be applicable for a given prange but during checkpoint we store all possible values for the max possible attribute types. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Discover svm rangesRajneesh Bhardwaj4-6/+81
A KFD process may contain a number of virtual address ranges for shared virtual memory management and each such range can have many SVM attributes spanning across various nodes within the process boundary. This change reports the total number of such SVM ranges and their total private data size by extending the PROCESS_INFO op of the the CRIU IOCTL to discover the svm ranges in the target process and a future patches brings in the required support for checkpoint and restore for SVM ranges. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: use user_gpu_id for svm rangesRajneesh Bhardwaj1-2/+2
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore support its possible that the SVM ranges can be resumed on another node where the actual_gpu_id may not be same as the original (user_gpu_id) gpu id. So modify svm code to use user_gpu_id. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU allow external mm for svm rangesRajneesh Bhardwaj1-8/+9
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct from current but for a Checkpoint or Restore operation, the current->mm will fetch the mm for the CRIU master process. So modify these helpers to accept the task mm for a target kfd process to support Checkpoint Restore. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU checkpoint and restore xnack modeRajneesh Bhardwaj2-0/+16
Recoverable page faults are represented by the xnack mode setting inside a kfd process and are used to represent the device page faults. For CR, we don't consider negative values which are typically used for querying the current xnack mode without modifying it. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU export BOs as prime dmabuf objectsRajneesh Bhardwaj1-2/+69
KFD buffer objects do not associate a GEM handle with them so cannot directly be used with libdrm to initiate a system dma (sDMA) operation to speedup the checkpoint and restore operation so export them as dmabuf objects and use with libdrm helper (amdgpu_bo_import) to further process the sdma command submissions. With sDMA, we see huge improvement in checkpoint and restore operations compared to the generic pci based access via host data path. Suggested-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU implement gpu_id remappingDavid Yat Sin5-160/+414
When doing a restore on a different node, the gpu_id's on the restore node may be different. But the user space application will still refer use the original gpu_id's in the ioctl calls. Adding code to create a gpu id mapping so that kfd can determine actual gpu_id during the user ioctl's. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU checkpoint and restore eventsDavid Yat Sin3-89/+280
Add support to existing CRIU ioctl's to save and restore events during criu checkpoint and restore. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU checkpoint and restore queue control stackDavid Yat Sin11-53/+138
Checkpoint contents of queue control stacks on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU checkpoint and restore queue mqdsDavid Yat Sin11-26/+516
Checkpoint contents of queue MQD's on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU restore queue doorbell idDavid Yat Sin1-19/+41
When re-creating queues during CRIU restore, restore the queue with the same doorbell id value used during CRIU dump. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU restore sdma id for queuesDavid Yat Sin3-15/+40
When re-creating queues during CRIU restore, restore the queue with the same sdma id value used during CRIU dump. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU restore queue idsDavid Yat Sin4-9/+34
When re-creating queues during CRIU restore, restore the queue with the same queue id value used during CRIU dump. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU add queues supportDavid Yat Sin3-8/+353
Add support to existing CRIU ioctl's to save number of queues and queue properties for each queue during checkpoint and re-create queues on restore. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Implement KFD unpause operationDavid Yat Sin3-1/+39
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO op the queues will be stay in an evicted state. Once the plugin is done draining BO contents, it is safe to perform an UNPAUSE op for the queues to resume. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Implement KFD resume ioctlRajneesh Bhardwaj5-14/+122
This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing CRIU restore MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process i.e. criu_resume ioctl op is received, and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since its called by CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Implement KFD restore ioctlRajneesh Bhardwaj1-1/+297
This implements the KFD CRIU Restore ioctl that lays the basic foundation for the CRIU restore operation. It provides support to create the buffer objects corresponding to the checkpointed image. This ioctl creates various types of buffer objects such as VRAM, MMIO, Doorbell, GTT based on the date sent from the userspace plugin. The data mostly contains the previously checkpointed KFD images from some KFD processs. While restoring a criu process, attach old IDR values to newly created BOs. This also adds the minimal gpu mapping support for a single gpu checkpoint restore use case. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Implement KFD checkpoint ioctlRajneesh Bhardwaj6-2/+213
This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Implement KFD process_info ioctlRajneesh Bhardwaj1-1/+55
This IOCTL op is expected to be called as a precursor to the actual Checkpoint operation. This does the basic discovery into the target process seized by CRIU and relays the information to the userspace that utilizes it to start the Checkpoint operation via another dedicated IOCTL op. The process_info IOCTL op determines the number of GPUs, buffer objects that are associated with the target process, its process id in caller's namespace since /proc/pid/mem interface maybe used to drain the contents of the discovered buffer objects in userspace and getpid returns the pid of CRIU dumper process. Also the pid of a process inside a container might be different than its global pid so return the ns pid. Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdkfd: CRIU Introduce Checkpoint-Restore APIsRajneesh Bhardwaj2-2/+161
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can snapshot a running process and later restore it on same or a remote machine but expects the processes that have a device file (e.g. GPU) associated with them, provide necessary driver support to assist CRIU and its extensible plugin interface. Thus, In order to support the Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute Kernel driver, needs to provide a set of new APIs that provide necessary VRAM metadata and its contents to a userspace component (CRIU plugin) that can store it in form of image files. This introduces some new ioctls which will be used to checkpoint-Restore any KFD bound user process. KFD only allows ioctl calls from the same process that opened the KFD file descriptor. Since these ioctls are expected to be called from a KFD criu plugin which has elevated ptrace attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached with the file descriptors so modify KFD to allow such calls. (API redesigned by David Yat Sin) Suggested-by: Felix Kuehling <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: Print once if RAS unsupportedLuben Tuikov1-8/+8
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes the dmesg log to fill up with the same message, e.g, [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function. Make it dev_dbg_once(), as it isn't something correctible during boot or thereafter, so printing just once is sufficient. Also sanitize the message. Cc: Alex Deucher <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: John Clements <[email protected]> Cc: Tao Zhou <[email protected]> Cc: yipechai <[email protected]> Fixes: 8b0fb0e967c1 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: rename amdgpu_vm_bo_rmv to _delChristian König6-9/+9
Some people complained about the name and this matches much more Linux naming conventions for object functions. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Daniel Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm/amdgpu: add some lockdep checks to the VM codeChristian König1-0/+6
Whenever a bo_va structure is added or removed the VM and eventually added BO should be locked. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Acked-by: Daniel Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-07drm: Convert open-coded yes/no strings to yesno()Lucas De Marchi1-1/+3
linux/string_helpers.h provides a helper to return "yes"/"no" strings. Replace the open coded versions with str_yes_no(). The places were identified with the following semantic patch: @@ expression b; @@ - b ? "yes" : "no" + str_yes_no(b) Then the includes were added, so we include-what-we-use, and parenthesis adjusted in drivers/gpu/drm/v3d/v3d_debugfs.c. After the conversion we still see the same binary sizes: text data bss dec hex filename 51149 3295 212 54656 d580 virtio/virtio-gpu.ko.old 51149 3295 212 54656 d580 virtio/virtio-gpu.ko 1441491 60340 800 1502631 16eda7 radeon/radeon.ko.old 1441491 60340 800 1502631 16eda7 radeon/radeon.ko 6125369 328538 34000 6487907 62ff63 amd/amdgpu/amdgpu.ko.old 6125369 328538 34000 6487907 62ff63 amd/amdgpu/amdgpu.ko 411986 10490 6176 428652 68a6c drm.ko.old 411986 10490 6176 428652 68a6c drm.ko 98129 1636 264 100029 186bd dp/drm_dp_helper.ko.old 98129 1636 264 100029 186bd dp/drm_dp_helper.ko 1973432 109640 2352 2085424 1fd230 nouveau/nouveau.ko.old 1973432 109640 2352 2085424 1fd230 nouveau/nouveau.ko Signed-off-by: Lucas De Marchi <[email protected]> Reviewed-by: Jani Nikula <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-02-07drm/amd/display: Use str_yes_no()Lucas De Marchi1-9/+5
Remove the local yesno() implementation and adopt the str_yes_no() from linux/string_helpers.h. Signed-off-by: Lucas De Marchi <[email protected]> Reviewed-by: Harry Wentland <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-02-07Merge remote-tracking branch 'drm/drm-next' into drm-misc-nextMaarten Lankhorst249-3242/+4941
First backmerge into drm-misc-next. Required for more helpers backmerged, and to pull in 5.17 (rc2). Signed-off-by: Maarten Lankhorst <[email protected]>
2022-02-02drm/amdgpu: fix logic inversion in checkChristian König1-1/+1
We probably never trigger this, but the logic inside the check is inverted. Signed-off-by: Christian König <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabledMario Limonciello1-2/+1
dGPUs connected to Intel systems configured for suspend to idle will not have the power rails cut at suspend and resetting the GPU may lead to problematic behaviors. Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879 Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina ↵Aun-Ali Zaidi1-0/+20
panels The eDP link rate reported by the DP_MAX_LINK_RATE dpcd register (0xa) is contradictory to the highest rate supported reported by EDID (0xc = LINK_RATE_RBR2). The effects of this compounded with commit '4a8ca46bae8a ("drm/amd/display: Default max bpc to 16 for eDP")' results in no display modes being found and a dark panel. For now, simply force the maximum supported link rate for the eDP attached 2018 15" Apple Retina panels. Additionally, we must also check the firmware revision since the device ID reported by the DPCD is identical to that of the more capable 16,1, incorrectly quirking it. We also use said firmware check to quirk the refreshed 15,1 models with Vega graphics as they use a slightly newer firmware version. Tested-by: Aun-Ali Zaidi <[email protected]> Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Aun-Ali Zaidi <[email protected]> Signed-off-by: Aditya Garg <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02drm/amd/display: revert "Reset fifo after enable otg"Zhan Liu6-31/+0
[Why] This change causes regression, that prevents some systems from lighting up internal displays. [How] Revert this patch until a new solution is ready. Tested-by: Daniel Wheeler <[email protected]> Reviewed-by: Charlene Liu <[email protected]> Acked-by: Stylon Wang <[email protected]> Signed-off-by: Zhan Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02drm/amd/display: watermark latencies is not enough on DCN31Paul Hsieh1-10/+10
[Why] The original latencies were causing underflow in some modes. Resolution: 2880x1620@60p when HDR enable [How] 1. Replace with the up-to-date watermark values based on new measurments 2. Correct the ddr_wm_table name to DDR5 on DCN31 Tested-by: Daniel Wheeler <[email protected]> Reviewed-by: Aric Cyr <[email protected]> Acked-by: Stylon Wang <[email protected]> Signed-off-by: Paul Hsieh <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02drm/amd/display: Update watermark values for DCN301Agustin Gutierrez1-8/+8
[Why] There is underflow / visual corruption DCN301, for high bandwidth MST DSC configurations such as 2x1440p144 or 2x4k60. [How] Use up-to-date watermark values for DCN301. Reviewed-by: Zhan Liu <[email protected]> Signed-off-by: Agustin Gutierrez <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02drm/amdgpu: fix a potential GPU hang on cyan skillfishLang Yu1-0/+3
We observed a GPU hang when querying GMC CG state(i.e., cat amdgpu_pm_info) on cyan skillfish. Acctually, cyan skillfish doesn't support any CG features. Just prevent it from accessing GMC CG registers. Signed-off-by: Lang Yu <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-02-02drm/amd: Only run s3 or s0ix if system is configured properlyMario Limonciello1-0/+8
This will cause misconfigured systems to not run the GPU suspend routines. * In APUs that are properly configured system will go into s2idle. * In APUs that are intended to be S3 but user selects s2idle the GPU will stay fully powered for the suspend. * In APUs that are intended to be s2idle and system misconfigured the GPU will stay fully powered for the suspend. * In systems that are intended to be s2idle, but AMD dGPU is also present, the dGPU will go through S3 Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amd: add support to check whether the system is set to s3Mario Limonciello2-0/+15
This will be used to help make decisions on what to do in misconfigured systems. v2: squash in semicolon fix from Stephen Rothwell Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amd/display: Use NULL pointer instead of plain integerMagali Lemes1-1/+1
Assigning 0L to a pointer variable caused the following warning: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dsc/rc_calc_fpu.c:71:40: warning: Using plain integer as NULL pointer In order to remove this warning, this commit assigns a NULL pointer to the pointer variable that caused this issue. Reported-by: kernel test robot <[email protected]> Signed-off-by: Magali Lemes <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02amdgpu/pm: Implement new API function "emit" that accepts buffer base and ↵Darren Powell5-5/+77
write offset (v3) Rewrote patchset to order patches as (API, hw impl, usecase) - added API for new power management function emit_clk_levels This function should duplicate the functionality of print_clk_levels, but this solution passes the buffer base and write offset down the stack. - new powerplay function emit_clock_levels, implemented by smu_emit_ppclk_levels() This function parallels the implementation of smu_print_ppclk_levels and calls emit_clk_levels, and allows the returns of errors - new helper function smu_convert_to_smuclk called by smu_print_ppclk_levels and smu_emit_ppclk_levels Signed-off-by: Darren Powell <[email protected]> Reviewed-By: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amdgpu: limit the number of dst address in traceSomalapuram Amaranath2-4/+3
trace_amdgpu_vm_update_ptes trace unable to log when nptes too large Signed-off-by: Somalapuram Amaranath <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-02-02drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabledMario Limonciello1-2/+1
dGPUs connected to Intel systems configured for suspend to idle will not have the power rails cut at suspend and resetting the GPU may lead to problematic behaviors. Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879 Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>