aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-04-23drm/amdgpu: remove unused vm context flagsNirmoy Das1-3/+0
Remove unused AMDGPU_VM_CONTEXT_GFX and AMDGPU_VM_CONTEXT_COMPUTE flags. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: cleanup amdgpu_vm_init()Nirmoy Das3-16/+6
Currently only way to create compute vm is through amdgpu_vm_make_compute(). So vm_context isn't required anymore for amdgpu_vm_init(). Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: expose amdgpu_bo_create_shadow()Nirmoy Das2-3/+6
Exposed amdgpu_bo_create_shadow() will be needed for amdgpu_vm handling. Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: fix coding style and documentation in amdgpu_vram_mgr.cChristian König1-80/+93
No functional changes, just cleaning up some leftovers and improve documentation. Signed-off-by: Christian König <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: fix coding style and documentation in amdgpu_gtt_mgr.cChristian König1-79/+90
Avoid the forward define, fix coding style, add documentation. No functional change. Signed-off-by: Christian König <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdkfd: remove redundant initialization to variable rColin Ian King1-1/+1
The variable r is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdkfd: fix uint32 variable compared to less than zeroColin Ian King1-1/+1
Currently the call to kfd_process_gpuidx_from_gpuid is returning an int value and this is being assigned to a uint32_t variable gpuidx and this is being checked for a negative error return which is always going to be false. Fix this by making gpuidx an int32_t. This makes gpuidx also type consistent with the use of gpuidx from the callers. Addresses-Coverity: ("Unsigned compared against 0") Fixes: cda0f85bfa5e ("drm/amdkfd: refine migration policy with xnack on") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdkfd: set attribute access for default rangesAlex Sierra1-4/+3
Attribute access value for default ranges is set, based on process xnack on/off. XNACK ON has GPU access attribute for unregistered ranges through page fault. While XNACK OFF has no access attribute for unregistered ranges. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdkfd: svm ranges creation for unregistered memoryAlex Sierra1-3/+104
SVM ranges are created for unregistered memory, triggered by page faults. These ranges are migrated/mapped to GPU VRAM memory. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: extend xnack limit page fault timeoutAlex Sierra1-0/+2
Extending this timeout will prevent IH from storm interrupts coming from SDMA while a page fault is active. Currently, on Aldebaran, handling that many interrupts can take a lot of CPU time (up to 4 seconds). This eventually causes timeouts in other tasks. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: disable gfx ras by default in aldebaranHawking Zhang1-2/+1
aldebaran gfx ras is still under development Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdkfd: add per-vmid-debug map_process_supportJonathan Kim5-4/+161
In order to support multi-process debugging, HWS PM4 packet MAP_PROCESS requires an extension of 5 DWORDS to support targeting of per-vmid SPI debug control registers as well as watch points per process. Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amd/amdgpu: add cglsKenneth Feng1-0/+3
enable cgls to improve the runtime power efficiency. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: optimize gfx ras features flag cleanStanley.Yang1-5/+5
Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: Enable SDMA MGCG for VangoghJinzhou Su2-0/+5
Add flags AMD_CG_SUPPORT_SDMA_MGCG for Vangogh. Start to open sdma mgcg from firmware version 70. Signed-off-by: Jinzhou Su <[email protected]> Reviewed-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: add support for ras init flagsJohn Clements2-2/+14
conditionally configure ras for dgpu mode or poison propogation mode Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-23drm/amdgpu: refine gprs init shaders to check coverageDennis Li3-5/+382
Add codes to check whether all SIMDs are covered, make sure that all GPRs are initialized. Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_cs: Repair some function naming disparityLee Jones1-3/+3
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:685: warning: expecting prototype for cs_parser_fini(). Prototype was for amdgpu_cs_parser_fini() instead drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1502: warning: expecting prototype for amdgpu_cs_wait_all_fence(). Prototype was for amdgpu_cs_wait_all_fences() instead drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1656: warning: expecting prototype for amdgpu_cs_find_bo_va(). Prototype was for amdgpu_cs_find_mapping() instead Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_ring: Provide description for 'sched_score'Lee Jones1-0/+1
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:169: warning: Function parameter or member 'sched_score' not described in 'amdgpu_ring_init' Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_ttm: Fix incorrectly documented function ↵Lee Jones1-1/+1
'amdgpu_ttm_copy_mem_to_mem()' Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:311: warning: expecting prototype for amdgpu_copy_ttm_mem_to_mem(). Prototype was for amdgpu_ttm_copy_mem_to_mem() instead Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_gart: Correct a couple of function names in the docsLee Jones1-2/+2
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c:73: warning: expecting prototype for amdgpu_dummy_page_init(). Prototype was for amdgpu_gart_dummy_page_init() instead drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c:96: warning: expecting prototype for amdgpu_dummy_page_fini(). Prototype was for amdgpu_gart_dummy_page_fini() instead Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Nirmoy Das <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_fence: Provide description for 'sched_score'Lee Jones1-0/+1
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:444: warning: Function parameter or member 'sched_score' not described in 'amdgpu_fence_driver_init_ring' Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/radeon/radeon_device: Provide function name in kernel-doc headerLee Jones1-1/+2
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/radeon/radeon_device.c:1101: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amd/amdgpu/amdgpu_device: Remove unused variable 'r'Lee Jones1-3/+2
Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function ‘amdgpu_device_suspend’: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3733:6: warning: variable ‘r’ set but not used [-Wunused-but-set-variable] Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: Add CONFIG_HSA_AMD_SVMFelix Kuehling5-11/+67
Control whether to build SVM support into amdgpu with a Kconfig option. This makes it easier to disable it in production kernels if this new feature causes problems in production environments. Use "depends on" instead of "select" for DEVICE_PRIVATE, as is recommended for visible options. Reviewed-by: Philip Yang <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: Add SVM API support capability bitsPhilip Yang2-4/+12
SVMAPISupported property added to HSA_CAPABILITY, the value match HSA_CAPABILITY defined in Thunk spec: SVMAPISupported: it will not be supported on older kernels that don't have HMM or on systems with GFXv8 or older GPUs without support for 48-bit virtual addresses. CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match HSA_MEMORYPROPERTY defined in Thunk spec: CoherentHostAccess: whether or not device memory can be coherently accessed by the host CPU. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: multiple gpu migrate vram to vramFelix Kuehling3-15/+87
If prefetch range to gpu with acutal location is another gpu, or GPU retry fault restore pages to migrate the range with acutal location is gpu, then migrate from one gpu to another gpu. Use system memory as bridge because sdma engine may not able to access another gpu vram, use sdma of source gpu to migrate to system memory, then use sdma of destination gpu to migrate from system memory to gpu. Print out gpuid or gpuidx in debug messages. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: add svm range validate timestampFelix Kuehling2-0/+19
With xnack on, add validate timestamp in order to handle GPU vm fault from multiple GPUs. If GPU retry fault need migrate the range to the best restore location, use range validate timestamp to record system timestamp after range is restored to update GPU page table. Because multiple pages of same range have multiple retry fault, define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING to the long time period that pending retry fault may still comes after page table update, to skip duplicate retry fault of same range. If difference between system timestamp and range last validate timestamp is bigger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING, that means the retry fault is from another GPU, then continue to handle retry fault recover. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: refine migration policy with xnack onFelix Kuehling5-15/+150
With xnack on, GPU vm fault handler decide the best restore location, then migrate range to the best restore location and update GPU mapping to recover the GPU vm fault. Signed-off-by: Philip Yang <[email protected]> Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdgpu: add svm_bo eviction to enable_signal cbAlex Sierra1-3/+8
Add to amdgpu_amdkfd_fence.enable_signal callback, support for svm_bo fence eviction. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdgpu: svm bo enable_signal call conditionAlex Sierra1-0/+14
[why] To support svm bo eviction mechanism. [how] If the BO crated has AMDGPU_AMDKFD_CREATE_SVM_BO flag set, enable_signal callback will be called inside amdgpu_evict_flags. This also causes gutting of the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: add svm_bo eviction mechanism supportFelix Kuehling2-46/+168
svm_bo eviction mechanism is different from regular BOs. Every SVM_BO created contains one eviction fence and one worker item for eviction process. SVM_BOs can be attached to one or more pranges. For SVM_BO eviction mechanism, TTM will start to call enable_signal callback for every SVM_BO until VRAM space is available. Here, all the ttm_evict calls are synchronous, this guarantees that each eviction has completed and the fence has signaled before it returns. Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdgpu: add param bit flag to create SVM BOsAlex Sierra2-5/+6
Add CREATE_SVM_BO define bit for SVM BOs. Another define flag was moved to concentrate these KFD type flags in one include file. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: add svm_bo reference for eviction fenceAlex Sierra3-5/+10
[why] As part of the SVM functionality, the eviction mechanism used for SVM_BOs is different. This mechanism uses one eviction fence per prange, instead of one fence per kfd_process. [how] A svm_bo reference to amdgpu_amdkfd_fence to allow differentiate between SVM_BO or regular BO evictions. This also include modifications to set the reference at the fence creation call. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: SVM API call to restore page tablesAlex Sierra1-5/+15
Use SVM API to restore page tables when retry fault and compute context are enabled. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: page table restore through svm APIFelix Kuehling2-0/+61
Page table restore implementation in SVM API. This is called from the fault handler at amdgpu_vm. To update page tables through the page fault retry IH. Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdgpu: enable 48-bit IH timestamp counterAlex Sierra1-0/+1
By default this timestamp is 32 bit counter. It gets overflowed in around 10 minutes. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Philip Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: invalidate tables on page retry faultFelix Kuehling3-17/+70
GPU page tables are invalidated by unmapping prange directly at the mmu notifier, when page fault retry is enabled through amdgpu_noretry global parameter. The restore page table is performed at the page fault handler. If xnack is on, we update GPU mappings after migration to avoid unnecessary GPUVM faults. Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: HMM migrate vram to ramFelix Kuehling4-13/+429
If CPU page fault happens, HMM pgmap_ops callback migrate_to_ram start migrate memory from vram to ram in steps: 1. migrate_vma_pages get vram pages, and notify HMM to invalidate the pages, HMM interval notifier callback evict process queues 2. Allocate system memory pages 3. Use svm copy memory to migrate data from vram to ram 4. migrate_vma_pages copy pages structure from vram pages to ram pages 5. Return VM_FAULT_SIGBUS if migration failed, to notify application 6. migrate_vma_finalize put vram pages, page_free callback free vram pages and vram nodes 7. Restore work wait for migration is finished, then update GPU page table mapping to system memory, and resume process queues Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: HMM migrate ram to vramFelix Kuehling4-13/+502
Register svm range with same address and size but perferred_location is changed from CPU to GPU or from GPU to CPU, trigger migration the svm range from ram to vram or from vram to ram. If svm range prefetch location is GPU with flags KFD_IOCTL_SVM_FLAG_HOST_ACCESS, validate the svm range on ram first, then migrate it from ram to vram. After migrating to vram is done, CPU access will have cpu page fault, page fault handler migrate it back to ram and resume cpu access. Migration steps: 1. migrate_vma_pages get svm range ram pages, notify the interval is invalidated and unmap from CPU page table, HMM interval notifier callback evict process queues 2. Allocate new pages in vram using TTM 3. Use svm copy memory to sdma copy data from ram to vram 4. migrate_vma_pages copy ram pages structure to vram pages structure 5. migrate_vma_finalize put ram pages to free ram pages and memory 6. Restore work wait for migration is finished, then update GPUs page table mapping to new vram pages, resume process queues If migrate_vma_setup failed to collect all ram pages of range, retry 3 times until success to start migration. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: copy memory through gart tablePhilip Yang2-0/+177
Use sdma linear copy to migrate data between ram and vram. The sdma linear copy command uses kernel buffer function queue to access system memory through gart table. Use reserved gart table window 0 to map system page address, and vram page address is direct mapping. Use the same kernel buffer function to fill in gart table mapping, so this is serialized with memory copy by sdma job submit. We only need wait for the last memory copy sdma fence for larger buffer migration. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: support xgmi same hive mappingPhilip Yang1-19/+75
amdgpu_gmc_get_vm_pte use bo_va->is_xgmi same hive information to set pte flags to update GPU mapping. Add local structure variable bo_va, and update bo_va.is_xgmi, pass it to mapping->bo_va while mapping to GPU. Assuming xgmi pstate is hi after boot. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: validate vram svm range from TTMFelix Kuehling2-7/+309
If svm range perfetch location is not zero, use TTM to alloc amdgpu_bo vram nodes to validate svm range, then map vram nodes to GPUs. Use offset to sub allocate from the same amdgpu_bo to handle overlap vram range while adding new range or unmapping range. svm_bo has ref count to trace the shared ranges. If all ranges of shared amdgpu_bo are migrated to ram, ref count becomes 0, then amdgpu_bo is released, all ranges svm_bo is set to NULL. To migrate range from ram back to vram, allocate the same amdgpu_bo with previous offset if the range has svm_bo. If prange migrate to VRAM, no CPU mapping exist, then process exit will not have unmap callback for this prange to free prange and svm bo. Free outstanding pranges from svms list before process is freed in svm_range_list_fini. Signed-off-by: Philip Yang <[email protected]> Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: set memory limit to avoid OOM with HMM enabledPhilip Yang3-0/+14
HMM migration alloc sizeof(struct page) on system memory for each VRAM page, it is 1GB system memory reserved for 64GB VRAM. To avoid application OOM, increase system memory used size based on VRAM size of all GPUs, then application alloc memory will fail if system memory usage reach the limit. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Oak Zeng <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: register HMM device private zonePhilip Yang6-1/+164
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: add ioctl to configure and query xnack retriesAlex Sierra2-1/+70
Xnack retries are used for page fault recovery. Some AMD chip families support continuously retry while page table entries are invalid. The driver must handle the page fault interrupt and fill in a valid entry for the GPU to continue. This ioctl allows to enable/disable XNACK retries per KFD process. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: add xnack enabled flag to kfd_processAlex Sierra3-2/+68
XNACK mode controls the SQ RETRY_DISABLE setting that determines, whether recoverable page faults can be supported on GFXv9 hardware. Only on Aldebaran we can support different processes running with different XNACK modes. On older chips all processes must use the same RETRY_DISABLE setting. However, processes not relying on recoverable page faults can work with RETRY enabled. This means XNACK off is always available as a fallback so we can use the same mode on all GPUs in a process. Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdgpu: Enable retry faults unconditionally on AldebaranFelix Kuehling3-5/+12
This is needed to allow per-process XNACK mode selection in the SQ when booting with XNACK off by default. Signed-off-by: Felix Kuehling <[email protected]> Reviewed-by: Philip Yang <[email protected]> Tested-by: Alex Sierra <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: svm range eviction and restoreFelix Kuehling4-0/+140
HMM interval notifier callback notify CPU page table will be updated, stop process queues if the updated address belongs to svm range registered in process svms objects tree. Scheduled restore work to update GPU page table using new pages address in the updated svm range. The restore worker flushes any deferred work to make sure it restores an up-to-date svm_range_list. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-04-20drm/amdkfd: map svm range to GPUsFelix Kuehling4-10/+478
Use amdgpu_vm_bo_update_mapping to update GPU page table to map or unmap svm range system memory pages address to GPUs. Signed-off-by: Philip Yang <[email protected]> Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>