aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
AgeCommit message (Collapse)AuthorFilesLines
2023-01-10drm/amd: Use `amdgpu_ucode_*` helpers for GPU info binMario Limonciello1-11/+3
The `amdgpu_ucode_request` helper will ensure that the return code for missing firmware is -ENODEV so that early_init can fail. The `amdgpu_ucode_release` helper is for symmetry on unloading. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-09drm/amd: Delay removal of the firmware framebufferMario Limonciello1-0/+8
Removing the firmware framebuffer from the driver means that even if the driver doesn't support the IP blocks in a GPU it will no longer be functional after the driver fails to initialize. This change will ensure that unsupported IP blocks at least cause the driver to work with the EFI framebuffer. Cc: stable@vger.kernel.org Suggested-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03drm/amdgpu: rename vram_scratch into mem_scratchChristian König1-14/+14
Rename vram_scratch into mem_scratch and allow allocating it into GTT as well. The only problem with that is that we won't have a default page for the system aperture any more. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03drm/amdgpu: use VRAM|GTT for a bunch of kernel allocationsChristian König1-3/+5
Technically all of those can use GTT as well, no need to force things into VRAM. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03drm/amdgpu: adjust the sequence to check soft resetLikun Gao1-6/+2
1.Drop soft reset check when do should recover gpu check. (As it will skip gpu reset operation if some ip is hang but not support soft reset) 2.Check soft reset status before do soft reset when pre asic reset. a. If check soft reset return true, it means: some ip is hang and it also support soft reset, will try soft reset first. b. If check soft reset return false, it means: I. All the ip are not hang, will skip gpu reset. II. Some ip is hang but not support soft reset, will skip soft reset and retry with full reset later. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20drm/amdgpu: skip MES for S0ix as well since it's part of GFXAlex Deucher1-2/+3
It's also part of gfxoff. Cc: stable@vger.kernel.org # 6.0, 6.1 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20Revert "drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume"Alex Deucher1-16/+0
This reverts commit f543d28687480fad06b708bc6e0b0b6ec953b078. This is no longer needed since we no longer touch SDMA 5.x for s0i3. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20drm/amdgpu: for S0ix, skip SDMA 5.x+ suspend/resumeAlex Deucher1-0/+6
SDMA 5.x is part of the GFX block so it's controlled via GFXOFF. Skip suspend as it should be handled the same as GFX. v2: drop SDMA 4.x. That requires special handling. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20drm/amdgpu: don't mess with SDMA clock or powergating in S0ixAlex Deucher1-4/+6
It's handled by GFXOFF for SDMA 5.x and SMU saves the state on SDMA 4.x. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-13drm/amdgpu: Add an extra evict_resource call during device_suspend.Shikang Fan1-0/+5
- evict_resource is taking too long causing sriov full access mode timeout. So, add an extra evict_resource in the beginning as an early evict. Signed-off-by: Shikang Fan <shikang.fan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29drm/amdgpu: Fix potential double free and null pointer dereferenceLiang He1-0/+5
In amdgpu_get_xgmi_hive(), we should not call kfree() after kobject_put() as the PUT will call kfree(). In amdgpu_device_ip_init(), we need to check the returned *hive* which can be NULL before we dereference it. Signed-off-by: Liang He <windhl@126.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23drm/amdgpu: fix for suspend/resume kiq fence fallback under sriovShikang Fan1-11/+12
- in device_resume, sriov configure interrupt should be in full access, so release_full_gpu should be done after kfd_resume. - remove the previous workaround solution for sriov. Fixes: ec4927d463cb ("drm/amdgpu: fix for suspend/resume sequence under sriov") Signed-off-by: Shikang Fan <shikang.fan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23drm/amdgpu: fix pci device refcount leakYang Yingliang1-0/+4
As comment of pci_get_domain_bus_and_slot() says, it returns a pci device with refcount increment, when finish using it, the caller must decrement the reference count by calling pci_dev_put(). So before returning from amdgpu_device_resume|suspend_display_audio(), pci_dev_put() is called to avoid refcount leak. Fixes: 3f12acc8d6d4 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3") Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-22Merge tag 'amd-drm-next-6.2-2022-11-18' of ↵Dave Airlie1-127/+76
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.2-2022-11-18: amdgpu: - SR-IOV fixes - Clean up DC checks - DCN 3.2.x fixes - DCN 3.1.x fixes - Don't enable degamma on asics which don't support it - IP discovery fixes - BACO fixes - Fix vbios allocation handling when vkms is enabled - Drop buggy tdr advanced mode GPU reset handling - Fix the build when DCN is not set in kconfig - MST DSC fixes - Userptr fixes - FRU and RAS EEPROM fixes - VCN 4.x RAS support - Aldrebaran CU occupancy reporting fix - PSP ring cleanup amdkfd: - Memory limit fix - Enable cooperative launch on gfx 10.3 amd-drm-next-6.2-2022-11-11: amdgpu: - SMU 13.x updates - GPUVM TLB race fix - DCN 3.1.4 updates - DCN 3.2.x updates - PSR fixes - Kerneldoc fix - Vega10 fan fix - GPUVM locking fixes in error pathes - BACO fix for Beige Goby - EEPROM I2C address cleanup - GFXOFF fix - Fix DC memory leak in error pathes - Flexible array updates - Mtype fix for GPUVM PTEs - Move Kconfig into amdgpu directory - SR-IOV updates - Fix possible memory leak in CS IOCTL error path amdkfd: - Fix possible memory overrun - CRIU fixes radeon: - ACPI ref count fix - HDA audio notifier support - Move Kconfig into radeon directory UAPI: - Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI. These are used internally in the driver to align location based memory coherency requirements from memory allocated in the KFD with how we manage GPUVM PTEs. They are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now. They are just used internally in the kernel driver for now for existing KFD memory allocations. So a change to the UAPI header, but no functional change in the UAPI. From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221118170807.6505-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-11-17drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error modeYiPeng Chai1-0/+4
The patch is enabling mode-1 reset for RAS recovery in fatal error mode. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-16Merge tag 'drm-misc-next-2022-11-10-1' of ↵Dave Airlie1-0/+1
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 6.2: UAPI Changes: Cross-subsystem Changes: Core Changes: - atomic-helper: Add begin_fb_access and end_fb_access hooks - fb-helper: Rework to move fb emulation into helpers - scheduler: rework entity flush, kill and fini - ttm: Optimize pool allocations Driver Changes: - amdgpu: scheduler rework - hdlcd: Switch to DRM-managed resources - ingenic: Fix registration error path - lcdif: FIFO threshold tuning - meson: Fix return type of cvbs' mode_valid - ofdrm: multiple fixes (kconfig, types, endianness) - sun4i: A100 and D1 support - panel: - New Panel: Jadard JD9365DA-H3 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20221110083612.g63eaocoaa554soh@houat
2022-11-15drm/amdgpu: stop resubmittting jobs in amdgpu_pci_resumeChristian König1-2/+0
The state of VRAM is unreliable due to a PCI event like AER, link reset or DPC. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: stop resubmitting jobs for GPU reset v2Christian König1-5/+1
Re-submitting IBs by the kernel has many problems because pre- requisite state is not automatically re-created as well. In other words neither binary semaphores nor things like ring buffer pointers are in the state they should be when the hardware starts to work on the IBs again. Additional to that even after more than 5 years of developing this feature it is still not stable and we have massively problems getting the reference counts right. As discussed with user space developers this behavior is not helpful in the first place. For graphics and multimedia workloads it makes much more sense to either completely re-create the context or at least re-submitting the IBs from userspace. For compute use cases re-submitting is also not very helpful since userspace must rely on the accuracy of the result. Because of this we stop this practice and instead just properly note that the fence submission was canceled. The only use case we keep the re-submission for now is SRIOV and function level resets. v2: as suggested by Sshaoyun stop resubmitting jobs even for SRIOV Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: revert "implement tdr advanced mode"Christian König1-103/+0
This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. This is obviously incompatible with gang-submit which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired. For testing and debugging environments we should rather disable recovery alltogether to be able to inspect the state with a hw debugger. Additional to that the sw implementation is clearly buggy and causes reference count issues for the hardware fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: Fixed the problem that ras error can't be queried after gpu ↵YiPeng Chai1-0/+2
recovery is completed Amdgpu_ras_set_error_query_ready is called at the start of amdgpu_device_gpu_recover to disable query ras error, but the code behind only enables query ras error in full reset path, but not in soft reset path, emergency restart path and skip the hardware reset path. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: there is no vbios fb on devices with no display hw (v2)Alex Deucher1-0/+41
If we enable virtual display functionality on parts with no display hardware we can end up trying to check for and reserve the vbios FB area on devices where it doesn't exist. Check if display hardware is actually present on the hardware before trying to reserve the memory. v2: move the check into common code Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: clarify DC checksAlex Deucher1-15/+17
There are several places where we don't want to check if a particular asic could support DC, but rather, if DC is enabled. Set a flag if DC is enabled and check for that rather than if a device supports DC or not. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15drm/amdgpu: rework SR-IOV virtual display handlingAlex Deucher1-2/+11
virtual display is enabled unconditionally in SR-IOV, but without specifying the virtual_display module, the number of crtcs defaults to 0. Set a single display by default for SR-IOV if the virtual_display parameter is not set. Only enable virtual display by default on SR-IOV on asics which actually have display hardware. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-05drm/fb-helper: Remove unnecessary include statementsThomas Zimmermann1-0/+1
Remove include statements for <drm/drm_fb_helper.h> where it is not required (i.e., most of them). In a few places include other header files that are required by the source code. v3: * fix amdgpu include statements * fix rockchip include statements Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-23-tzimmermann@suse.de
2022-11-04drm/amdgpu: fix for suspend/resume sequence under sriovVictor Zhao1-0/+2
- clear kiq ring after suspend/resume under sriov to aviod kiq ring test failure - update irq after resume to fix kiq interrput loss Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04drm/amdgpu: Disable MCBP from soc21 for SRIOVYiqing Yao1-1/+1
[why] Start from soc21, CP does not support MCBP, so disable it. [how] Used amgpu_mcbp flag alone instead of checking if is in SRIOV to enable/disable MCBP. Only set flag to enable on asic_type prior to soc21 in SRIOV. Signed-off-by: Yiqing Yao <yiqing.yao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27drm/amd: Fail the suspend if resources can't be evictedMario Limonciello1-5/+10
If a system does not have swap and memory is under 100% usage, amdgpu will fail to evict resources. Currently the suspend carries on proceeding to reset the GPU: ``` [drm] evicting device resources failed [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12 [drm] free PSP TMR buffer [TTM] Failed allocating page table [drm] evicting device resources failed amdgpu 0000:03:00.0: amdgpu: MODE1 reset amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset ``` At this point if the suspend actually succeeded I think that amdgpu would have recovered because the GPU would have power cut off and restored. However the kernel fails to continue the suspend from the memory pressure and amdgpu fails to run the "resume" from the aborted suspend. ``` ACPI: PM: Preparing to enter system sleep state S3 SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO) cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0 node 0: slabs: 22, objs: 1122, free: 0 ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651) [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed! [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62 amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62 amdgpu 0000:03:00.0: PM: failed to resume async: error -62 ``` To avoid this series of unfortunate events, fail amdgpu's suspend when the memory eviction fails. This will let the system gracefully recover and the user can try suspend again when the memory pressure is relieved. Reported-by: post@davidak.de Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24amd/amdgpu: fix repeated words in commentswangjianli1-1/+1
Delete the redundant word 'the'. Signed-off-by: wangjianli <wangjianli@cdjrlc.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resumePrike Liang1-0/+16
In the S2idle suspend/resume phase the gfxoff is keeping functional so some IP blocks will be likely to reinitialize at gfxoff entry and that will result in failing to program GC registers.Therefore, let disallow gfxoff until AMDGPU IPs reinitialized completely. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24drm/amdgpu: skip mes self test for gc 11.0.3 in recoverYuBiao Wang1-1/+1
Temporary disable mes self teset for gc 11.0.3 during gpu_recovery. Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18drm/amd/pm: disable cstate feature for gpu reset scenarioEvan Quan1-0/+8
Suggested by PMFW team and same as what did for gfxoff feature. This can address some Mode1Reset failures observed on SMU13.0.0. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.0.x Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18Revert "drm/amdgpu: let mode2 reset fallback to default when failure"Victor Zhao1-6/+1
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17drm/amd/pm: disable cstate feature for gpu reset scenarioEvan Quan1-0/+8
Suggested by PMFW team and same as what did for gfxoff feature. This can address some Mode1Reset failures observed on SMU13.0.0. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.0.x
2022-10-17Revert "drm/amdgpu: let mode2 reset fallback to default when failure"Victor Zhao1-6/+1
This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdgpu: Add amdgpu suspend-resume code path under SRIOVBokun Zhang1-1/+26
- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor during the suspend time. Furthermore, we cannot request a mode 1 reset under SRIOV as VF. Therefore, we will skip it as it is called in suspend_noirq() function. - In the resume code path, we need to send REQ_GPU_INIT to the hypervisor and also resume PSP IP block under SRIOV. Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdgpu: Use simplified API for p2p dist calcLijo Lazar1-3/+3
Use the simpified API that calculates distance between two devices. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29drm/amdgpu: Disable verbose for p2p dist calcLijo Lazar1-1/+1
Disable verbose while getting p2p distance. With verbose, it shows warning if ACS redirect is set between the devices. Adds noise to dmesg logs when a few GPU devices are on the same platform. Example log: amdgpu 0000:34:00.0: ACS redirect is set between the client and provider (0000:31:00.0) amdgpu 0000:34:00.0: to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=0000:30:00.0;0000:2e:00.0;0000:33:00.0;0000:2e:10.0 Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-20drm/amdgpu: add gang submit backend v2Christian König1-0/+35
Allows submitting jobs as gang which needs to run on multiple engines at the same time. Basic idea is that we have a global gang submit fence representing when the gang leader is finally pushed to run on the hardware last. Jobs submitted as gang are never re-submitted in case of a GPU reset since this won't work and will just deadlock the hardware immediately again. v2: fix logic inversion, improve documentation, fix rcu Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19drm/amdgpu: Fixed psp fence and memory issues when removing amdgpu deviceYiPeng Chai1-1/+4
V3: Fixed psp fence and memory issues for the asic using smu v13_0_2 when removing amdgpu device. [Why]: 1. psp_suspend->psp_free_shared_bufs-> psp_ta_free_shared_buf-> amdgpu_bo_free_kernel-> ...->amdgpu_bo_release_notify-> amdgpu_fill_buffer psp will free vram memory used by psp when psp_suspend is called. But for the asic using smu v13_0_2, because psp_suspend is called before adev->shutdown is set to true when removing the first hive device, amdgpu fill_buffer will be called, which will cause fence issues when evicting all vram resources in amdgpu vram mgr_fini. 2. Since psp_hw_fini is not called after calling psp_suspend and psp_suspend only calls psp_ring_stop, the psp ring memory will not be released when amdgpu device is removed. [How]: 1. Set shutdown to true before calling amdgpu_device_gpu_recover, then amdgpu_fill_buffer will not be called when psp_suspend is called. 2. Free psp ring memory in psp_sw_fini. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19drm/amdgpu: Adjust removal control flow for smu v13_0_2YiPeng Chai1-0/+30
Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. V3: 1. Update commit comments. 2. Split a patch into multiple patches. 3. The current patch does: a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into the existing gpu recover path, which make all devices in hive list only have HW reset but no resume (except the base IP). b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover in amdgpu_pci_remove when removing the first device in hive list. c. When removing the first device, the IP blocks keyword function call sequence is as follows: .suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini. ^ | |-<----------<---------<----| The first three sequences are because of a call to amdgpu_device_gpu_recover. The three sequences will be executed in a loop until all devices in the hive list are iterated. The sequences starting from .hw_fini only apply to the first device. Since .suspend has been called before, except the resumed phase1 basic ip blocks, all other ip blocks .hw_fini of current device will do nothing. d. When removing other devices, the calling sequences is the same as legacy: .hw_fini -> .early_fini -> .sw_fini. Since .suspend has been called when removing the first device, except the resumed phase1 basic ip blocks, all of other ip blocks .hw_fini of current device will do nothing. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-14drm/amdgpu: make sure to init common IP before gmcAlex Deucher1-3/+11
Move common IP init before GMC init so that HDP gets remapped before GMC init which uses it. This fixes the Unsupported Request error reported through AER during driver load. The error happens as a write happens to the remap offset before real remapping is done. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373 The error was unnoticed before and got visible because of the commit referenced below. This doesn't fix anything in the commit below, rather fixes the issue in amdgpu exposed by the commit. The reference is only to associate this commit with below one so that both go together. Fixes: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()") Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdgpu: Fix hive reference count leakVignesh Chander1-2/+2
both get_xgmi_hive and put_xgmi_hive can be skipped since the reset domain is not necessary for VF Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13drm/amdgpu: Use per device reset_domain for XGMI on sriov configurationshaoyunl1-9/+11
For SRIOV configuration, host driver control the reset method(either FLR or heavier chain reset). The host will notify the guest individually with FLR message if individual GPU within the hive need to be reset. So for guest side, no need to use hive->reset_domain to replace the original per device reset_domain Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-07drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is ↵YiPeng Chai1-2/+1
uninstalled V1: The psp_cmd_submit_buf function is called by psp_hw_fini to send TA unload messages to psp to terminate ras, asd and tmr. But when amdgpu is uninstalled, drm_dev_unplug is called earlier than psp_hw_fini in amdgpu_pci_remove, the calling order as follows: static void amdgpu_pci_remove(struct pci_dev *pdev) { drm_dev_unplug ...... amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... ->.hw_fini->psp_hw_fini->... ->psp_ta_unload->psp_cmd_submit_buf ...... } The program will return when calling drm_dev_enter in psp_cmd_submit_buf. So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, so that the TA unload messages can be sent to the psp when amdgpu is uninstalled. V2: 1. Restore psp_cmd_submit_buf to its original code. 2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. 3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms, remove the unplug check to release device mmio resource in amdgpu_device_fini_hw before calling drm_dev_unplug. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-29drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinksAlex Sierra1-1/+2
[Why] Devices with CPU XGMI iolink do not support PCIe peer access. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25drm/amd/amdgpu: avoid soft reset check when gpu recovery disabledChengming Gui1-5/+6
Avoid soft reset, even ip hang check (ring/ib test) when gpu recovery disabled. v2: add missing "}" Signed-off-by: Chengming Gui <Jack.Gui@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22drm/amdgpu: Remove the additional kfd pre reset call for sriovshaoyunl1-2/+0
The additional call is caused by merge conflict Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22drm/amdgpu: fix hive reference leak when adding xgmi deviceYiPeng Chai1-0/+2
Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive which will leak the hive reference. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16drm/amd: Add detailed GFXOFF stats to debugfsAndré Almeida1-0/+2
Add debugfs interface to log GFXOFF statistics: - Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the time of query since system power-up - Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop. Read it to get average GFXOFF residency % multiplied by 100 during the last logging interval. Both features are designed to be keep the values persistent between suspends. Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16drm/amdgpu: revert context to stop engine before mode2 resetVictor Zhao1-0/+1
For some hang caused by slow tests, engine cannot be stopped which may cause resume failure after reset. In this case, force halt engine by reverting context addresses Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>