aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2021-12-31Merge branch 'drm-misc-fixes' of ssh://git.freedesktop.org/git/drm/drm-misc ↵Dave Airlie1-1/+1
into drm-fixes This merges two fixes that haven't been sent to me yet, but I wanted to get in. One amdgpu fix, but one nouveau regression fixer. Signed-off-by: Dave Airlie <[email protected]>
2021-12-31Merge tag 'amd-drm-next-5.17-2021-12-30' of ↵Dave Airlie23-354/+574
ssh://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-5.17-2021-12-30: amdgpu: - Suspend/resume fixes - Fence fix - Misc code cleanups - IP discovery fixes - SRIOV fixes - RAS fixes - GMC 8 VRAM detection fix - FRU fixes for Aldebaran - Display fixes amdkfd: - SVM fixes - IP discovery fixes Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-12-30drm/amdgpu: no DC support for headless chipsAlex Deucher1-0/+6
Chips with no display hardware should return false for DC support. v2: drop Arcturus and Aldebaran Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support") Reviewed-by: Evan Quan <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reported-by: Tareque Md.Hanif <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: always reset the asic in suspend (v2)Alex Deucher1-1/+4
If the platform suspend happens to fail and the power rail is not turned off, the GPU will be in an unknown state on resume, so reset the asic so that it will be in a known good state on resume even if the platform suspend failed. v2: handle s0ix Acked-by: Luben Tuikov <[email protected]> Acked-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable ↵Evan Quan1-0/+15
platform By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups and put itself into a state ready for PNP. That can workaround some random resuming failure observed on BOCO capable platforms. Signed-off-by: Evan Quan <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: fix runpm documentationAlex Deucher1-3/+4
It's not only supported by HG/PX laptops. It's supported by all dGPUs which supports BOCO/BACO functionality (runtime D3). BOCO - Bus Off, Chip Off. The entire chip is powered off. This is controlled by ACPI. BACO - Bus Active, Chip Off. The chip still shows up on the PCI bus, but the device itself is powered down. v2: fix missed HG/PX reference Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: save error count in RAS poison handlerTao Zhou3-76/+97
Otherwise the RAS error count couldn't be queried from sysfs. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: drop redundant semicolonGuchun Chen1-1/+1
A minor typo. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: Check the memory can be accesssed by ttm_device_clear_dma_mappings.Surbhi Kakarya1-1/+2
If the event guard is enabled and VF doesn't receive an ack from PF for full access, the guest driver load crashes. This is caused due to the call to ttm_device_clear_dma_mappings with non-initialized mman during driver tear down. This patch adds the necessary condition to check if the mman initialization passed or not and takes the path based on the condition output. Signed-off-by: Surbhi Kakarya <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: Access the FRU on AldebaranKent Russell1-3/+10
This is supported, although the offset is different from VG20, so fix that with a variable and enable getting the product name and serial number from the FRU. Do this for all SKUs since all SKUs have the FRU Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: Increase potential product_name to 64 charactersKent Russell2-8/+7
Having seen at least 1 42-character product_name, bump the number up to 64, and put that definition into amdgpu.h to make future adjustments simpler. Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: Remove the redundant code of psp bootloader functionsyipechai1-63/+15
The psp bootloader functions code of psp_v13_0.c had been optimized before. According the code style of psp_v13_0.c to remove the redundant code of psp_v11_0.c. v2: squash in drop unused variable (Alex) Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-30drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to ↵Leslie Shi1-1/+3
prevent crash in GPU initialization failure [Why] In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the following release callback, driver stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs. [How] call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in vcn_v3_0_sw_fini() when GPU initialization failure. Signed-off-by: Leslie Shi <[email protected]> Reviewed-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: fixup bad vram size on gmc v8Zongmin Zhou1-3/+10
Some boards(like RX550) seem to have garbage in the upper 16 bits of the vram size register. Check for this and clamp the size properly. Fixes boards reporting bogus amounts of vram. after add this patch,the maximum GPU VRAM size is 64GB, otherwise only 64GB vram size will be used. Signed-off-by: Zongmin Zhou<[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handlingsashank saye1-5/+4
For Aldebaran chip passthrough case we need to intimate SMU about special handling for SBR.On older chips we send LightSBR to SMU, enabling the same for Aldebaran. Slight difference, compared to previous chips, is on Aldebaran, SMU would do a heavy reset on SBR. Hence, the word Heavy instead of Light SBR is used for SMU to differentiate. Reviewed by: Shaoyun.liu <[email protected]> Signed-off-by: sashank saye <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Don't inherit GEM object VMAs in child processRajneesh Bhardwaj1-0/+3
When an application having open file access to a node forks, its shared mappings also get reflected in the address space of child process even though it cannot access them with the object permissions applied. With the existing permission checks on the gem objects, it might be reasonable to also create the VMAs with VM_DONTCOPY flag so a user space application doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK) system call to prevent the pages in the mapped range to appear in the address space of the child process. It also prevents the memory leaks due to additional reference counts on the mapped BOs in the child process that prevented freeing the memory in the parent for which we had worked around earlier in the user space inside the thunk library. Additionally, we faced this issue when using CRIU to checkpoint restore an application that had such inherited mappings in the child which confuse CRIU when it mmaps on restore. Having this flag set for the render node VMAs helps. VMAs mapped via KFD already take care of this so this is needed only for the render nodes. To limit the impact of the change to user space consumers such as OpenGL etc, limit it to KFD BOs only. Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: David Yat Sin <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdkfd: reset queue which consumes RAS poison (v2)Tao Zhou2-4/+5
CP supports unmap queue with reset mode which only destroys specific queue without affecting others. Replacing whole gpu reset with reset queue mode for RAS poison consumption saves much time, and we can also fallback to gpu reset solution if reset queue fails. v2: Return directly if process is NULL; Reset queue solution is not applicable to SDMA, fallback to legacy way; Call kfd_unref_process after lookup process. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: add gpu reset control for umc page retirementTao Zhou2-5/+15
Add a reset parameter for umc page retirement, let user decide whether call gpu reset in umc page retirement. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Modify indirect register access for gfx9 sriovVictor Skvortsov1-28/+89
Expand RLCG interface for new GC read & write commands. New interface will only be used if the PF enables the flag in pf2vf msg. v2: Added a description for the scratch registers Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: David Nieto <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: get xgmi info before ip_initVictor Skvortsov3-12/+7
Driver needs to call get_xgmi_info() before ip_init to determine whether it needs to handle a pending hive reset. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: David Nieto <[email protected]> Reviewed by: shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Modify indirect register access for amdkfd_gfx_v9 sriovVictor Skvortsov1-14/+13
Modify GC register access from MMIO to RLCG if the indirect flag is set Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: David Nieto <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Modify indirect register access for gmc_v9_0 sriovVictor Skvortsov1-14/+43
Modify GC register access from MMIO to RLCG if the indirect flag is set v2: Replaced ternary operator with if-else for better readability Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: David Nieto <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Add *_SOC15_IP_NO_KIQ() macro definitionsVictor Skvortsov1-0/+5
Add helper macros to change register access from direct to indirect. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: David Nieto <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: Filter security violation registersBokun Zhang1-37/+46
Recently, there is security policy update under SRIOV. We need to filter the registers that hit the violation and move the code to the host driver side so that the guest driver can execute correctly. Signed-off-by: Bokun Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-28drm/amdgpu: no DC support for headless chipsAlex Deucher1-0/+6
Chips with no display hardware should return false for DC support. v2: drop Arcturus and Aldebaran Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support") Reviewed-by: Evan Quan <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reported-by: Tareque Md.Hanif <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-27drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable ↵Evan Quan1-0/+15
platform By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups and put itself into a state ready for PNP. That can workaround some random resuming failure observed on BOCO capable platforms. Signed-off-by: Evan Quan <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-27drm/amdgpu: always reset the asic in suspend (v2)Alex Deucher1-1/+4
If the platform suspend happens to fail and the power rail is not turned off, the GPU will be in an unknown state on resume, so reset the asic so that it will be in a known good state on resume even if the platform suspend failed. v2: handle s0ix Acked-by: Luben Tuikov <[email protected]> Acked-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-22drm/amdgpu: fix runpm documentationAlex Deucher1-3/+4
It's not only supported by HG/PX laptops. It's supported by all dGPUs which supports BOCO/BACO functionality (runtime D3). BOCO - Bus Off, Chip Off. The entire chip is powered off. This is controlled by ACPI. BACO - Bus Active, Chip Off. The chip still shows up on the PCI bus, but the device itself is powered down. v2: fix missed HG/PX reference Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-23Merge tag 'amd-drm-next-5.17-2021-12-16' of ↵Dave Airlie52-226/+467
https://gitlab.freedesktop.org/agd5f/linux into drm-next amdgpu: - Add some display debugfs entries - RAS fixes - SR-IOV fixes - W=1 fixes - Documentation fixes - IH timestamp fix - Misc power fixes - IP discovery fixes - Large driver documentation updates - Multi-GPU memory use reductions - Misc display fixes and cleanups - Add new SMU debug option amdkfd: - SVM fixes radeon: - Fix typo in comment From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-12-22x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumerationYazen Ghannam1-1/+1
AMD systems currently lay out MCA bank types such that the type of bank number "i" is either the same across all CPUs or is Reserved/Read-as-Zero. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS CS 3 SMU RAZ Future AMD systems will lay out MCA bank types such that the type of bank number "i" may be different across CPUs. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS NBIO 3 SMU RAZ Change the structures that cache MCA bank types to be per-CPU and update smca_get_bank_type() to handle this change. Move some SMCA-specific structures to amd.c from mce.h, since they no longer need to be global. Break out the "count" for bank types from struct smca_hwid, since this should provide a per-CPU count rather than a system-wide count. Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The values in this array should not change at runtime. Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-17drm/amdgpu: add support for IP discovery gc_info table v2Alex Deucher1-22/+54
Used on gfx9 based systems. Fixes incorrect CU counts reported in the kernel log. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833 Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-12-17drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly ↵chen gong1-0/+7
enabled Play a video on the raven (or PCO, raven2) platform, and then do the S3 test. When resume, the following error will be reported: amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec test failed (-110) [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v1_0> failed -110 amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110 [why] When playing the video: The power state flag of the vcn block is set to POWER_STATE_ON. When doing suspend: There is no change to the power state flag of the vcn block, it is still POWER_STATE_ON. When doing resume: Need to open the power gate of the vcn block and set the power state flag of the VCN block to POWER_STATE_ON. But at this time, the power state flag of the vcn block is already POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm: avoid duplicate powergate/ungate setting" patch will return the amdgpu_dpm_set_powergating_by_smu function directly. As a result, the gate of the power was not opened, causing the subsequent ring test to fail. [how] In the suspend function of the vcn block, explicitly change the power state flag of the vcn block to POWER_STATE_OFF. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828 Signed-off-by: chen gong <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-12-17drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fenceHuang Rui3-51/+90
The job embedded fence donesn't initialize the flags at dma_fence_init(). Then we will go a wrong way in amdgpu_fence_get_timeline_name callback and trigger a null pointer panic once we enabled the trace event here. So introduce new amdgpu_fence object to indicate the job embedded fence. [ 156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0 [ 156.131804] #PF: supervisor read access in kernel mode [ 156.131811] #PF: error_code(0x0000) - not-present page [ 156.131817] PGD 0 P4D 0 [ 156.131824] Oops: 0000 [#1] PREEMPT SMP PTI [ 156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G OE 5.16.0-rc1-custom #1 [ 156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016 [ 156.131848] RIP: 0010:strlen+0x0/0x20 [ 156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 [ 156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206 [ 156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b [ 156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0 [ 156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000 [ 156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0 [ 156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007 [ 156.131914] FS: 0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000 [ 156.131923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0 [ 156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 156.131949] Call Trace: [ 156.131953] <TASK> [ 156.131957] ? trace_event_raw_event_dma_fence+0xcc/0x200 [ 156.131973] ? ring_buffer_unlock_commit+0x23/0x130 [ 156.131982] dma_fence_init+0x92/0xb0 [ 156.131993] amdgpu_fence_emit+0x10d/0x2b0 [amdgpu] [ 156.132302] amdgpu_ib_schedule+0x2f9/0x580 [amdgpu] [ 156.132586] amdgpu_job_run+0xed/0x220 [amdgpu] v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot) Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-17drm/amdgpu: fix dropped backing store handling in amdgpu_dma_buf_move_notifyChristian König1-1/+1
bo->tbo.resource can now be NULL. Signed-off-by: Christian König <[email protected]> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1811 Acked-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-12-16drm/amdgpu: add support for IP discovery gc_info table v2Alex Deucher1-22/+54
Used on gfx9 based systems. Fixes incorrect CU counts reported in the kernel log. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833 Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-16drm/amdgpu: Separate vf2pf work item init from virt data exchangeVictor Skvortsov3-13/+30
We want to be able to call virt data exchange conditionally after gmc sw init to reserve bad pages as early as possible. Since this is a conditional call, we will need to call it again unconditionally later in the init sequence. Refactor the data exchange function so it can be called multiple times without re-initializing the work item. v2: Cleaned up the code. Kept the original call to init_exchange_data() inside early init to initialize the work item, afterwards call exchange_data() when needed. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed By: Shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-16drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly ↵chen gong1-0/+7
enabled Play a video on the raven (or PCO, raven2) platform, and then do the S3 test. When resume, the following error will be reported: amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec test failed (-110) [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v1_0> failed -110 amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110 [why] When playing the video: The power state flag of the vcn block is set to POWER_STATE_ON. When doing suspend: There is no change to the power state flag of the vcn block, it is still POWER_STATE_ON. When doing resume: Need to open the power gate of the vcn block and set the power state flag of the VCN block to POWER_STATE_ON. But at this time, the power state flag of the vcn block is already POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm: avoid duplicate powergate/ungate setting" patch will return the amdgpu_dpm_set_powergating_by_smu function directly. As a result, the gate of the power was not opened, causing the subsequent ring test to fail. [how] In the suspend function of the vcn block, explicitly change the power state flag of the vcn block to POWER_STATE_OFF. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828 Signed-off-by: chen gong <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-16drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fenceHuang Rui3-51/+90
The job embedded fence donesn't initialize the flags at dma_fence_init(). Then we will go a wrong way in amdgpu_fence_get_timeline_name callback and trigger a null pointer panic once we enabled the trace event here. So introduce new amdgpu_fence object to indicate the job embedded fence. [ 156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0 [ 156.131804] #PF: supervisor read access in kernel mode [ 156.131811] #PF: error_code(0x0000) - not-present page [ 156.131817] PGD 0 P4D 0 [ 156.131824] Oops: 0000 [#1] PREEMPT SMP PTI [ 156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G OE 5.16.0-rc1-custom #1 [ 156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016 [ 156.131848] RIP: 0010:strlen+0x0/0x20 [ 156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 [ 156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206 [ 156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b [ 156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0 [ 156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000 [ 156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0 [ 156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007 [ 156.131914] FS: 0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000 [ 156.131923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0 [ 156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 156.131949] Call Trace: [ 156.131953] <TASK> [ 156.131957] ? trace_event_raw_event_dma_fence+0xcc/0x200 [ 156.131973] ? ring_buffer_unlock_commit+0x23/0x130 [ 156.131982] dma_fence_init+0x92/0xb0 [ 156.131993] amdgpu_fence_emit+0x10d/0x2b0 [amdgpu] [ 156.132302] amdgpu_ib_schedule+0x2f9/0x580 [amdgpu] [ 156.132586] amdgpu_job_run+0xed/0x220 [amdgpu] v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot) Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-16Merge drm/drm-next into drm-misc-nextThomas Zimmermann46-1190/+1023
Backmerging for v5.16-rc5. Resolves a conflict between drm-misc-next and drm-misc-fixes in the vc4 driver. Signed-off-by: Thomas Zimmermann <[email protected]>
2021-12-14drm/amdgpu: correct the wrong cached state for GMC on PICASSOEvan Quan2-4/+12
Pair the operations did in GMC ->hw_init and ->hw_fini. That can help to maintain correct cached state for GMC and avoid unintention gate operation dropping due to wrong cached state. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828 Signed-off-by: Evan Quan <[email protected]> Acked-by: Guchun Chen <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amdgpu: don't override default ECO_BITs settingHawking Zhang8-9/+0
Leave this bit as hardware default setting Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-12-14drm/amdgpu: correct register access for RLC_JUMP_TABLE_RESTORELe Ma1-2/+2
should count on GC IP base address Signed-off-by: Le Ma <[email protected]> Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2021-12-14amdgpu: fix some comment typosYann Dirson2-2/+2
Signed-off-by: Yann Dirson <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14amdgpu: fix some kernel-doc markupYann Dirson1-3/+3
Those are not today pulled by the sphinx doc, but better be ready. Signed-off-by: Yann Dirson <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amd/amdgpu: fix gmc bo pin count leak in SRIOVJingwen Chen2-0/+8
[Why] gmc bo will be pinned during loading amdgpu and reset in SRIOV while only unpinned in unload amdgpu [How] add amdgpu_in_reset and sriov judgement to skip pin bo v2: fix wrong judgement Signed-off-by: Jingwen Chen <[email protected]> Reviewed-by: Horace Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amd/amdgpu: fix psp tmr bo pin count leak in SRIOVJingwen Chen1-0/+4
[Why] psp tmr bo will be pinned during loading amdgpu and reset in SRIOV while only unpinned in unload amdgpu [How] add amdgpu_in_reset and sriov judgement to skip pin bo v2: fix wrong judgement Signed-off-by: Jingwen Chen <[email protected]> Reviewed-by: Horace Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amdgpu: correct the wrong cached state for GMC on PICASSOEvan Quan2-4/+12
Pair the operations did in GMC ->hw_init and ->hw_fini. That can help to maintain correct cached state for GMC and avoid unintention gate operation dropping due to wrong cached state. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828 Signed-off-by: Evan Quan <[email protected]> Acked-by: Guchun Chen <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amdgpu: use adev_to_drm to get drm_device pointerGuchun Chen1-1/+1
Updated for consistency when accessing drm_device from amdgpu driver. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amdgpu: move smu_debug_mask to a more proper placeEvan Quan1-1/+1
As the smu_context will be invisible from outside(of power). Also, the smu_debug_mask can be shared around all power code instead of some specific framework(swSMU) only. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2021-12-14drm/amdgpu: SRIOV flr_work should use down_writeVictor Skvortsov2-4/+6
Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed by: Shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>