aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2022-11-29drm/amdgpu: fix stall on CPU when allocate large system memoryJames Zhu1-15/+35
-v2: 1. rename variable to redue confuse 2. optimize the code -v3: move new define out of the middle of the code -v4: squash in minmax error fix (Luben) When applications try to allocate large system (more than > 128GB), "stall cpu" is reported. for such large system memory, walk_page_range takes more than 20s usually. The warning message can be removed when splitting hmm range into smaller ones which is not more 64GB for each walk_page_range. [ 164.437617] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1753: amdgpu: create BO VA 0x7f63c7a00000 size 0x2f16000000 domain CPU [ 164.488847] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1785: amdgpu: creating userptr BO for user_addr = 7f63c7a00000 [ 185.439116] rcu: INFO: rcu_sched self-detected stall on CPU [ 185.439125] rcu: 8-....: (20999 ticks this GP) idle=e22/1/0x4000000000000000 softirq=2242/2242 fqs=5249 [ 185.439137] (t=21000 jiffies g=6325 q=1215) [ 185.439141] NMI backtrace for cpu 8 [ 185.439143] CPU: 8 PID: 3470 Comm: kfdtest Kdump: loaded Tainted: G O 5.12.0-0_fbk5_zion_rc1_5697_g2c723fb88626 #1 [ 185.439147] Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 11/06/2020 [ 185.439150] Call Trace: [ 185.439153] <IRQ> [ 185.439157] dump_stack+0x64/0x7c [ 185.439163] nmi_cpu_backtrace.cold.7+0x30/0x65 [ 185.439165] ? lapic_can_unplug_cpu+0x70/0x70 [ 185.439170] nmi_trigger_cpumask_backtrace+0xf9/0x100 [ 185.439174] rcu_dump_cpu_stacks+0xc5/0xf5 [ 185.439178] rcu_sched_clock_irq.cold.97+0x112/0x38c [ 185.439182] ? tick_sched_handle.isra.21+0x50/0x50 [ 185.439185] update_process_times+0x8c/0xc0 [ 185.439189] tick_sched_timer+0x63/0x70 [ 185.439192] __hrtimer_run_queues+0xff/0x250 [ 185.439195] hrtimer_interrupt+0xf4/0x200 [ 185.439199] __sysvec_apic_timer_interrupt+0x51/0xd0 [ 185.439201] sysvec_apic_timer_interrupt+0x69/0x90 [ 185.439206] </IRQ> [ 185.439207] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 185.439211] RIP: 0010:clear_page_rep+0x7/0x10 [ 185.439214] Code: e8 fe 7c 51 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 [ 185.439218] RSP: 0018:ffffc9000f58f818 EFLAGS: 00000246 [ 185.439220] RAX: 0000000000000000 RBX: 0000000000000881 RCX: 000000000000005c [ 185.439223] RDX: 0000000000100dca RSI: 0000000000000000 RDI: ffff88a59e0e5d20 [ 185.439225] RBP: ffffea0096783940 R08: ffff888118c35280 R09: ffffea0096783940 [ 185.439227] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0096783980 [ 185.439228] R13: ffffea0096783940 R14: ffff88b07fdfdd00 R15: 0000000000000000 [ 185.439232] prep_new_page+0x81/0xc0 [ 185.439236] get_page_from_freelist+0x13be/0x16f0 [ 185.439240] ? release_pages+0x16a/0x4a0 [ 185.439244] __alloc_pages_nodemask+0x1ae/0x340 [ 185.439247] alloc_pages_vma+0x74/0x1e0 [ 185.439251] __handle_mm_fault+0xafe/0x1360 [ 185.439255] handle_mm_fault+0xc3/0x280 [ 185.439257] hmm_vma_fault.isra.22+0x49/0x90 [ 185.439261] __walk_page_range+0x692/0x9b0 [ 185.439265] walk_page_range+0x9b/0x120 [ 185.439269] hmm_range_fault+0x4f/0x90 [ 185.439274] amdgpu_hmm_range_get_pages+0x24f/0x260 [amdgpu] [ 185.439463] amdgpu_ttm_tt_get_user_pages+0xc2/0x190 [amdgpu] [ 185.439603] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x49f/0x7a0 [amdgpu] [ 185.439774] kfd_ioctl_alloc_memory_of_gpu+0xfb/0x410 [amdgpu] Signed-off-by: James Zhu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-25Merge tag 'amd-drm-fixes-6.1-2022-11-23' of ↵Dave Airlie14-74/+71
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.1-2022-11-23: amdgpu: - DCN 3.1.4 fixes - DP MST DSC deadlock fixes - HMM userptr fixes - Fix Aldebaran CU occupancy reporting - GFX11 fixes - PSP suspend/resume fix - DCE12 KASAN fix - DCN 3.2.x fixes - Rotated cursor fix - SMU 13.x fix - DELL platform suspend/resume fixes - VCN4 SR-IOV fix - Display regression fix for polled connectors Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-25Merge tag 'drm-misc-fixes-2022-11-24' of ↵Dave Airlie1-3/+3
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes drm-misc-fixes for v6.1-rc7: - Another amdgpu gang submit fix. - Use dma_fence_unwrap_for_each when importing sync files. - Fix race in dma_heap_add(). - Fix use of uninitialized memory in logo. Signed-off-by: Dave Airlie <[email protected]> From: Maarten Lankhorst <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-24Backmerge tag 'v6.1-rc6' into drm-nextDave Airlie4-7/+16
Linux 6.1-rc6 This is needed for drm-misc-next and tegra. Signed-off-by: Dave Airlie <[email protected]>
2022-11-23drm/amdgpu: Partially revert "drm/amdgpu: update drm_display_info correctly ↵Alex Deucher1-1/+0
when the edid is read" This partially reverts 20543be93ca45968f344261c1a997177e51bd7e1. Calling drm_connector_update_edid_property() in amdgpu_connector_free_edid() causes a noticeable pause in the system every 10 seconds on polled outputs so revert this part of the change. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2257 Cc: Claudio Suarez <[email protected]> Acked-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: enable RAS poison for VCN 2.6Tao Zhou1-0/+29
Configure related settings to enable it. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: fix for suspend/resume kiq fence fallback under sriovShikang Fan1-11/+12
- in device_resume, sriov configure interrupt should be in full access, so release_full_gpu should be done after kfd_resume. - remove the previous workaround solution for sriov. Fixes: ec4927d463cb ("drm/amdgpu: fix for suspend/resume sequence under sriov") Signed-off-by: Shikang Fan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: fix unused-function errorRen Zhijie1-0/+2
If CONFIG_DRM_AMDGPU=y and CONFIG_DRM_AMD_DC is not set, gcc complained about unused-function : drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c:1705:13: error: ‘amdgpu_discovery_set_sriov_display’ defined but not used [-Werror=unused-function] static void amdgpu_discovery_set_sriov_display(struct amdgpu_device *adev) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors To fix this error, use CONFIG_DRM_AMD_DC to wrap the definition of amdgpu_discovery_set_sriov_display(). Fixes: 25263da37693 ("drm/amdgpu: rework SR-IOV virtual display handling") Signed-off-by: Ren Zhijie <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: fix use-after-free during gpu recoveryStanley.Yang1-1/+6
[Why] [ 754.862560] refcount_t: underflow; use-after-free. [ 754.862898] Call Trace: [ 754.862903] <TASK> [ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu] [ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched] [How] The fw_fence may be not init, check whether dma_fence_init is performed before job free Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: update documentation of parameter amdgpu_gtt_sizeZhenGuo Yin1-2/+1
Fixes: f7ba887f606b ("drm/amdgpu: Adjust logic around GTT size (v3)") Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: ZhenGuo Yin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu/vcn: re-use original vcn0 doorbell valueJane Jian2-9/+1
root cause that S2A need to use deduct offset flag. after setting this flag, vcn0 doorbell value works. so return it as before Signed-off-by: Jane Jian <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: fix pci device refcount leakYang Yingliang1-0/+4
As comment of pci_get_domain_bus_and_slot() says, it returns a pci device with refcount increment, when finish using it, the caller must decrement the reference count by calling pci_dev_put(). So before returning from amdgpu_device_resume|suspend_display_audio(), pci_dev_put() is called to avoid refcount leak. Fixes: 3f12acc8d6d4 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3") Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Yang Yingliang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu/psp: don't free PSP buffers on suspendAlex Deucher1-7/+9
We can reuse the same buffers on resume. v2: squash in S4 fix from Shikai Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2213 Reviewed-by: Christian König <[email protected]> Tested-by: Guilherme G. Piccoli <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-11-23drm/amd/amdgpu: reserve vm invalidation engine for firmwareJack Xiao1-0/+6
If mes enabled, reserve VM invalidation engine 5 for firmware. Signed-off-by: Jack Xiao <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu/vcn: re-use original vcn0 doorbell valueJane Jian2-9/+1
root cause that S2A need to use deduct offset flag. after setting this flag, vcn0 doorbell value works. so return it as before Signed-off-by: Jane Jian <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu: Partially revert "drm/amdgpu: update drm_display_info correctly ↵Alex Deucher1-1/+0
when the edid is read" This partially reverts 20543be93ca45968f344261c1a997177e51bd7e1. Calling drm_connector_update_edid_property() in amdgpu_connector_free_edid() causes a noticeable pause in the system every 10 seconds on polled outputs so revert this part of the change. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2257 Cc: Claudio Suarez <[email protected]> Acked-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-11-23drm/amdgpu: fix use-after-free during gpu recoveryStanley.Yang1-1/+5
[Why] [ 754.862560] refcount_t: underflow; use-after-free. [ 754.862898] Call Trace: [ 754.862903] <TASK> [ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu] [ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched] [How] The fw_fence may be not init, check whether dma_fence_init is performed before job free Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-23drm/amdgpu/psp: don't free PSP buffers on suspendAlex Deucher1-7/+9
We can reuse the same buffers on resume. v2: squash in S4 fix from Shikai Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2213 Reviewed-by: Christian König <[email protected]> Tested-by: Guilherme G. Piccoli <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-11-22Merge tag 'amd-drm-next-6.2-2022-11-18' of ↵Dave Airlie66-789/+1035
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.2-2022-11-18: amdgpu: - SR-IOV fixes - Clean up DC checks - DCN 3.2.x fixes - DCN 3.1.x fixes - Don't enable degamma on asics which don't support it - IP discovery fixes - BACO fixes - Fix vbios allocation handling when vkms is enabled - Drop buggy tdr advanced mode GPU reset handling - Fix the build when DCN is not set in kconfig - MST DSC fixes - Userptr fixes - FRU and RAS EEPROM fixes - VCN 4.x RAS support - Aldrebaran CU occupancy reporting fix - PSP ring cleanup amdkfd: - Memory limit fix - Enable cooperative launch on gfx 10.3 amd-drm-next-6.2-2022-11-11: amdgpu: - SMU 13.x updates - GPUVM TLB race fix - DCN 3.1.4 updates - DCN 3.2.x updates - PSR fixes - Kerneldoc fix - Vega10 fan fix - GPUVM locking fixes in error pathes - BACO fix for Beige Goby - EEPROM I2C address cleanup - GFXOFF fix - Fix DC memory leak in error pathes - Flexible array updates - Mtype fix for GPUVM PTEs - Move Kconfig into amdgpu directory - SR-IOV updates - Fix possible memory leak in CS IOCTL error path amdkfd: - Fix possible memory overrun - CRIU fixes radeon: - ACPI ref count fix - HDA audio notifier support - Move Kconfig into radeon directory UAPI: - Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI. These are used internally in the driver to align location based memory coherency requirements from memory allocated in the KFD with how we manage GPUVM PTEs. They are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now. They are just used internally in the kernel driver for now for existing KFD memory allocations. So a change to the UAPI header, but no functional change in the UAPI. From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Dave Airlie <[email protected]>
2022-11-21drm/amd/amdgpu: reserve vm invalidation engine for firmwareJack Xiao1-0/+6
If mes enabled, reserve VM invalidation engine 5 for firmware. Signed-off-by: Jack Xiao <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.0.x
2022-11-21drm/amdgpu: Enable Aldebaran devices to report CU OccupancyRamesh Errabolu1-0/+1
Allow user to know number of compute units (CU) that are in use at any given moment. Enable access to the method kgd_gfx_v9_get_cu_occupancy that computes CU occupancy. Signed-off-by: Ramesh Errabolu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2022-11-21drm/amdgpu: fix userptr HMM range handling v2Christian König7-51/+46
The basic problem here is that it's not allowed to page fault while holding the reservation lock. So it can happen that multiple processes try to validate an userptr at the same time. Work around that by putting the HMM range object into the mutex protected bo list for now. v2: make sure range is set to NULL in case of an error Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-21drm/amdgpu: always register an MMU notifier for userptrChristian König1-5/+3
Since switching to HMM we always need that because we no longer grab references to the pages. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-18drm/amdgpu: handle gang submit before VMIDChristian König1-3/+3
Otherwise it can happen that not all gang members can get a VMID assigned and we deadlock. Signed-off-by: Christian König <[email protected]> Tested-by: Timur Kristóf <[email protected]> Acked-by: Timur Kristóf <[email protected]> Fixes: 68ce8b242242 ("drm/amdgpu: add gang submit backend v2") Reviewed-by: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-18Merge tag 'amd-drm-fixes-6.1-2022-11-16' of ↵Dave Airlie6-6/+49
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.1-2022-11-16: amdgpu: - Fix a possible memory leak in ganng submit error path - DP tunneling fixes - DCN 3.1 page flip fix - DCN 3.2.x fixes - DCN 3.1.4 fixes - Don't expose degamma on hardware that doesn't support it - BACO fixes for SMU 11.x - BACO fixes for SMU 13.x - Virtual display fix for devices with no display hardware amdkfd: - Memory limit regression fix Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-17drm/amdgpu: make psp_ring_init commonAlex Deucher9-191/+26
All of the IP specific versions are the same now, so we can just use a common function. Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu/psp12: move ih_reroute into ring_createAlex Deucher1-2/+2
This matches what we do for psp 3.1 and makes ring_init common for all PSP versions. Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Enable Aldebaran devices to report CU OccupancyRamesh Errabolu1-0/+1
Allow user to know number of compute units (CU) that are in use at any given moment. Enable access to the method kgd_gfx_v9_get_cu_occupancy that computes CU occupancy. Signed-off-by: Ramesh Errabolu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: add JPEG 4.0 RAS poison consumption handlingTao Zhou1-0/+18
Register related irq handler. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: add VCN 4.0 RAS poison consumption handlingTao Zhou1-0/+10
Register irq handler. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: add RAS error query for JPEG 4.0Tao Zhou2-0/+70
Initialize JPEG RAS structure and add error query interface. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: add RAS query support for VCN 4.0Tao Zhou2-0/+66
Initialize VCN RAS structure and add RAS status query function. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: define common jpeg_set_ras_funcsTao Zhou3-12/+19
Make the code reusable. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: define common vcn_set_ras_funcsTao Zhou3-12/+19
So the code can be reused. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: enable RAS for VCN/JPEG v4.0Tao Zhou1-1/+2
Set support flag for VCN/JPEG 4.0. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error modeYiPeng Chai2-1/+10
The patch is enabling mode-1 reset for RAS recovery in fatal error mode. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Add support for RAS table at 0x40000Luben Tuikov1-1/+6
Add support for RAS table at I2C EEPROM address of 0x40000, since on some ASICs it is not at 0, but at 0x40000. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Tested-by: Kent Russell <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Interpret IPMI data for product information (v2)Luben Tuikov1-98/+85
Don't assume FRU MCU memory locations for the FRU data fields, or their sizes, instead read and interpret the IPMI data, as stipulated in the IPMI spec version 1.0 rev 1.2. Extract the Product Name, Product Part/Model Number, and the Product Serial Number by interpreting the IPMI data. Check the checksums of the stored IPMI data to make sure we don't read and give corrupted data back the the user. Eliminate small I2C reads, and instead read the whole Product Info Area in one go, and then extract the information we're seeking from it. Eliminates a whole function, making this file smaller. v2: Clarify changes in the commit message. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Tested-by: Kent Russell <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Bug-fix: Reading I2C FRU data on newer ASICsLuben Tuikov1-11/+25
Set the new correct default FRU MCU I2C address for newer ASICs, so that we can correctly read the Product Name, Product Part/Model Number and Serial Number. On newer ASICs, the FRU MCU was moved to I2C address 0x58. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Tested-by: Kent Russell <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: Allow non-standard EEPROM I2C addressLuben Tuikov1-1/+1
Allow non-standard EEPROM I2C address of 0x58, where the Device Type Identifier is 1011b, where we form 1011000b = 0x58 I2C address, as on some ASICs the FRU data lives there. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Tested-by: Kent Russell <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-18Merge tag 'drm-misc-fixes-2022-11-17' of ↵Dave Airlie2-7/+17
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes drm-misc-fixes for v6.1-rc6: - Fix error handling in vc4_atomic_commit_tail() - Set bpc for logictechno panels. - Fix potential memory leak in drm_dev_init() - Fix potential null-ptr-deref in drm_vblank_destroy_worker() - Set lima's clkname corrrectly when regulator is missing. - Small amdgpu fix to gang submission. - Revert hiding unregistered connectors from userspace, as it breaks on DP-MST. - Add workaround for DP++ dual mode adaptors that don't support i2c subaddressing. Signed-off-by: Dave Airlie <[email protected]> From: Maarten Lankhorst <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2022-11-17drm/amdgpu: cleanup amdgpu_hmm_range_get_pagesChristian König3-18/+8
Remove unused parameters and cleanup dead code. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: rename the files for HMM handlingChristian König7-29/+32
Clean that up a bit, no functional change. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: fix userptr HMM range handling v2Christian König7-51/+46
The basic problem here is that it's not allowed to page fault while holding the reservation lock. So it can happen that multiple processes try to validate an userptr at the same time. Work around that by putting the HMM range object into the mutex protected bo list for now. v2: make sure range is set to NULL in case of an error Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-17drm/amdgpu: always register an MMU notifier for userptrChristian König1-5/+3
Since switching to HMM we always need that because we no longer grab references to the pages. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Felix Kuehling <[email protected]> CC: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2022-11-16Merge tag 'drm-misc-next-2022-11-10-1' of ↵Dave Airlie29-246/+261
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 6.2: UAPI Changes: Cross-subsystem Changes: Core Changes: - atomic-helper: Add begin_fb_access and end_fb_access hooks - fb-helper: Rework to move fb emulation into helpers - scheduler: rework entity flush, kill and fini - ttm: Optimize pool allocations Driver Changes: - amdgpu: scheduler rework - hdlcd: Switch to DRM-managed resources - ingenic: Fix registration error path - lcdif: FIFO threshold tuning - meson: Fix return type of cvbs' mode_valid - ofdrm: multiple fixes (kconfig, types, endianness) - sun4i: A100 and D1 support - panel: - New Panel: Jadard JD9365DA-H3 Signed-off-by: Dave Airlie <[email protected]> From: Maxime Ripard <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20221110083612.g63eaocoaa554soh@houat
2022-11-15drm/amdgpu: stop resubmittting jobs in amdgpu_pci_resumeChristian König1-2/+0
The state of VRAM is unreliable due to a PCI event like AER, link reset or DPC. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-15drm/amdgpu: stop resubmitting jobs for GPU reset v2Christian König1-5/+1
Re-submitting IBs by the kernel has many problems because pre- requisite state is not automatically re-created as well. In other words neither binary semaphores nor things like ring buffer pointers are in the state they should be when the hardware starts to work on the IBs again. Additional to that even after more than 5 years of developing this feature it is still not stable and we have massively problems getting the reference counts right. As discussed with user space developers this behavior is not helpful in the first place. For graphics and multimedia workloads it makes much more sense to either completely re-create the context or at least re-submitting the IBs from userspace. For compute use cases re-submitting is also not very helpful since userspace must rely on the accuracy of the result. Because of this we stop this practice and instead just properly note that the fence submission was canceled. The only use case we keep the re-submission for now is SRIOV and function level resets. v2: as suggested by Sshaoyun stop resubmitting jobs even for SRIOV Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-15drm/amdgpu: revert "implement tdr advanced mode"Christian König2-104/+1
This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75. This feature basically re-submits one job after another to figure out which one was the one causing a hang. This is obviously incompatible with gang-submit which requires that multiple jobs run at the same time. It's also absolutely not helpful to crash the hardware multiple times if a clean recovery is desired. For testing and debugging environments we should rather disable recovery alltogether to be able to inspect the state with a hw debugger. Additional to that the sw implementation is clearly buggy and causes reference count issues for the hardware fence. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2022-11-15drm/amdgpu: Fixed the problem that ras error can't be queried after gpu ↵YiPeng Chai1-0/+2
recovery is completed Amdgpu_ras_set_error_query_ready is called at the start of amdgpu_device_gpu_recover to disable query ras error, but the code behind only enables query ras error in full reset path, but not in soft reset path, emergency restart path and skip the hardware reset path. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>