aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
AgeCommit message (Collapse)AuthorFilesLines
2019-12-11drm/amdgpu: log when amdgpu.dc=1 but ASIC is unsupportedSimon Ser1-0/+3
This makes it easier to figure out whether the kernel parameter has been taken into account. Signed-off-by: Simon Ser <[email protected]> Cc: Harry Wentland <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-11drm/amd/powerplay: enable pp one vf mode for vega10Yintian Tao1-11/+5
Originally, due to the restriction from PSP and SMU, VF has to send message to hypervisor driver to handle powerplay change which is complicated and redundant. Currently, SMU and PSP can support VF to directly handle powerplay change by itself. Therefore, the old code about the handshake between VF and PF to handle powerplay will be removed and VF will use new the registers below to handshake with SMU. mmMP1_SMN_C2PMSG_101: register to handle SMU message mmMP1_SMN_C2PMSG_102: register to handle SMU parameter mmMP1_SMN_C2PMSG_103: register to handle SMU response v2: remove module parameter pp_one_vf v3: fix the parens v4: forbid vf to change smu feature v5: use hwmon_attributes_visible to skip sepicified hwmon atrribute v6: change skip condition at vega10_copy_table_to_smc Signed-off-by: Yintian Tao <[email protected]> Acked-by: Evan Quan <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05drm/amdgpu: clear err_event_athub flag after reset exitLe Ma1-0/+3
Otherwise next err_event_athub error cannot call gpu reset. And following resume sequence will not be affected by this flag. v2: create function to clear amdgpu_ras_in_intr for modularity of ras driver Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05drm/amdgpu: support full gpu reset workflow when ras err_event_athub occursLe Ma1-6/+11
This athub fatal error can be recovered by baco without system-level reboot, so add a mode to use baco for the recovery. Not affect the default psp reset situations for now. Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05drm/amdgpu: add concurrent baco reset support for XGMILe Ma1-12/+70
Currently each XGMI node reset wq does not run in parrallel if bound to same cpu. Make change to bound the xgmi_reset_work item to different cpus. XGMI requires all nodes enter into baco within very close proximity before any node exit baco. So schedule the xgmi_reset_work wq twice for enter/exit baco respectively. To use baco for XGMI, PMFW supported for baco on XGMI needs to be involved. The case that PSP reset and baco reset coexist within an XGMI hive never exist and is not in the consideration. v2: define use_baco flag to simplify the code for xgmi baco sequence Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05drm/amdgpu: enable/disable doorbell interrupt in baco entry/exit helperLe Ma1-7/+12
This operation is needed when baco entry/exit for ras recovery Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-02drm/amdgpu/sriov: No need the event 3 and 4 nowEmily Deng1-2/+0
As will call unload kms when initialize fail, and the unload kms will send event 3 and 4, so don't need event 3 and 4 in device init. Signed-off-by: Emily Deng <[email protected]> Reviewed-by: Zhan Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-02drm/amdgpu: not remove sysfs if not create sysfsYintian Tao1-4/+12
When load amdgpu failed before create pm_sysfs and ucode_sysfs, the pm_sysfs and ucode_sysfs should not be removed. Otherwise, there will be warning call trace just like below. [ 24.836386] [drm] VCE initialized successfully. [ 24.841352] amdgpu 0000:00:07.0: amdgpu_device_ip_init failed [ 25.370383] amdgpu 0000:00:07.0: Fatal error during GPU init [ 25.889575] [drm] amdgpu: finishing device. [ 26.069128] amdgpu 0000:00:07.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110) [ 26.070110] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed [ 26.200309] [TTM] Finalizing pool allocator [ 26.200314] [TTM] Finalizing DMA pool allocator [ 26.200349] [TTM] Zone kernel: Used memory at exit: 0 KiB [ 26.200351] [TTM] Zone dma32: Used memory at exit: 0 KiB [ 26.200353] [drm] amdgpu: ttm finalized [ 26.205329] ------------[ cut here ]------------ [ 26.205330] sysfs group 'fw_version' not found for kobject '0000:00:07.0' [ 26.205347] WARNING: CPU: 0 PID: 1228 at fs/sysfs/group.c:256 sysfs_remove_group+0x80/0x90 [ 26.205348] Modules linked in: amdgpu(OE+) gpu_sched(OE) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc snd_hda_codec_generic ledtrig_audio crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hda_core snd_hwdep snd_pcm snd_timer input_leds snd joydev soundcore serio_raw pcspkr evbug aesni_intel aes_x86_64 crypto_simd cryptd mac_hid glue_helper sunrpc ip_tables x_tables autofs4 8139too psmouse 8139cp mii i2c_piix4 pata_acpi floppy [ 26.205369] CPU: 0 PID: 1228 Comm: modprobe Tainted: G OE 5.2.0-rc1 #1 [ 26.205370] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 26.205372] RIP: 0010:sysfs_remove_group+0x80/0x90 [ 26.205374] Code: e8 35 b9 ff ff 5b 41 5c 41 5d 5d c3 48 89 df e8 f6 b5 ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 48 7a 70 98 e8 60 63 d3 ff <0f> 0b eb d7 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 [ 26.205375] RSP: 0018:ffffbee242b0b908 EFLAGS: 00010282 [ 26.205376] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006 [ 26.205377] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff97ad6f817380 [ 26.205377] RBP: ffffbee242b0b920 R08: ffffffff98f520c4 R09: 00000000000002b3 [ 26.205378] R10: ffffbee242b0b8f8 R11: 00000000000002b3 R12: ffffffffc0e58240 [ 26.205379] R13: ffff97ad6d1fe0b0 R14: ffff97ad4db954c8 R15: ffff97ad4db7fff0 [ 26.205380] FS: 00007ff3d8a1c4c0(0000) GS:ffff97ad6f800000(0000) knlGS:0000000000000000 [ 26.205381] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 26.205381] CR2: 00007f9b2ef1df04 CR3: 000000042aab8001 CR4: 00000000003606f0 [ 26.205384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 26.205385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 26.205385] Call Trace: [ 26.205461] amdgpu_ucode_sysfs_fini+0x18/0x20 [amdgpu] [ 26.205518] amdgpu_device_fini+0x3b4/0x560 [amdgpu] [ 26.205573] amdgpu_driver_unload_kms+0x4f/0xa0 [amdgpu] [ 26.205623] amdgpu_driver_load_kms+0xcd/0x250 [amdgpu] [ 26.205637] drm_dev_register+0x12b/0x1c0 [drm] [ 26.205695] amdgpu_pci_probe+0x12a/0x1e0 [amdgpu] [ 26.205699] local_pci_probe+0x47/0xa0 [ 26.205701] pci_device_probe+0x106/0x1b0 [ 26.205704] really_probe+0x21a/0x3f0 [ 26.205706] driver_probe_device+0x11c/0x140 [ 26.205707] device_driver_attach+0x58/0x60 [ 26.205709] __driver_attach+0xc3/0x140 Signed-off-by: Yintian Tao <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-26drm/amdgpu: move pci handling out of pm opsAlex Deucher1-19/+14
The documentation says the that PCI core handles this for you unless you choose to implement it. Just rely on the PCI core to handle the pci specific bits. Reviewed-by: Zhan Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: disentangle runtime pm and vga_switcherooAlex Deucher1-8/+14
Originally we only supported runtime pm on PX/HG laptops so vga_switcheroo and runtime pm are sort of entangled. Attempt to logically separate them. Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: add helpers for baco entry and exitAlex Deucher1-0/+61
BACO - Bus Active, Chip Off Will be used for runtime pm. Entry will enter the BACO state (chip off). Exit will exit the BACO state (chip on). Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: rename amdgpu_device_is_px to amdgpu_device_supports_boco (v2)Alex Deucher1-4/+4
BACO - Bus Active, Chip Off BOCO - Bus Off, Chip Off To better match what we are checking for and to align with amdgpu_device_supports_baco. BOCO is used on PowerXpress/Hybrid Graphics systems and BACO is used on desktop dGPU boards. v2: fix typo in documentation Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: add a amdgpu_device_supports_baco helperAlex Deucher1-0/+15
BACO - Bus Active, Chip Off To check if a device supports BACO or not. This will be used in determining when to enable runtime pm. Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: put flush_delayed_work at firstYintian Tao1-3/+1
There is one regression from 042f3d7b745cd76aa To put flush_delayed_work after adev->shutdown = true which will make amdgpu_ih_process not response the irq At last, all ib ring tests will be failed just like below [drm] amdgpu: finishing device. [drm] Fence fallback timer expired on ring gfx [drm] Fence fallback timer expired on ring comp_1.0.0 [drm] Fence fallback timer expired on ring comp_1.1.0 [drm] Fence fallback timer expired on ring comp_1.2.0 [drm] Fence fallback timer expired on ring comp_1.3.0 [drm] Fence fallback timer expired on ring comp_1.0.1 amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.1 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.2.1 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.3.1 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd_enc_0.0 (-110). amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110). [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110). v2: replace cancel_delayed_work_sync() with flush_delayed_work() Signed-off-by: Yintian Tao <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-19drm/amdgpu: add driver support for JPEG2.0 and aboveLeo Liu1-0/+2
By using JPEG IP block type Signed-off-by: Leo Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-13drm/amdgpu: add function parameter description in 'amdgpu_device_set_cg_state'yu kuai1-0/+1
Fixes gcc warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1954: warning: Function parameter or member 'state' not described in 'amdgpu_device_set_cg_state' Fixes: e3ecdffac9cc ("drm/amdgpu: add documentation for amdgpu_device.c") Signed-off-by: yu kuai <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-13drm/amd/display: rename DCN1_0 kconfig to DCNBhawanpreet Lakha1-1/+1
Since dcn20 and dcn21 are under dcn1 it doesnt make sense to have it named dcn1. Change it to "dcn" to make it generic Signed-off-by: Bhawanpreet Lakha <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-13drm/amd/display: Drop CONFIG_DRM_AMD_DC_DCN2_1 flagBhawanpreet Lakha1-2/+0
[Why] DCN21 is stable enough to be build by default. So drop the flags. [How] Remove them using the unifdef tool. The following commands were executed in sequence: $ find -name '*.c' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DCN2_1 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_1 '{}' ';' $ find -name '*.h' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DCN2_1 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_1 '{}' ';' In addition: * Remove from kconfig, and replace any dependencies with DCN1_0. * Remove from any makefiles. * Fix and cleanup Renoir definitions in dal_asic_id.h * Expand DCN1 ifdef to include DCN21 code in the following files: * clk_mgr/clk_mgr.c: dc_clk_mgr_create() * core/dc_resources.c: dc_create_resource_pool() * gpio/hw_factory.c: dal_hw_factory_init() * gpio/hw_translate.c: dal_hw_translate_init() Signed-off-by: Bhawanpreet Lakha <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-13drm/amd/display: Drop CONFIG_DRM_AMD_DC_DCN2_0 and DSC_SUPPORTEDBhawanpreet Lakha1-4/+0
[Why] DCN2 and DSC are stable enough to be build by default. So drop the flags. [How] Remove them using the unifdef tool. The following commands were executed in sequence: $ find -name '*.c' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DSC_SUPPORT -DCONFIG_DRM_AMD_DC_DCN2_0 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_0 '{}' ';' $ find -name '*.h' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DSC_SUPPORT -DCONFIG_DRM_AMD_DC_DCN2_0 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_0 '{}' ';' In addition: * Remove from kconfig, and replace any dependencies with DCN1_0. * Remove from any makefiles. * Fix and cleanup NV defninitions in dal_asic_id.h * Expand DCN1 ifdef to include DCN2 code in the following files: * clk_mgr/clk_mgr.c: dc_clk_mgr_create() * core/dc_resources.c: dc_create_resource_pool() * dce/dce_dmcu.c: dcn20_*lock_phy() * dce/dce_dmcu.c: dcn20_funcs * dce/dce_dmcu.c: dcn20_dmcu_create() * gpio/hw_factory.c: dal_hw_factory_init() * gpio/hw_translate.c: dal_hw_translate_init() Signed-off-by: Leo Li <[email protected]> Signed-off-by: Bhawanpreet Lakha <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-11drm/amd/amdgpu: finish delay works before release resourcesJesse Zhang1-0/+3
flush/cancel delayed works before doing finalization to avoid concurrently requests. Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-06drm/amdgpu: perform p-state switch after the whole hive initializedEvan Quan1-12/+35
P-state switch should be performed after all devices from the hive get initialized. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-06drm/amdgpu: register gpu instance before fan boost feature enablmentEvan Quan1-0/+7
Otherwise, the feature enablement will be skipped due to wrong count. Fixes: beff74bc6e0fa91 ("drm/amdgpu: fix a race in GPU reset with IB test (v2)") Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-11-06drm/amdgpu: change pstate only after all XGMI device initializedEvan Quan1-3/+12
Pstate settings should be performed after all device of the XGMI setup get initialized. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-30drm/amdgpu: bypass some cleanup work after err_event_athub (v2)Le Ma1-0/+6
PSP lost connection when err_event_athub occurs. These cleanup work can be skipped in BACO reset. v2: squash in missing include (Alex) Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-28drm/amd: correct "_LENTH" mispelling in constantWambui Karuga1-2/+2
Correct the "_LENTH" mispelling in the AMDGPU_MAX_TIMEOUT_PARAM_LENGTH constant. Signed-off-by: Wambui Karuga <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-25drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.Andrey Grodzovsky1-0/+13
For Arcturus the I2C traffic is done through SMU tables and so we must postpone RAS recovery init to after they are ready which is in amdgpu_device_ip_hw_init_phase2. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Evan Quan <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-17drm/amdgpu: add a generic fb accessing helper function(v3)Tianci.Yin1-0/+30
add a generic helper function for accessing framebuffer via MMIO Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Tianci.Yin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-15drm/amdgpu: move gpu reset out of amdgpu_device_suspendAlex Deucher1-4/+0
Move it into the caller. There are cases were we don't want it. We need it for hibernation, but we don't need it for runtime pm, so drop it for runtime pm. Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-15drm/amdgpu: move pci_save_state into suspend pathAlex Deucher1-1/+1
for amdgpu_device_suspend. This follows the logic in the resume path. Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-15drm/amdgpu: Fix tdr3 could hang with slow compute issueEmily Deng1-1/+4
When index is 1, need to set compute ring timeout for sriov and passthrough. Signed-off-by: Emily Deng <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-07drm/amdgpu: move amdgpu_device_get_job_timeout_settingsAlex Deucher1-0/+64
It's only used in amdgpu_device.c and the naming also reflects that. Move it there. Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu: Iterate through DRM connectors correctlyLyude Paul1-6/+14
Currently, every single piece of code in amdgpu that loops through connectors does it incorrectly and doesn't use the proper list iteration helpers, drm_connector_list_iter_begin() and drm_connector_list_iter_end(). Yeesh. So, do that. Cc: Juston Li <[email protected]> Cc: Imre Deak <[email protected]> Cc: Ville Syrjälä <[email protected]> Cc: Harry Wentland <[email protected]> Cc: Daniel Vetter <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Lyude Paul <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amd/amdgpu:Fix compute ring unable to detect hang.Jesse Zhang1-6/+6
When compute fence did not signal, compute ring cannot detect hardware hang because its timeout value is set to be infinite by default. In SR-IOV and passthrough mode, if user does not declare custome timeout value for compute ring, then use gfx ring timeout value as default. So that when there is a ture hardware hang, compute ring can detect it. Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03drm/amdgpu/discovery: get gpu info from ip discovery tableXiaojie Yuan1-0/+12
except soc_bounding_box which is not integrated in discovery table yet Signed-off-by: Xiaojie Yuan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amd/powerplay: properly set mp1 state for SW SMU suspend/reset routineEvan Quan1-6/+6
Set mp1 state properly for SW SMU suspend/reset routine. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: remove needless usage of #ifdefShirish S1-5/+1
define sched_policy in case CONFIG_HSA_AMD is not enabled, with this there is no need to check for CONFIG_HSA_AMD else where in driver code. Suggested-by: Felix Kuehling <[email protected]> Signed-off-by: Shirish S <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: fix build error without CONFIG_HSA_AMDShirish S1-1/+5
If CONFIG_HSA_AMD is not set, build fails: drivers/gpu/drm/amd/amdgpu/amdgpu_device.o: In function `amdgpu_device_ip_early_init': drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1626: undefined reference to `sched_policy' Use CONFIG_HSA_AMD to guard this. Fixes: 1abb680ad371 ("drm/amdgpu: disable gfxoff while use no H/W scheduling policy") Signed-off-by: Shirish S <[email protected]> Reviewed-by: Huang Rui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: disable gfxoff while use no H/W scheduling policyHuang Rui1-1/+1
While gfxoff is enabled, the mmVM_XXX registers will be 0xfffffff while the GFX is in "off" state. KFD queue creattion doesn't use ring based method, so it will trigger a VM fault. Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16drm/amdgpu: Add a kernel parameter for specifying the asic typeYong Zhao1-1/+6
As more and more new asics start to reuse the old device IDs before launch, there is a need to quickly override the existing asic type corresponding to the reused device ID through a kernel parameter. With this, engineers no longer need to rely on local hack patches, facilitating cooperation across teams. Signed-off-by: Yong Zhao <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵Tao Zhou1-5/+0
proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdkfd: Query kfd device info by CHIP id instead of pci device idYong Zhao1-1/+1
This optimizes out the pci device id usage in KFD and makes the code more maintainable. Signed-off-by: Yong Zhao <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13dmr/amdgpu: Add system auto reboot to RAS.Andrey Grodzovsky1-0/+14
In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdgpu: Avoid HW GPU reset for RAS.Andrey Grodzovsky1-10/+28
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.Andrey Grodzovsky1-13/+10
Issue 1: In XGMI case amdgpu_device_lock_adev for other devices in hive was called to late, after access to their repsective schedulers. So relocate the lock to the begining of accessing the other devs. Issue 2: Using amdgpu_device_ip_need_full_reset to switch the device list from all devices in hive to the single 'master' device who owns this reset call is wrong because when stopping schedulers we iterate all the devices in hive but when restarting we will only reactivate the 'master' device. Also, in case amdgpu_device_pre_asic_reset conlcudes that full reset IS needed we then have to stop schedulers for all devices in hive and not only the 'master' but with amdgpu_device_ip_need_full_reset we already missed the opprotunity do to so. So just remove this logic and always stop and start all schedulers for all devices in hive. Also minor cleanup and print fix. v4: Minor coding style fix. Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-30drm/amdgpu: Handle job is NULL use case in amdgpu_device_gpu_recoverAndrey Grodzovsky1-6/+4
This should be checked at all places job is accessed. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-29drm/amdgpu: Enable DC on RenoirRoman Li1-0/+3
Enable DC support for renoir. Acked-by: Harry Wentland <[email protected]> Signed-off-by: Roman Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-29drm/amdgpu: introduce vram lost for reset (v2)Monk Liu1-2/+2
for SOC15/vega10 the BACO reset & mode1 would introduce vram lost in high end address range, current kmd's vram lost checking cannot catch it since it only check very ahead visible frame buffer v2: cover NV as well Signed-off-by: Monk Liu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-15drm/amdgpu: Use new mode2 reset interface for RV.Andrey Grodzovsky1-0/+1
Integrate the mode2 reset into rest sequence. v2: Check ppfuncs pointer for NULL Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-12drm/amdgpu: add renoir support for gpu_info and ip block settingHuang Rui1-1/+7
This patch adds renoir support for gpu_info firmware and ip block setting. Acked-by: Huang Rui <[email protected]> Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-12drm/amdgpu: add renoir asic_type enumHuang Rui1-0/+1
This patch adds renoir to amd_asic_type enum and amdgpu_asic_name[]. Acked-by: Huang Rui <[email protected]> Signed-off-by: Huang Rui <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>