blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2020-12-08	drm/amdgpu: fix debugfs creation/removal, again	Arnd Bergmann	1	-6/+0
	There is still a warning when CONFIG_DEBUG_FS is disabled: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function] 1145 \| static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev) Change the code again to make the compiler actually drop this code but not warn about it. Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS") Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-10-30	drm/amdgpu: fix the issue of reserving bad pages failed	Dennis Li	1	-8/+0
	In amdgpu_ras_reset_gpu, because bad pages may not be freed, it has high probability to reserve bad pages failed. Change to reserve bad pages when freeing VRAM. v2: 1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c 2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr reserve bad page failed, it will put it into pending list, otherwise put it into processed list; 3. remove amdgpu_ras_release_bad_pages, because retired page's info has been moved into amdgpu_vram_mgr v3: 1. formate code style; 2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation; 3. rename scope_pending as reservations_pending; 4. rename scope_processed as reserved_pages; 5. change to iterate over all the pending ones and try to insert them with drm_mm_reserve_node(); v4: 1. rename amdgpu_vram_mgr_reserve_scope as amdgpu_vram_mgr_reserve_range; 2. remove unused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Reviewed-by: Christian König <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Dennis Li <[email protected]> Signed-off-by: Wenhui Sheng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-10-30	drm/amdgpu: remove redundant GPU reset	Dennis Li	1	-9/+1
	Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-10-30	drm/amdgpu: change to save bad pages in UMC error interrupt callback	Dennis Li	1	-0/+1
	Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-14	drm/amdgpu: bypass querying ras error count registers	Guchun Chen	1	-0/+3
	Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count register harvest of all IPs. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-04	drm/amdgpu: break GPU recovery once it's in bad state(v4)	Guchun Chen	1	-0/+2
	When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-04	drm/amdgpu: skip bad page reservation once issuing from eeprom write	Guchun Chen	1	-3/+11
	Once the ras recovery is issued from eeprom write itself, bad page reservation should be ignored, otherwise, recursive calling of writting to eeprom would happen. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-08-04	drm/amdgpu: validate bad page threshold in ras(v3)	Guchun Chen	1	-0/+3
	Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-07-15	drm/amdgpu: RAS emergency restart logic refine	Wenhui Sheng	1	-0/+1
	If we are in RAS triggered situation and BACO isn't support, emergency restart is needed, and this code is only needed for some specific cases(vega20 with given smu fw version). After we add smu mode1 reset for sienna cichlid, we need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset, so in amdgpu_device_gpu_recover, we need differentiate which mode1 reset we are using, then decide if it's a full reset and then decide if emergency restart is needed, the logic will become much more complex. After discussion with Hawking, move emergency restart logic to an independent function. Signed-off-by: Likun Gao <[email protected]> Signed-off-by: Wenhui Sheng <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-04-01	drm/amdgpu: disable ras query and iject during gpu reset	John Clements	1	-0/+4
	added flag to ras context to indicate if ras query functionality is ready Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2020-03-10	drm/amdgpu: add function to creat all ras debugfs node	Tao Zhou	1	-0/+2
	centralize all debugfs creation in one place for ras this is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-18	drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu	Guchun Chen	1	-2/+1
	BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05	drm/amdgpu: clear err_event_athub flag after reset exit	Le Ma	1	-0/+5
	Otherwise next err_event_athub error cannot call gpu reset. And following resume sequence will not be affected by this flag. v2: create function to clear amdgpu_ras_in_intr for modularity of ras driver Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-12-05	drm/amdgpu: export amdgpu_ras_find_obj to use externally	Le Ma	1	-0/+3
	Change it to external interface. Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-10-03	drm/amdgpu: update parameter of ras_ih_cb	Tao Zhou	1	-1/+1
	change struct ras_err_data err_data to void err_data, align with umc code and the callback's declaration in each ras block could pay no attention to the structure type Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16	drm/amdgpu: fix ras ctrl debugfs node leak	Guchun Chen	1	-2/+0
	Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-16	drm/amdgpu: Fix mutex lock from atomic context.	Andrey Grodzovsky	1	-1/+3
	Problem: amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu because writing to EEPROM during ASIC reset was unstable. But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called directly from ISR context and so locking is not allowed. Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific. Fix: Avoid calling amdgpu_ras_reserve_bad_pages if not in task context. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵	Tao Zhou	1	-0/+5
	proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	drm/amdgpu: save umc error records	Tao Zhou	1	-1/+1
	save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	drm/amdgpu: change ras bps type to eeprom table record structure	Tao Zhou	1	-6/+5
	change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	dmr/amdgpu: Add system auto reboot to RAS.	Andrey Grodzovsky	1	-1/+1
	In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	drm/amdgpu: Avoid HW GPU reset for RAS.	Andrey Grodzovsky	1	-0/+10
	Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-09-13	drm/amdgpu: add helper function to do common ras_late_init/fini (v3)	Hawking Zhang	1	-0/+7
	In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-27	drm/amdgpu: Add RAS EEPROM table.	Andrey Grodzovsky	1	-0/+3
	Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-08-23	drm/amdgpu: correct ras error count type	Guchun Chen	1	-1/+1
	Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-31	drm/amdgpu: add define for gfx ras subblock	Dennis Li	1	-0/+230
	Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-31	drm/amdgpu: allow ras interrupt callback to return error data	Tao Zhou	1	-18/+19
	add error data as parameter for ras interrupt cb and process it Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-31	drm/amdgpu: add support for recording ras error address	Tao Zhou	1	-0/+2
	more than one error address may be recorded in one query Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-07-31	drm/amdgpu: move some ras data structure to amdgpu_ras.h	Hawking Zhang	1	-1/+68
	These are common structures that can be included by IP specific source files Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Dennis Li <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-06-13	amdgpu: no need to check return value of debugfs_create functions	Greg Kroah-Hartman	1	-2/+2
	When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: "David (ChunMing) Zhou" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: xinhui pan <[email protected]> Cc: Evan Quan <[email protected]> Cc: Feifei Xu <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-06-11	drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()	Dan Carpenter	1	-0/+2
	The "block" variable can be set by the user through debugfs, so it can be quite large which leads to shift wrapping here. This means we report a "block" as supported when it's not, and that leads to array overflows later on. This bug is not really a security issue in real life, because debugfs is generally root only. Fixes: 36ea1bd2d084 ("drm/amdgpu: add debugfs ctrl node") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-05-24	drm/amdgpu: ras support suspend/resume	xinhui pan	1	-1/+3
	add ras suspend function. rename ras_post_init to amdgpu_ras_resume. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: James Zhu <[email protected]> Tested-by: James Zhu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-05-24	drm/amdgpu: add badpages sysfs interafce	xinhui pan	1	-0/+1
	add badpages node. it will output badpages list in format gpu pfn : gpu page size : flags example 0x00000000 : 0x00001000 : R 0x00000001 : 0x00001000 : R 0x00000002 : 0x00001000 : R 0x00000003 : 0x00001000 : R 0x00000004 : 0x00001000 : R 0x00000005 : 0x00001000 : R 0x00000006 : 0x00001000 : R 0x00000007 : 0x00001000 : P 0x00000008 : 0x00001000 : P 0x00000009 : 0x00001000 : P flags can be one of below characters R: reserved. P: pending for reserve. F: failed to reserve for some reasons. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-05-24	drm/amdgpu: handle ras reset	xinhui pan	1	-0/+3
	add another flag to allow IP do a gpu reset after device init. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-05-24	drm/amdgpu: Revert "drm/amdgpu: skip gpu reset when ras error occured"	xinhui pan	1	-3/+0
	Enable this now to reset the GPU on RAS errors. This reverts commit 138352e5752aa3e694951d70c8fe8730219f4edf. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-04-10	drm/amdgpu: Introduce another ras enable function	xinhui pan	1	-0/+3
	Many parts of the whole SW stack can program the ras enablement state during the boot. Now we handle that case by adding one function which check the ras flags and choose different code path. Reviewed-by: Evan Quan <[email protected]> Signed-off-by: xinhui pan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-27	drm/amdgpu: Fix amdgpu ras to ta enums conversion	xinhui pan	1	-0/+56
	Add helpes to transalte the two enums. And it will catch bugs easily. Signed-off-by: xinhui pan <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19	drm/amdgpu: add new ras workflow control flags	xinhui pan	1	-0/+3
	add ras post init function. Do some initialization after all IP have finished their late init. Add new member flags which will control the ras work flow. For now, vbios enable ras for us on boot. That might change in the future. So there should be a flag from vbios to tell us if ras is enabled or not on boot. Looks like there is no such info now. Other bits of the flags are reserved to control other parts of ras. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19	drm/amdgpu: add new member hw_supported	xinhui pan	1	-0/+3
	Currently, it is not clear how ras is supported. Both software and hardware can set the supported. That is confusing. Fix it by adding new member hw_supported. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Evan Quan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19	drm/amdgpu: skip gpu reset when ras error occured	xinhui pan	1	-0/+3
	gpu reset is not stable on vega20 A1. Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19	drm/amdgpu: add debugfs ctrl node	xinhui pan	1	-0/+9
	allow userspace enable/disable ras Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2019-03-19	drm/amdgpu: add amdgpu_ras.c to support ras (v2)	xinhui pan	1	-0/+217
	add obj management. add feature control. add debugfs infrastructure. add sysfs infrastructure. add IH infrastructure. add recovery infrastructure. It is a framework. Other IPs need call amdgpu_ras_xxx function instead of psp_ras_xxx functions. v2: squash in warning fixes Signed-off-by: xinhui pan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>