drm/xe: Introduce the dev_coredump infrastructure.

The goal is to use the devcoredump infrastructure to report error states captured at crash time. The error state will contain information useful for debugging GPU hangs, such as the INSTDONE registers and the buffers currently being executed, as well as any other information that helps user space and allows later replays of the error.

The proposal here is to avoid a Xe-only error_state like i915's and to use the standard dev_coredump infrastructure to expose the error state.

For our own case, the data is only useful if it is a snapshot taken at the time the GPU crash happened, since we reset the GPU immediately afterwards and the registers might have changed. So the proposal here is to keep an internal snapshot to be printed out later.

Also, a subsequent GPU hang is usually just a consequence of the initial one, so we only save the 'first' hang. dev_coredump has a delayed work queue where it removes the coredump and frees all the data within a few moments of the error. When that happens we also reset our capture state and allow further snapshots.

Right now this infra only prints out the time of the hang. More information will be migrated here in subsequent work. Also, in order to organize the dump better, the goal is to propose changes to dev_coredump itself to allow multiple files and different controls. But for now we start the Xe usage of it without any dependency on dev_coredump core changes.

v2: Add dma_fence annotation for capture that might happen during long running. (Thomas and Matt)
    Use xe->drm.primary->index on drm_info msg. (Jani)
v3: checkpatch fixes
v4: Fix building and locking issues found by Francois.
    Actually let's kill all of the locking in here. gt_reset serialization already guarantees that there will be only one capture at a time. Also, devcoredump has its own locking to protect the free and the reads, and drivers don't need to duplicate it.
    Besides this, the dma_fence locking was pushed to a following patch since it is not needed in this one.
    Fix a use after free identified by KASAN: Do not stash the faulty_engine since that will be freed somewhere else.
v5: Fix Uptime - ktime_get_boottime actually returns the Uptime. (Francois)

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
author: Rodrigo Vivi <rodrigo.vivi@intel.com> 2023-05-18 17:12:39 -0400
committer: Rodrigo Vivi <rodrigo.vivi@intel.com> 2023-12-19 18:33:51 -0500
commit: e799485044cb3c0019a226ff3a92a532ca2a4e7e (patch)
tree: 3bb5f49d9da52a53c6c516b1a1530c5c90f6921c /drivers/gpu/drm/xe/xe_devcoredump.h
parent: 3e535bd504057bab1970b2dd1b594908ca3de74d (diff)
1 files changed, 20 insertions, 0 deletions
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
new file mode 100644
index 000000000000..854882129227
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef _XE_DEVCOREDUMP_H_
+#define _XE_DEVCOREDUMP_H_
+
+struct xe_device;
+struct xe_engine;
+
+#ifdef CONFIG_DEV_COREDUMP
+void xe_devcoredump(struct xe_engine *e);
+#else
+static inline void xe_devcoredump(struct xe_engine *e)
+{
+}
+#endif
+
+#endif
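
The "save only the first hang, then re-arm once the dump is freed" behaviour described in the commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct coredump_model`, `model_capture()` and `model_free()` are hypothetical names that do not appear in the patch, and the real driver relies on gt_reset serialization and the devcoredump core rather than on anything shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct coredump_model {
	bool captured;        /* true while a snapshot is still outstanding */
	time_t snapshot_time; /* time recorded for the first hang */
};

/*
 * Record a hang. Returns true if a new snapshot was taken, false if an
 * earlier snapshot is still pending and this hang was ignored.
 */
static bool model_capture(struct coredump_model *m, time_t now)
{
	assert(m != NULL);

	if (m->captured)
		return false; /* later hangs are assumed to be fallout of the first */

	m->captured = true;
	m->snapshot_time = now;
	return true;
}

/*
 * Model of the point where the dump is removed and its data freed:
 * drop the snapshot and allow the next hang to be captured.
 */
static void model_free(struct coredump_model *m)
{
	assert(m != NULL);

	m->captured = false;
	m->snapshot_time = 0;
}
```

In this model, a second call to `model_capture()` before `model_free()` leaves the stored time untouched, which mirrors the intent that only the first hang is kept until the delayed cleanup runs.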