aboutsummaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/xe
AgeCommit message (Collapse)AuthorFilesLines
2024-11-13drm/xe/oa: Fix "Missing outer runtime PM protection" warningAshutosh Dixit1-0/+2
Fix the following drm_WARN: [953.586396] xe 0000:00:02.0: [drm] Missing outer runtime PM protection ... <4> [953.587090] ? xe_pm_runtime_get_noresume+0x8d/0xa0 [xe] <4> [953.587208] guc_exec_queue_add_msg+0x28/0x130 [xe] <4> [953.587319] guc_exec_queue_fini+0x3a/0x40 [xe] <4> [953.587425] xe_exec_queue_destroy+0xb3/0xf0 [xe] <4> [953.587515] xe_oa_release+0x9c/0xc0 [xe] Suggested-by: John Harrison <[email protected]> Suggested-by: Matthew Brost <[email protected]> Fixes: e936f885f1e9 ("drm/xe/oa/uapi: Expose OA stream fd") Cc: [email protected] Signed-off-by: Ashutosh Dixit <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit b107c63d2953907908fd0cafb0e543b3c3167b75) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-13drm/xe: handle flat ccs during hibernation on igpuMatthew Auld1-1/+12
Starting from LNL, CCS has moved over to flat CCS model where there is now dedicated memory reserved for storing compression state. On platforms like LNL this reserved memory lives inside graphics stolen memory, which is not treated like normal RAM and is therefore skipped by the core kernel when creating the hibernation image. Currently if something was compressed and we enter hibernation all the corresponding CCS state is lost on such HW, resulting in corrupted memory. To fix this evict user buffers from TT -> SYSTEM to ensure we take a snapshot of the raw CCS state when entering hibernation, where upon resuming we can restore the raw CCS state back when next validating the buffer. This has been confirmed to fix display corruption on LNL when coming back from hibernation. Fixes: cbdc52c11c9b ("drm/xe/xe2: Support flat ccs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3409 Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit c8b3c6db941299d7cc31bd9befed3518fdebaf68) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-13drm/xe: improve hibernation on igpuMatthew Auld2-27/+16
The GGTT looks to be stored inside stolen memory on igpu which is not treated as normal RAM. The core kernel skips this memory range when creating the hibernation image, therefore when coming back from hibernation the GGTT programming is lost. This seems to cause issues with broken resume where GuC FW fails to load: [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1 [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 [drm] *ERROR* GT0: firmware signature verification failed [drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged. Current GGTT users are kernel internal and tracked as pinned, so it should be possible to hook into the existing save/restore logic that we use for dgpu, where the actual evict is skipped but on restore we importantly restore the GGTT programming. This has been confirmed to fix hibernation on at least ADL and MTL, though likely all igpu platforms are affected. This also means we have a hole in our testing, where the existing s4 tests only really test the driver hooks, and don't go as far as actually rebooting and restoring from the hibernation image and in turn powering down RAM (and therefore losing the contents of stolen). v2 (Brost) - Remove extra newline and drop unnecessary parentheses. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275 Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Matthew Brost <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit f2a6b8e396666d97ada8e8759dfb6a69d8df6380) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-13drm/xe: Restore system memory GGTT mappingsMatthew Brost2-4/+11
GGTT mappings reside on the device and this state is lost during suspend / d3cold thus this state must be restored resume regardless if the BO is in system memory or VRAM. v2: - Unnecessary parentheses around bo->placements[0] (Checkpatch) Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit a19d1db9a3fa89fabd7c83544b84f393ee9b851f) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-13drm/xe: Ensure all locks released in exec IOCTLMatthew Brost1-2/+2
In couple of places the wrong error handling goto was used to release locks. Fix these to ensure all locks dropped on exec IOCTL errors. Cc: Francois Dugast <[email protected]> Fixes: d16ef1a18e39 ("drm/xe/exec: Switch hw engine group execution mode upon job submission") Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Francois Dugast <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 9e7aacd8402b88394e6a83cb242901fde77a1773) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-05drm/xe: Stop accumulating LRC timestamp on job_freeLucas De Marchi2-2/+6
The exec queue timestamp is only really useful when it's being queried through the fdinfo. There's no need to update it so often, on every job_free. Tracing a simple app like vkcube running shows an update rate of ~ 120Hz. In case of discrete, the BO is on vram, creating a lot of pcie transactions. The update on job_free() is used to cover a gap: if exec queue is created and destroyed rapidly, before a new query, the timestamp still needs to be accumulated and accounted for in the xef. Initial implementation in commit 6109f24f87d7 ("drm/xe: Add helper to accumulate exec queue runtime") couldn't do it on the exec_queue_fini since the xef could be gone at that point. However since commit ce8c161cbad4 ("drm/xe: Add ref counting for xe_file") the xef is refcounted and the exec queue always holds a reference, making this safe now. Improve the fix in commit 2149ded63079 ("drm/xe: Fix use after free when client stats are captured") by reducing the frequency in which the update is needed. Fixes: 2149ded63079 ("drm/xe: Fix use after free when client stats are captured") Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Jonathan Cavitt <[email protected]> Reviewed-by: Umesh Nerlige Ramappa <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 83db047d9425d9a649f01573797558eff0f632e1) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-05drm/xe/pf: Fix potential GGTT allocation leakMichal Wajdeczko1-1/+3
In unlikely event that we fail during sending the new VF GGTT configuration to the GuC, we will free only the GGTT node data struct but will miss to release the actual GGTT allocation. This will later lead to list corruption, GGTT space leak and finally risking crash when unloading the driver: [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-EIO) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] list_add corruption. next->prev should be prev (ffff88813cfcd628), but was 0000000000000000. (next=ffff88813cfe2028). [ ] RIP: 0010:__list_add_valid_or_report+0x6b/0xb0 [ ] Call Trace: [ ] drm_mm_insert_node_in_range+0x2c0/0x4e0 [ ] xe_ggtt_node_insert+0x46/0x70 [xe] [ ] pf_provision_vf_ggtt+0x7f5/0xa70 [xe] [ ] xe_gt_sriov_pf_config_set_ggtt+0x5e/0x770 [xe] [ ] ggtt_set+0x4b/0x70 [xe] [ ] simple_attr_write_xsigned.constprop.0.isra.0+0xb0/0x110 [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-ENOSPC) [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT [ ] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP NOPTI [ ] RIP: 0010:drm_mm_remove_node+0x1b7/0x390 [ ] Call Trace: [ ] <TASK> [ ] ? die_addr+0x2e/0x80 [ ] ? exc_general_protection+0x1a1/0x3e0 [ ] ? asm_exc_general_protection+0x22/0x30 [ ] ? drm_mm_remove_node+0x1b7/0x390 [ ] ggtt_node_remove+0xa5/0xf0 [xe] [ ] xe_ggtt_node_remove+0x35/0x70 [xe] [ ] xe_ttm_bo_destroy+0x123/0x220 [xe] [ ] intel_user_framebuffer_destroy+0x44/0x70 [xe] [ ] intel_plane_destroy_state+0x3b/0xc0 [xe] [ ] drm_atomic_state_default_clear+0x1cd/0x2f0 [ ] intel_atomic_state_clear+0x9/0x20 [xe] [ ] __drm_atomic_state_free+0x1d/0xb0 Fix that by using pf_release_ggtt() on the error path, which now works regardless if the node has GGTT allocation or not. Fixes: 34e804220f69 ("drm/xe: Make xe_ggtt_node struct independent") Signed-off-by: Michal Wajdeczko <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Matthew Auld <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 43b1dd2b550f0861ce80fbfffd5881b1b26272b1) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-05drm/xe: Drop VM dma-resv lock on xe_sync_in_fence_get failure in exec IOCTLMatthew Brost1-0/+1
Upon failure all locks need to be dropped before returning to the user. Fixes: 58480c1c912f ("drm/xe: Skip VMAs pin when requesting signal to the last XE_EXEC") Cc: <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Tejas Upadhyay <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 7d1a4258e602ffdce529f56686925034c1b3b095) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-05drm/xe: Fix possible exec queue leak in exec IOCTLMatthew Brost1-4/+8
In a couple of places after an exec queue is looked up the exec IOCTL returns on input errors without dropping the exec queue ref. Fix this ensuring the exec queue ref is dropped on input error. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Tejas Upadhyay <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 07064a200b40ac2195cb6b7b779897d9377e5e6f) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-04drm/xe/guc/tlb: Flush g2h worker in case of tlb timeoutNirmoy Das1-0/+2
Flush the g2h worker explicitly if TLB timeout happens which is observed on LNL and that points to the recent scheduling issue with E-cores on LNL. This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is E core scheduling fix. v2: Add platform check(Himal) v3: Remove gfx platform check as the issue related to cpu platform(John) Use the common WA macro(John) and print when the flush resolves timeout(Matt B) v4: Remove the resolves log and do the flush before taking pending_lock(Matt A) Cc: Badal Nilawar <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Matthew Auld <[email protected]> Cc: John Harrison <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Cc: Lucas De Marchi <[email protected]> Cc: [email protected] # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit e1f6fa55664a0eeb0a641f497e1adfcf6672e995) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-04drm/xe/ufence: Flush xe ordered_wq in case of ufence timeoutNirmoy Das1-0/+7
Flush xe ordered_wq in case of ufence timeout which is observed on LNL and that points to recent scheduling issue with E-cores. This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is a E-core scheduling fix for LNL. v2: Add platform check(Himal) s/__flush_workqueue/flush_workqueue(Jani) v3: Remove gfx platform check as the issue related to cpu platform(John) v4: Use the Common macro(John) and print when the flush resolves timeout(Matt B) Cc: Badal Nilawar <[email protected]> Cc: Matthew Auld <[email protected]> Cc: John Harrison <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Cc: Lucas De Marchi <[email protected]> Cc: [email protected] # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754 Suggested-by: Matthew Brost <[email protected]> Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 38c4c8722bd74452280951edc44c23de47612001) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-04drm/xe: Move LNL scheduling WA to xe_device.hNirmoy Das2-10/+15
Move LNL scheduling WA to xe_device.h so this can be used in other places without needing keep the same comment about removal of this WA in the future. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed. Cc: Badal Nilawar <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Cc: Lucas De Marchi <[email protected]> cc: [email protected] # v6.11+ Suggested-by: John Harrison <[email protected]> Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit cbe006a6492c01a0058912ae15d473f4c149896c) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-04drm/xe: Use the filelist from drm for ccs_mode changeBalasubramani Vivekanandan3-23/+5
Drop the exclusive client count tracking and use the filelist from the drm to track the active clients. This also ensures the clients created internally by the driver won't block changing the ccs mode. Fixes: ce8c161cbad4 ("drm/xe: Add ref counting for xe_file") Signed-off-by: Balasubramani Vivekanandan <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 1c35f1ed1fe3c649f8c16214d0d3dd828b5265d9) Signed-off-by: Lucas De Marchi <[email protected]>
2024-11-04drm/xe: Set mask bits for CCS_MODE registerBalasubramani Vivekanandan2-1/+7
CCS_MODE register requires setting mask bits from Xe2+ platforms. Set the mask bits unconditionally, as those bits are unused for older platforms. Signed-off-by: Balasubramani Vivekanandan <[email protected]> Cc: [email protected] # v6.11+ Reviewed-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 23ea2c7572d4735ef66beb1e4feb8ae510b78247) [ Fix conflict with mmio refactors ] Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-31drm/xe: Don't short circuit TDR on jobs not startedMatthew Brost1-6/+12
Short circuiting TDR on jobs not started is an optimization which is not required. On LNL we are facing an issue where jobs do not get scheduled by the GuC if it misses a GGTT page update. When this occurs let the TDR fire, toggle the scheduling which may get the job unstuck, and print a warning message. If the TDR fires twice on job that hasn't started, timeout the job. v2: - Add warning message (Paulo) - Add fixes tag (Paulo) - Timeout job which hasn't started after TDR firing twice v3: - Include local change v4: - Short circuit check_timeout on job not started - use warn level rather than notice (Paulo) Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Cc: [email protected] Cc: Paulo Zanoni <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 35d25a4a0012e690ef0cc4c5440231176db595cc) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-31drm/xe: Add mmio read before GGTT invalidateMatthew Brost1-0/+10
On LNL without a mmio read before a GGTT invalidate the GuC can incorrectly read the GGTT scratch page upon next access leading to jobs not getting scheduled. A mmio read before a GGTT invalidate seems to fix this. Since a GGTT invalidate is not a hot code path, blindly do a mmio read before each GGTT invalidate. Cc: John Harrison <[email protected]> Cc: Daniele Ceraolo Spurio <[email protected]> Cc: Thomas Hellström <[email protected]> Cc: Lucas De Marchi <[email protected]> Cc: [email protected] Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Paulo Zanoni <[email protected]> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3164 Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 5a710196883e0ac019ac6df2a6d79c16ad3c32fa) [ Fix conflict with mmio vs gt argument ] Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-29drm/xe/display: Add missing HPD interrupt enabling during non-d3cold RPM resumeImre Deak1-0/+1
Atm the display HPD interrupts that got disabled during runtime suspend, are re-enabled only if d3cold is enabled. Fix things by also re-enabling the interrupts if d3cold is disabled. Cc: Rodrigo Vivi <[email protected]> Reviewed-by: Jonathan Cavitt <[email protected]> Signed-off-by: Imre Deak <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit bbc4a30de095f0349d3c278500345a1b620d495e) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-29drm/xe/display: Separate the d3cold and non-d3cold runtime PM handlingImre Deak1-5/+14
For clarity separate the d3cold and non-d3cold runtime PM handling. The only change in behavior is disabling polling later during runtime resume. This shouldn't make a difference, since the poll disabling is handled from a work, which could run at any point wrt. the runtime resume handler. The work will also require a runtime PM reference, syncing it with the resume handler. Cc: Rodrigo Vivi <[email protected]> Reviewed-by: Jonathan Cavitt <[email protected]> Signed-off-by: Imre Deak <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit a4de6beb83fc5adee788518350247c629568901e) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-29drm/xe: Remove runtime argument from display s/r functionsMaarten Lankhorst3-28/+39
The previous change ensures that pm_suspend is only called when suspending or resuming. This ensures no further bugs like those in the previous commit. Signed-off-by: Maarten Lankhorst <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Reviewed-by: Vinod Govindapillai <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit f90491d4b64e302e940133103d3d9908e70e454f) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-24drm/xe: Don't restart parallel queues multiple times on GT resetNirmoy Das1-2/+12
In case of parallel submissions multiple GuC id will point to the same exec queue and on GT reset such exec queues will get restarted multiple times which is not desirable. v2: don't use exec_queue_enabled() which could race, do the same for xe_guc_submit_stop (Matt B) Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2295 Cc: Jonathan Cavitt <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Tejas Upadhyay <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Nirmoy Das <[email protected]> (cherry picked from commit c8b0acd6d8745fd7e6450f5acc38f0227bd253b3) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-24drm/xe/ufence: Prefetch ufence addr to catch bogus addressNirmoy Das1-1/+2
access_ok() only checks for addr overflow so also try to read the addr to catch invalid addr sent from userspace. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1630 Cc: Francois Dugast <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Nirmoy Das <[email protected]> (cherry picked from commit 9408c4508483ffc60811e910a93d6425b8e63928) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-24drm/xe: Handle unreliable MMIO reads during forcewakeShuicheng Lin1-3/+9
In some cases, when the driver attempts to read an MMIO register, the hardware may return 0xFFFFFFFF. The current force wake path code treats this as a valid response, as it only checks the BIT. However, 0xFFFFFFFF should be considered an invalid value, indicating a potential issue. To address this, we should add a log entry to highlight this condition and return failure. The force wake failure log level is changed from notice to err to match the failure return value. v2 (Matt Brost): - set ret value (-EIO) to kick the error to upper layers v3 (Rodrigo): - add commit message for the log level promotion from notice to err v4: - update reviewed info Suggested-by: Alex Zuo <[email protected]> Signed-off-by: Shuicheng Lin <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Michal Wajdeczko <[email protected]> Reviewed-by: Himal Prasad Ghimiray <[email protected]> Acked-by: Badal Nilawar <[email protected]> Cc: Anshuman Gupta <[email protected]> Cc: Matt Roper <[email protected]> Cc: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]> (cherry picked from commit a9fbeabe7226a3bf90f82d0e28a02c18e3c67447) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-24drm/xe/guc/ct: Flush g2h worker in case of g2h response timeoutBadal Nilawar1-0/+18
In case if g2h worker doesn't get opportunity to within specified timeout delay then flush the g2h worker explicitly. v2: - Describe change in the comment and add TODO (Matt B/John H) - Add xe_gt_warn on fence done after G2H flush (John H) v3: - Updated the comment with root cause - Clean up xe_gt_warn message (John H) Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620 Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902 Signed-off-by: Badal Nilawar <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Matthew Auld <[email protected]> Cc: John Harrison <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Reviewed-by: Himal Prasad Ghimiray <[email protected]> Acked-by: Matthew Brost <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit e5152723380404acb8175e0777b1cea57f319a01) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-24drm/xe: Enlarge the invalidation timeout from 150 to 500Shuicheng Lin1-1/+1
There are error messages like below that are occurring during stress testing: "[ 31.004009] xe 0000:03:00.0: [drm] ERROR GT0: Global invalidation timeout". Previously it was hitting this 3 out of 1000 executions of warm reboot. After raising it to 500, 1000 warm reboot executions passed and it didn't fail. Due to the way xe_mmio_wait32() is implemented, the timeout is able to expire early when the register matches the expected value due to the wait increments starting small. So, the larger timeout value should have no effect during normal use cases. v2 (Jonathan): - rework the commit message v3 (Lucas): - add conclusive message for the fail rate and test case v4: - add suggested-by Suggested-by: Jia Yao <[email protected]> Signed-off-by: Shuicheng Lin <[email protected]> Cc: Lucas De Marchi <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Nirmoy Das <[email protected]> Reviewed-by: Jonathan Cavitt <[email protected]> Tested-by: Zongyao Bai <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Signed-off-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 2eb460ab9f4bc5b575f52568d17936da0af681d8) [ Fix conflict with gt->mmio ] Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe/bmg: improve cache flushing behaviourMatthew Auld2-4/+0
The BSpec says that EN_L3_RW_CCS_CACHE_FLUSH must be toggled on for manual global invalidation to take effect and actually flush device cache, however this also turns on flushing for things like pipecontrol, which occurs between submissions for compute/render. This sounds like massive overkill for our needs, where we already have the manual flushing on the display side with the global invalidation. Some observations on BMG: 1. Disabling l2 caching for host writes and stubbing out the driver global invalidation but keeping EN_L3_RW_CCS_CACHE_FLUSH enabled, has no impact on wb-transient-vs-display IGT, which makes sense since the pipecontrol is now flushing the device cache after the render copy. Without EN_L3_RW_CCS_CACHE_FLUSH the test then fails, which is also expected since device cache is now dirty and display engine can't see the writes. 2. Disabling EN_L3_RW_CCS_CACHE_FLUSH, but keeping the driver global invalidation also has no impact on wb-transient-vs-display. This suggests that the global invalidation still works as expected and is flushing the device cache without EN_L3_RW_CCS_CACHE_FLUSH turned on. With that drop EN_L3_RW_CCS_CACHE_FLUSH. This helps some workloads since we no longer flush the device cache between submissions as part of pipecontrol. Edit: We now also have clarification from HW side that BSpec was indeed wrong here. v2: - Rebase and update commit message. BSpec: 71718 Signed-off-by: Matthew Auld <[email protected]> Cc: Vitasta Wattal <[email protected]> Cc: Matt Roper <[email protected]> Cc: Nirmoy Das <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 67ec9f87bd6c57db1251bb2244d242f7ca5a0b6a) [ Fix conflict due to changed xe_mmio_write32() signature ] Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe/xe_sync: initialise ufence.signalledMatthew Auld1-1/+1
We can incorrectly think that the fence has signalled, if we get a non-zero value here from the kmalloc, which is quite plausible. Just use kzalloc to prevent stuff like this. Fixes: 977e5b82e090 ("drm/xe: Expose user fence from xe_sync_entry") Signed-off-by: Matthew Auld <[email protected]> Cc: Mika Kuoppala <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Nirmoy Das <[email protected]> Cc: <[email protected]> # v6.10+ Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 26f69e88dcc95fffc62ed2aea30ad7b1fdf31fdb) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe/ufence: ufence can be signaled right after wait_wokenNirmoy Das1-3/+0
do_comapre() can return success after a timedout wait_woken() which was treated as -ETIME. The loop calling wait_woken() sets correct err so there is no need to re-evaluate err. v2: Remove entire check that reevaluate err at the end(Matt) Fixes: e670f0b4ef24 ("drm/xe/uapi: Return correct error code for xe_wait_user_fence_ioctl") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1630 Cc: [email protected] # v6.8+ Cc: Bommu Krishnaiah <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Nirmoy Das <[email protected]> (cherry picked from commit ec7e6a1d527755fc3c7a3303eaa5577aac5cf6be) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe: Use bookkeep slots for external BO's in exec IOCTLMatthew Brost1-8/+4
Fix external BO's dma-resv usage in exec IOCTL using bookkeep slots rather than write slots. This leaves syncing to user space rather than the KMD blindly enforcing write semantics on every external BO. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: José Roberto de Souza <[email protected]> Cc: Kenneth Graunke <[email protected]> Cc: Paulo Zanoni <[email protected]> Reported-by: Simona Vetter <[email protected]> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2673 Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: José Roberto de Souza <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit b8b1163248759ba18509f7443a2d19b15b4c1df8) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe/query: Increase timestamp widthLucas De Marchi1-1/+5
Starting with Xe2 the timestamp is a full 64 bit counter, contrary to the 36 bit that was available before. Although 36 should be sufficient for any reasonable delta calculation (for Xe2, of about 30min), it's surprising to userspace to get something truncated. Also if the timestamp being compared to is coming from the GPU and the application is not careful enough to apply the width there, a delta calculation would be wrong. Extend it to full 64-bits starting with Xe2. v2: Expand width=64 to media gt, as it's just a wrong tagging in the spec - empirical tests show it goes beyond 36 bits and match the engines for the main gt Bspec: 60411 Cc: Szymon Morek <[email protected]> Reviewed-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Lucas De Marchi <[email protected]> (cherry picked from commit 9d559cdcb21f42188d4c3ff3b4fe42b240f4af5d) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe: Don't free job in TDRMatthew Brost1-2/+5
Freeing job in TDR is not safe as TDR can pass the run_job thread resulting in UAF. It is only safe for free job to naturally be called by the scheduler. Rather free job in TDR, add to pending list. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2811 Cc: Matthew Auld <[email protected]> Fixes: e275d61c5f3f ("drm/xe/guc: Handle timing out of signaled jobs gracefully") Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit ea2f6a77d0c40d97f4a4dc93fee4afe15d94926d) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe: Take job list lock in xe_sched_add_pending_jobMatthew Brost1-0/+2
A fragile micro optimization in xe_sched_add_pending_job relied on both the GPU scheduler being stopped and fence signaling stopped to safely add a job to the pending list without the job list lock in xe_sched_add_pending_job. Remove this optimization and just take the job list lock. Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 90521df5fc43980e4575bd8c5b1cb62afe1a9f5f) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe: fix unbalanced rpm put() with declare_wedged()Matthew Auld1-2/+2
Technically the or_reset() means we call the action on failure, however that would lead to unbalanced rpm put(). Move the get() earlier to fix this. It should be extremely unlikely to ever trigger this in practice. Fixes: 90936a0a4c54 ("drm/xe: Don't suspend device upon wedge") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit a187c1b0a800565a4db6372268692aff99df7f53) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe: fix unbalanced rpm put() with fence_fini()Matthew Auld3-23/+15
Currently we can call fence_fini() twice if something goes wrong when sending the GuC CT for the tlb request, since we signal the fence and return an error, leading to the caller also calling fini() on the error path in the case of stack version of the flow, which leads to an extra rpm put() which might later cause device to enter suspend when it shouldn't. It looks like we can just drop the fini() call since the fence signaller side will already call this for us. There are known mysterious splats with device going to sleep even with an rpm ref, and this could be one candidate. v2 (Matt B): - Prefer warning if we detect double fini() Fixes: f002702290fc ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit cfcbc0520d5055825f0647ab922b655688605183) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-16drm/xe/xe2lpg: Extend Wa_15016589081 for xe2lpgAradhya Bhatia1-0/+4
Add workaround (wa) 15016589081 which applies to Xe2_v3_LPG_MD. Xe2_v3_LPG_MD is a Lunar Lake platform with GFX version: 20.04. This wa is type: permanent, and hence is applicable on all steppings. Signed-off-by: Aradhya Bhatia <[email protected]> Reviewed-by: Tejas Upadhyay <[email protected]> Signed-off-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 8fb1da9f9bfb02f710a7f826d50781b0b030cf53) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-11Merge tag 'drm-xe-fixes-2024-10-10' of ↵Dave Airlie4-26/+33
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix error checking with xa_store() (Matthe Auld) - Fix missing freq restore on GSC load error (Vinay) - Fix wedged_mode file permission (Matt Roper) - Fix use-after-free in ct communication (Matthew Auld) Signed-off-by: Dave Airlie <[email protected]> From: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/jri65tmv3bjbhqhxs5smv45nazssxzhtwphojem4uufwtjuliy@gsdhlh6kzsdy
2024-10-11Merge tag 'drm-misc-fixes-2024-10-10' of ↵Dave Airlie2-82/+1
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes Short summary of fixes pull: fbdev-dma: - Only clean up deferred I/O if instanciated nouveau: - dmem: Fix privileged error in copy engine channel; Fix possible data leak in migrate_to_ram() - gsp: Fix coding style sched: - Avoid leaking lockdep map v3d: - Stop active perfmon before destroying it vc4: - Stop active perfmon before destroying it xe: - Drop GuC submit_wq pool Signed-off-by: Dave Airlie <[email protected]> From: Thomas Zimmermann <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-10-08drm/xe: Make wedged_mode debugfs writableMatt Roper1-1/+1
The intent of this debugfs entry is to allow modification of wedging behavior, either from IGT tests or during manual debug; it should be marked as writable to properly reflect this. In practice this hasn't caused a problem because we always access wedged_mode as root, which ignores file permissions, but it's still misleading to have the entry incorrectly marked as RO. Cc: Rodrigo Vivi <[email protected]> Fixes: 6b8ef44cc0a9 ("drm/xe: Introduce the wedged_mode debugfs") Signed-off-by: Matt Roper <[email protected]> Reviewed-by: Gustavo Sousa <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 93d93813422758f6c99289de446b19184019ef5a) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-08drm/xe: Restore GT freq on GSC load errorVinay Belgaumkar1-1/+3
As part of a Wa_22019338487, ensure that GT freq is restored even when GSC reload is not successful. Fixes: 3b1592fb7835 ("drm/xe/lnl: Apply Wa_22019338487") Signed-off-by: Vinay Belgaumkar <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]> (cherry picked from commit 491418a258322bbd7f045e36884d2849b673f23d) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-08drm/xe/guc_submit: fix xa_store() error checkingMatthew Auld1-6/+3
Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Badal Nilawar <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Badal Nilawar <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit f040327238b1a8311598c40ac94464e77fff368c) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-08drm/xe/ct: fix xa_store() error checkingMatthew Auld1-15/+8
Looks like we are meant to use xa_err() to extract the error encoded in the ptr. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Badal Nilawar <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Badal Nilawar <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 1aa4b7864707886fa40d959483591f3d3937fa28) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-08drm/xe/ct: prevent UAF in send_recv()Matthew Auld1-3/+18
Ensure we serialize with completion side to prevent UAF with fence going out of scope on the stack, since we have no clue if it will fire after the timeout before we can erase from the xa. Also we have some dependent loads and stores for which we need the correct ordering, and we lack the needed barriers. Fix this by grabbing the ct->lock after the wait, which is also held by the completion side. v2 (Badal): - Also print done after acquiring the lock and seeing timeout. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Badal Nilawar <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Badal Nilawar <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 52789ce35c55ccd30c4b67b9cc5b2af55e0122ea) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe: Fix memory leak when aborting bindsMatthew Brost1-1/+1
Make sure to call xe_pt_update_ops_fini in xe_pt_update_ops_abort to free any memory the bind allocated. Caught by kmemleak when running Vulkan CTS tests on LNL. The leak seems to happen only when there's some kind of failure happening, like the lack of memory. Example output: unreferenced object 0xffff9120bdf62000 (size 8192): comm "deqp-vk", pid 115008, jiffies 4310295728 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 1b 05 f9 28 01 00 00 40 ...........(...@ 00 00 00 00 00 00 00 00 1b 15 f9 28 01 00 00 40 ...........(...@ backtrace (crc 7a56be79): [<ffffffff86dd81f0>] __kmalloc_cache_noprof+0x310/0x3d0 [<ffffffffc08e8211>] xe_pt_new_shared.constprop.0+0x81/0xb0 [xe] [<ffffffffc08e8309>] xe_pt_insert_entry+0xb9/0x140 [xe] [<ffffffffc08eab6d>] xe_pt_stage_bind_entry+0x12d/0x5b0 [xe] [<ffffffffc08ecbca>] xe_pt_walk_range+0xea/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08e9eff>] xe_pt_stage_bind.constprop.0+0x25f/0x580 [xe] [<ffffffffc08eb21a>] bind_op_prepare+0xea/0x6e0 [xe] [<ffffffffc08ebab8>] xe_pt_update_ops_prepare+0x1c8/0x440 [xe] [<ffffffffc08ffbf3>] ops_execute+0x143/0x850 [xe] [<ffffffffc0900b64>] vm_bind_ioctl_ops_execute+0x244/0x800 [xe] [<ffffffffc0906467>] xe_vm_bind_ioctl+0x1877/0x2370 [xe] [<ffffffffc05e92b3>] drm_ioctl_kernel+0xb3/0x110 [drm] unreferenced object 0xffff9120bdf72000 (size 8192): comm "deqp-vk", pid 115008, jiffies 4310295728 hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk backtrace (crc 23b2f0b5): [<ffffffff86dd81f0>] __kmalloc_cache_noprof+0x310/0x3d0 [<ffffffffc08e8211>] xe_pt_new_shared.constprop.0+0x81/0xb0 [xe] [<ffffffffc08e8453>] xe_pt_stage_unbind_post_descend+0xb3/0x150 [xe] [<ffffffffc08ecd26>] xe_pt_walk_range+0x246/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08eccea>] xe_pt_walk_range+0x20a/0x280 [xe] [<ffffffffc08ece31>] xe_pt_walk_shared+0xc1/0x110 [xe] [<ffffffffc08e7b2a>] xe_pt_stage_unbind+0x9a/0xd0 [xe] [<ffffffffc08e913d>] unbind_op_prepare+0xdd/0x270 [xe] [<ffffffffc08eb9f6>] xe_pt_update_ops_prepare+0x106/0x440 [xe] [<ffffffffc08ffbf3>] ops_execute+0x143/0x850 [xe] [<ffffffffc0900b64>] vm_bind_ioctl_ops_execute+0x244/0x800 [xe] [<ffffffffc0906467>] xe_vm_bind_ioctl+0x1877/0x2370 [xe] [<ffffffffc05e92b3>] drm_ioctl_kernel+0xb3/0x110 [drm] [<ffffffffc05e95a0>] drm_ioctl+0x280/0x4e0 [drm] Reported-by: Paulo Zanoni <[email protected]> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2877 Fixes: a708f6501c69 ("drm/xe: Update PT layer with better error handling") Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Paulo Zanoni <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 63e0695597a044c96bf369e4d8ba031291449d95) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe: Prevent null pointer access in xe_migrate_copyZhanjun Dong1-2/+2
xe_migrate_copy designed to copy content of TTM resources. When source resource is null, it will trigger a NULL pointer dereference in xe_migrate_copy. To avoid this situation, update lacks source flag to true for this case, the flag will trigger xe_migrate_clear rather than xe_migrate_copy. Issue trace: <7> [317.089847] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 14, sizes: 4194304 & 4194304 <7> [317.089945] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 15, sizes: 4194304 & 4194304 <1> [317.128055] BUG: kernel NULL pointer dereference, address: 0000000000000010 <1> [317.128064] #PF: supervisor read access in kernel mode <1> [317.128066] #PF: error_code(0x0000) - not-present page <6> [317.128069] PGD 0 P4D 0 <4> [317.128071] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI <4> [317.128074] CPU: 1 UID: 0 PID: 1440 Comm: kunit_try_catch Tainted: G U N 6.11.0-rc7-xe #1 <4> [317.128078] Tainted: [U]=USER, [N]=TEST <4> [317.128080] Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3221.D80.2407291239 07/29/2024 <4> [317.128082] RIP: 0010:xe_migrate_copy+0x66/0x13e0 [xe] <4> [317.128158] Code: 00 00 48 89 8d e0 fe ff ff 48 8b 40 10 4c 89 85 c8 fe ff ff 44 88 8d bd fe ff ff 65 48 8b 3c 25 28 00 00 00 48 89 7d d0 31 ff <8b> 79 10 48 89 85 a0 fe ff ff 48 8b 00 48 89 b5 d8 fe ff ff 83 ff <4> [317.128162] RSP: 0018:ffffc9000167f9f0 EFLAGS: 00010246 <4> [317.128164] RAX: ffff8881120d8028 RBX: ffff88814d070428 RCX: 0000000000000000 <4> [317.128166] RDX: ffff88813cb99c00 RSI: 0000000004000000 RDI: 0000000000000000 <4> [317.128168] RBP: ffffc9000167fbb8 R08: ffff88814e7b1f08 R09: 0000000000000001 <4> [317.128170] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88814e7b1f08 <4> [317.128172] R13: ffff88814e7b1f08 R14: ffff88813cb99c00 R15: 0000000000000001 <4> [317.128174] FS: 0000000000000000(0000) GS:ffff88846f280000(0000) knlGS:0000000000000000 <4> [317.128176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [317.128178] CR2: 0000000000000010 CR3: 000000011f676004 CR4: 0000000000770ef0 <4> [317.128180] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4> [317.128182] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 <4> [317.128184] PKRU: 55555554 <4> [317.128185] Call Trace: <4> [317.128187] <TASK> <4> [317.128189] ? show_regs+0x67/0x70 <4> [317.128194] ? __die_body+0x20/0x70 <4> [317.128196] ? __die+0x2b/0x40 <4> [317.128198] ? page_fault_oops+0x15f/0x4e0 <4> [317.128203] ? do_user_addr_fault+0x3fb/0x970 <4> [317.128205] ? lock_acquire+0xc7/0x2e0 <4> [317.128209] ? exc_page_fault+0x87/0x2b0 <4> [317.128212] ? asm_exc_page_fault+0x27/0x30 <4> [317.128216] ? xe_migrate_copy+0x66/0x13e0 [xe] <4> [317.128263] ? __lock_acquire+0xb9d/0x26f0 <4> [317.128265] ? __lock_acquire+0xb9d/0x26f0 <4> [317.128267] ? sg_free_append_table+0x20/0x80 <4> [317.128271] ? lock_acquire+0xc7/0x2e0 <4> [317.128273] ? mark_held_locks+0x4d/0x80 <4> [317.128275] ? trace_hardirqs_on+0x1e/0xd0 <4> [317.128278] ? _raw_spin_unlock_irqrestore+0x31/0x60 <4> [317.128281] ? __pm_runtime_resume+0x60/0xa0 <4> [317.128284] xe_bo_move+0x682/0xc50 [xe] <4> [317.128315] ? lock_is_held_type+0xaa/0x120 <4> [317.128318] ttm_bo_handle_move_mem+0xe5/0x1a0 [ttm] <4> [317.128324] ttm_bo_validate+0xd1/0x1a0 [ttm] <4> [317.128328] shrink_test_run_device+0x721/0xc10 [xe] <4> [317.128360] ? find_held_lock+0x31/0x90 <4> [317.128363] ? lock_release+0xd1/0x2a0 <4> [317.128365] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10 [kunit] <4> [317.128370] xe_bo_shrink_kunit+0x11/0x20 [xe] <4> [317.128397] kunit_try_run_case+0x6e/0x150 [kunit] <4> [317.128400] ? trace_hardirqs_on+0x1e/0xd0 <4> [317.128402] ? _raw_spin_unlock_irqrestore+0x31/0x60 <4> [317.128404] kunit_generic_run_threadfn_adapter+0x1e/0x40 [kunit] <4> [317.128407] kthread+0xf5/0x130 <4> [317.128410] ? __pfx_kthread+0x10/0x10 <4> [317.128412] ret_from_fork+0x39/0x60 <4> [317.128415] ? __pfx_kthread+0x10/0x10 <4> [317.128416] ret_from_fork_asm+0x1a/0x30 <4> [317.128420] </TASK> Fixes: 266c85885263 ("drm/xe/xe2: Handle flat ccs move for igfx.") Signed-off-by: Zhanjun Dong <[email protected]> Reviewed-by: Thomas Hellström <[email protected]> Signed-off-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 59a1c9c7e1d02b43b415ea92627ce095b7c79e47) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe/oa: Don't reset OAC_CONTEXT_ENABLE on OA stream closeJosé Roberto de Souza1-6/+3
Mesa testing on Xe2+ revealed that when OA metrics are collected for an exec_queue, after the OA stream is closed, future batch buffers submitted on that exec_queue do not complete. Not resetting OAC_CONTEXT_ENABLE on OA stream close resolves these hangs and should not have any adverse effects. v2: Make the change that we don't reset the bit clearer (Ashutosh) Also make the same fix for OAC as OAR (Ashutosh) Bspec: 60314 Fixes: 2f4a730fcd2d ("drm/xe/oa: Add OAR support") Fixes: 14e077f8006d ("drm/xe/oa: Add OAC support") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2821 Signed-off-by: José Roberto de Souza <[email protected]> Signed-off-by: Ashutosh Dixit <[email protected]> Cc: [email protected] Reviewed-by: Ashutosh Dixit <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 0c8650b09a365f4a31fca1d1d1e9d99c56071128) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe/queue: move xa_alloc to prevent UAFMatthew Auld1-1/+3
Evil user can guess the next id of the queue before the ioctl completes and then call queue destroy ioctl to trigger UAF since create ioctl is still referencing the same queue. Move the xa_alloc all the way to the end to prevent this. v2: - Rebase Fixes: 2149ded63079 ("drm/xe: Fix use after free when client stats are captured") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 16536582ddbebdbdf9e1d7af321bbba2bf955a87) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe/vm: move xa_alloc to prevent UAFMatthew Auld1-8/+8
Evil user can guess the next id of the vm before the ioctl completes and then call vm destroy ioctl to trigger UAF since create ioctl is still referencing the same vm. Move the xa_alloc all the way to the end to prevent this. v2: - Rebase Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <[email protected]> Cc: Matthew Brost <[email protected]> Cc: <[email protected]> # v6.8+ Reviewed-by: Nirmoy Das <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit dcfd3971327f3ee92765154baebbaece833d3ca9) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe: Clean up VM / exec queue file lock usage.Matthew Brost5-12/+19
Both the VM / exec queue file lock protect the lookup and reference to the object, nothing more. These locks are not intended anything else underneath them. XA have their own locking too, so no need to take the VM / exec queue file lock aside from when doing a lookup and reference get. Add some kernel doc to make this clear and cleanup a few typos too. Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit fe4f5d4b661666a45b48fe7f95443f8fefc09c8c) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe: Resume TDR after GT resetMatthew Brost3-0/+8
Not starting the TDR after GT reset on exec queue which have been restarted can lead to jobs being able to be run forever. Fix this by restarting the TDR. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <[email protected]> Reviewed-by: Nirmoy Das <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 8ec5a4e5ce97d6ee9f5eb5b4ce4cfc831976fdec) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe/xe2: Add performance tuning for L3 cache flushingGustavo Sousa2-0/+13
A recommended performance tuning for LNL related to L3 cache flushing was recently introduced in Bspec. Implement it. Unlike the other existing tuning settings, we limit this one for LNL only, since there is no info about whether this would be applicable to other platforms yet. In the future we can come back and use IP version ranges if applicable. v2: - Fix reference to Bspec. (Sai Teja, Tejas) - Use correct register name for "Tuning: L3 RW flush all Cache". (Sai Teja) - Use SCRATCH3_LBCF (with the underscore) for better readability. v3: - Limit setting to LNL only. (Matt) Bspec: 72161 Cc: Sai Teja Pottumuttu <[email protected]> Cc: Tejas Upadhyay <[email protected]> Cc: Matt Roper <[email protected]> Signed-off-by: Gustavo Sousa <[email protected]> Reviewed-by: Matt Roper <[email protected]> Reviewed-by: Tejas Upadhyay <[email protected]> Signed-off-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 876253165f3eaaacacb8c8bed16a9df4b6081479) Signed-off-by: Lucas De Marchi <[email protected]>
2024-10-03drm/xe/xe2: Extend performance tuning to media GTGustavo Sousa2-0/+26
With exception of "Tuning: L3 cache - media", we are currently applying recommended performance tuning settings only for the primary GT. Let's also implement them for the media GT when applicable. According to our spec, media GT registers CCCHKNREG1 and L3SQCREG* exist only in Xe2_LPM and their offsets do not match their primary GT counterparts. Furthermore, the range where CCCHKNREG1 belongs is not listed as a multicast range on the media GT. As such, we need to have Xe2_LPM-specific definitions for those registers and apply the setting only for that specific IP. Both Xe2_HPM and Xe2_LPM contain STATELESS_COMPRESSION_CTRL and the offset on the media GT matches the one on the primary one. So we can simply have a copy of "Tuning: Stateless compression control" for the media GT. v2: - Fix implementation with respect to multicast vs non-multicast registers. (Matt) - Add missing XE2LPM_CCCHKNREG1 on second action of "Tuning: Compression Overfetch - media". v3: - STATELESS_COMPRESSION_CTRL on Xe2_HPM is also a multicast register, do not define a XE2HPM_STATELESS_COMPRESSION_CTRL register. (Tejas) Bspec: 72161 Cc: Matt Roper <[email protected]> Reviewed-by: Tejas Upadhyay <[email protected]> Signed-off-by: Gustavo Sousa <[email protected]> Signed-off-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit e1f813947ccf2326cfda4558b7d31430d7860c4b) Signed-off-by: Lucas De Marchi <[email protected]>