Compiling AMD GPU drivers displays two warnings:
drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or member 'file' not described in 'drm_sched_job_add_syncobj_dependency'
drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function parameter 'file_private' description in 'drm_sched_job_add_syncobj_dependency'
Get rid of them by renaming the parameter in the function's kernel-doc description.
Signed-off-by: Caio Novais <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v6.4-rc1:
Note: Only changes since pull request from 2023-02-23 are included here.
UAPI Changes:
- Convert rockchip bindings to YAML.
- Constify kobj_type structure in dma-buf.
- FBDEV cmdline parser fixes, and other small fbdev fixes for mode
parsing.
Cross-subsystem Changes:
- Add Neil Armstrong as linaro maintainer.
- Actually signal the private stub dma-fence.
Core Changes:
- Add function for adding syncobj dep to sched_job and use it in panfrost, v3d.
- Improve DisplayID 2.0 topology parsing and EDID parsing in general.
- Add a gem eviction function and callback for generic GEM shrinker
purposes.
- Prepare to convert the shmem helper to use the GEM reservation lock instead of
its own locking. (The actual commit itself got reverted for now.)
- Move the suballocator from radeon and amdgpu drivers to core in preparation
for Xe.
- Assorted small fixes and documentation.
- Fixes to HPD polling.
- Assorted small fixes in simpledrm, bridge, accel, shmem-helper,
and the selftest of format-helper.
- Remove dummy resource when ttm bo is created, and during pipelined
gutting. Fix all drivers to accept a NULL ttm_bo->resource.
- Handle pinned BO moving prevention in ttm core.
- Set drm panel-bridge orientation before connector is registered.
- Remove dumb_destroy callback.
- Add documentation to the GEM_CLOSE, PRIME_HANDLE_TO_FD, PRIME_FD_TO_HANDLE, and GETFB2 ioctls.
- Add atomic enable_plane callback, use it in ast, mgag200, tidss.
Driver Changes:
- Use drm_gem_objects_lookup in vc4.
- Assorted small fixes to virtio, ast, bridge/tc358762, meson, nouveau.
- Allow virtio KMS to be disabled and compiled out.
- Add Radxa 8/10HD, Samsung AMS495QA01 panels.
- Fix ivpu compiler errors.
- Assorted fixes to drm/panel, malidp, rockchip, ivpu, amdgpu, vgem,
nouveau, vc4.
- Assorted cleanups, simplifications and fixes to vmwgfx.
Signed-off-by: Dave Airlie <[email protected]>
From: Maarten Lankhorst <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Backmerging to get v6.3-rc1 and sync with the other DRM trees.
Signed-off-by: Thomas Zimmermann <[email protected]>
|
|
In order to add a syncobj's fence as a dependency to a job, it is
necessary to call drm_syncobj_find_fence() to find the fence and then
add the dependency with drm_sched_job_add_dependency(). So, wrap these
steps in one single function, drm_sched_job_add_syncobj_dependency().
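For illustration, such a wrapper boils down to roughly the following (a minimal sketch, not a verbatim copy of the upstream helper):

  int drm_sched_job_add_syncobj_dependency(struct drm_sched_job *job,
                                           struct drm_file *file,
                                           u32 handle, u32 point)
  {
          struct dma_fence *fence;
          int ret;

          /* look up the fence currently backing the syncobj (the timeline
           * point is optional and may be 0) */
          ret = drm_syncobj_find_fence(file, handle, point, 0, &fence);
          if (ret)
                  return ret;

          /* hand the fence reference over to the job's dependency tracking */
          return drm_sched_job_add_dependency(job, fence);
  }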
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Maíra Canal <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
drm-next
This time we've added support for reporting of GPU load via the common
fdinfo format, as already supported by multiple other drivers. Improved
diagnostic messages for MMU faults. And finally added experimental
support for driving the VeriSilicon NPU cores, which are very close
relatives to the GPU designs, so close in fact that they can run the
same compute instruction set, but with a big NN-fabric/matrix/tensor
execution array glued to the side.
Signed-off-by: Dave Airlie <[email protected]>
From: Lucas Stach <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Track the accumulated time that jobs from this entity were active
on the GPU. This allows drivers using the scheduler to trivially
implement DRM fdinfo reporting when the hardware doesn't provide more
specific information than signalling job completion anyway.
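As a hedged sketch, a driver could then report that time in the common fdinfo key format along these lines ('my_file', its entity array and the engine name are illustrative, not from this patch):

  static void my_print_gpu_time(struct drm_printer *p, struct my_file *mfile)
  {
          u64 busy_ns = 0;
          int i;

          /* elapsed_ns is the scheduler-maintained accumulated GPU time */
          for (i = 0; i < MY_NUM_ENGINES; i++)
                  busy_ns += mfile->entities[i].elapsed_ns;

          drm_printf(p, "drm-engine-gpu:\t%llu ns\n", busy_ns);
  }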
[Bagas: Append missing colon to @elapsed_ns]
Signed-off-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Lucas Stach <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
|
|
This interface is not working as it should.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Backmerging to get v6.1-rc6 into drm-misc-next.
Signed-off-by: Thomas Zimmermann <[email protected]>
|
|
drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy()
while a userspace process is exiting or being killed. It's invoked first
when the sched entity is flushed and a second time when the entity is released.
This causes a lockup within wait_for_completion(entity_idle) due to how the
completion API works.
Calling wait_for_completion() more times than complete() was invoked is an
error condition that causes a lockup because a completion internally uses a
counter for complete/wait calls. complete_all() must be used instead
in such cases.
This patch fixes a lockup of the Panfrost driver that is reproducible by killing
any application in the middle of a 3D drawing operation.
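The underlying semantics can be shown with a minimal sketch of the plain completion API (not the scheduler code itself):

  static struct completion entity_idle;

  static void completion_example(void)
  {
          init_completion(&entity_idle);

          complete(&entity_idle);                 /* done count becomes 1 */
          wait_for_completion(&entity_idle);      /* consumes it, returns */
          wait_for_completion(&entity_idle);      /* count is 0 again: hangs */

          /* complete_all() instead marks the completion permanently done, so
           * any number of later wait_for_completion() calls return at once. */
  }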
Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
Signed-off-by: Dmitry Osipenko <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.2-2022-11-18:
amdgpu:
- SR-IOV fixes
- Clean up DC checks
- DCN 3.2.x fixes
- DCN 3.1.x fixes
- Don't enable degamma on asics which don't support it
- IP discovery fixes
- BACO fixes
- Fix vbios allocation handling when vkms is enabled
- Drop buggy tdr advanced mode GPU reset handling
- Fix the build when DCN is not set in kconfig
- MST DSC fixes
- Userptr fixes
- FRU and RAS EEPROM fixes
- VCN 4.x RAS support
- Aldrebaran CU occupancy reporting fix
- PSP ring cleanup
amdkfd:
- Memory limit fix
- Enable cooperative launch on gfx 10.3
amd-drm-next-6.2-2022-11-11:
amdgpu:
- SMU 13.x updates
- GPUVM TLB race fix
- DCN 3.1.4 updates
- DCN 3.2.x updates
- PSR fixes
- Kerneldoc fix
- Vega10 fan fix
- GPUVM locking fixes in error pathes
- BACO fix for Beige Goby
- EEPROM I2C address cleanup
- GFXOFF fix
- Fix DC memory leak in error pathes
- Flexible array updates
- Mtype fix for GPUVM PTEs
- Move Kconfig into amdgpu directory
- SR-IOV updates
- Fix possible memory leak in CS IOCTL error path
amdkfd:
- Fix possible memory overrun
- CRIU fixes
radeon:
- ACPI ref count fix
- HDA audio notifier support
- Move Kconfig into radeon directory
UAPI:
- Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI.
These are used internally in the driver to align location based memory coherency
requirements from memory allocated in the KFD with how we manage GPUVM PTEs. They
are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now.
They are just used internally in the kernel driver for now for existing KFD memory
allocations. So a change to the UAPI header, but no functional change in the UAPI.
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Dave Airlie <[email protected]>
|
|
This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75.
This feature basically re-submits one job after another to
figure out which one was the one causing a hang.
This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.
For testing and debugging environments we should rather disable
recovery altogether to be able to inspect the state with a hardware
debugger.
In addition to that, the software implementation is clearly buggy and causes
reference count issues for the hardware fence.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Not used any more.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Add a new function to update job dependencies from a resv obj.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
The current default Round-Robin GPU scheduling can result in starvation
of entities which have a large number of jobs, in favour of entities which
have a very small number of jobs (single digit).
This can be illustrated in the following diagram, where jobs are
alphabetized to show their chronological order of arrival, where job A is
the oldest, B is the second oldest, and so on, to J, the most recent job to
arrive.
---> entities
j | H-F-----A--E--I--
o | --G-----B-----J--
b | --------C--------
s\/ --------D--------
WLOG, assuming all jobs are "ready", then a R-R scheduling will execute them
in the following order (a slice off of the top of the entities' list),
H, F, A, E, I, G, B, J, C, D.
However, to mitigate job starvation, we'd rather execute C and D before E,
and so on, given, of course, that they're all ready to be executed.
So, if all jobs are ready at this instant, the order of execution for this
and the next 9 instances of picking the next job to execute, should really
be,
A, B, C, D, E, F, G, H, I, J,
which is their chronological order. The only reason for this order to be
broken, is if an older job is not yet ready, but a younger job is ready, at
an instant of picking a new job to execute. For instance if job C wasn't
ready at time 2, but job D was ready, then we'd pick job D, like this:
0 +1 +2 ...
A, B, D, ...
And from then on, C would be preferred before all other jobs, if it is ready
at the time when a new job for execution is picked. So, if C became ready
two steps later, the execution order would look like this:
......0 +1 +2 ...
A, B, D, E, C, F, G, H, I, J
This is what the FIFO GPU scheduling algorithm achieves. It uses a
Red-Black tree to keep jobs sorted in chronological order, where picking
the oldest job is O(1) (we use the "cached" structure), and balancing the
tree is O(log n). IOW, it picks the *oldest ready* job to execute now.
The implementation is already in the kernel, and this commit only changes
which GPU scheduling algorithm is used by default.
This was tested and achieves about 1% better performance than the
Round-Robin algorithm.
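A minimal sketch of the "oldest ready job first" pick over a cached red-black tree (illustrative only; the in-tree code is drm_sched_rq_select_entity_fifo()):

  static struct drm_sched_entity *
  pick_oldest_ready_entity(struct rb_root_cached *root)
  {
          struct rb_node *rb;

          /* entities are ordered by the timestamp of their oldest pending job;
           * rb_first_cached() is O(1), tree rebalancing is O(log n) */
          for (rb = rb_first_cached(root); rb; rb = rb_next(rb)) {
                  struct drm_sched_entity *entity =
                          rb_entry(rb, struct drm_sched_entity, rb_tree_node);

                  if (drm_sched_entity_is_ready(entity))
                          return entity;  /* oldest entity with a ready job */
          }

          return NULL;
  }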
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>
|
|
Let's kick-off this release cycle.
Signed-off-by: Maxime Ripard <[email protected]>
|
|
This reverts commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86.
This is causing instability on Linus' desktop, and I'm seeing
oops with VK CTS runs.
netconsole got me the following oops:
[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x0000) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product
Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
00 f0
[ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
[ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
[ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
[ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
[ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
[ 1234.778912] FS: 00007f35e7008580(0000) GS:ffff95428ebc0000(0000)
knlGS:0000000000000000
[ 1234.778916] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
[ 1234.778924] Call Trace:
[ 1234.778981] <IRQ>
[ 1234.778989] dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999] dma_fence_signal+0x2c/0x50
[ 1234.779005] amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234] sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395] amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609] amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783] amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940] __handle_irq_event_percpu+0x46/0x190
[ 1234.779946] handle_irq_event+0x34/0x70
[ 1234.779949] handle_edge_irq+0x9f/0x240
[ 1234.779954] __common_interrupt+0x66/0x100
[ 1234.779960] common_interrupt+0xa0/0xc0
[ 1234.779965] </IRQ>
[ 1234.779968] <TASK>
[ 1234.779971] asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
83 ea
[ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202
Revert it for now and figure it out later.
Signed-off-by: Dave Airlie <[email protected]>
|
|
Otherwise we would crash if the job is not resubmitted.
v2: fix second usage of s_fence->parent as well.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
When many entities are competing for the same run queue
on the same scheduler, we observe unusually long wait
times and some jobs get starved. This has been observed with GPUVis.
The issue is due to the Round Robin policy used by schedulers
to pick up the next entity's job queue for execution. Under the stress
of many entities and long job queues within an entity, some
jobs could be stuck for a very long time in their entity's
queue before being popped from the queue and executed,
while for other entities with smaller job queues a job
might execute earlier even though that job arrived later
than the job in the long queue.
Fix:
Add a FIFO selection policy for entities in the run queue: choose the next
entity on the run queue in such an order that if a job on one entity arrived
earlier than a job on another entity, the first job will start
executing earlier, regardless of the length of the entity's job
queue.
v2:
Switch to rb tree structure for entities based on TS of
oldest job waiting in the job queue of an entity. Improves next
entity extraction to O(1). Entity TS update
O(log N) where N is the number of entities in the run-queue
Drop default option in module control parameter.
v3:
Various cosmetic fixes and minor refactoring of the fifo update function. (Luben)
v4:
Switch drm_sched_rq_select_entity_fifo to in order search (Luben)
v5: Fix up drm_sched_rq_select_entity_fifo loop (Luben)
v6: Add missing drm_sched_rq_remove_fifo_locked
v7: Fix ts sampling bug and more cosmetic stuff (Luben)
v8: Fix module parameter string (Luben)
Cc: Luben Tuikov <[email protected]>
Cc: Christian König <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Cc: AMD Graphics <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yunxiang Li (Teddy) <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Use the parent fence instead of the finished fence
to get the job status. This change avoids a GPU
scheduler timeout error which can cause a GPU reset.
Signed-off-by: Arvind Yadav <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>
|
|
Fix kernel-doc warnings in gpu_scheduler.h and sched_main.c.
Quashes these warnings:
include/drm/gpu_scheduler.h:332: warning: missing initial short description on line:
* struct drm_sched_backend_ops
include/drm/gpu_scheduler.h:412: warning: missing initial short description on line:
* struct drm_gpu_scheduler
include/drm/gpu_scheduler.h:461: warning: Function parameter or member 'dev' not described in 'drm_gpu_scheduler'
drivers/gpu/drm/scheduler/sched_main.c:201: warning: missing initial short description on line:
* drm_sched_dependency_optimized
drivers/gpu/drm/scheduler/sched_main.c:995: warning: Function parameter or member 'dev' not described in 'drm_sched_init'
Fixes: 2d33948e4e00 ("drm/scheduler: add documentation")
Fixes: 8ab62eda177b ("drm/sched: Add device pointer to drm_gpu_scheduler")
Fixes: 542cff7893a3 ("drm/sched: Avoid lockdep spalt on killing a processes")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Nayan Deshmukh <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Jiawei Gu <[email protected]>
Cc: [email protected]
Acked-by: Christian König <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
We already discussed that the call to drm_sched_entity_select_rq() needs
to move to drm_sched_job_arm() to be able to set a new scheduler list
between _init() and _arm(). This was just not applied for some reason.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Problem:
This patch caused a negative refcount as described in [1]. In that case the
parent fence did not signal by the time of drm_sched_stop and hence was kept
in the pending list. The assumption was that it would not signal, and so the
fence was put to account for the s_fence->parent refcount. But for amdgpu,
which has an embedded HW fence (always the same parent fence),
drm_sched_fence_release_scheduled was always called and would
still drop the count for the parent fence once more. For jobs that
never signaled, this imbalance was masked by a refcount bug in
amdgpu_fence_driver_clear_job_fences that would not drop the
refcount on the fences that were removed from the fence driver's
fences array (against the previous insertion into the array by the
get in amdgpu_fence_emit).
Fix:
Revert this patch and, by setting s_job->s_fence->parent to NULL
as before, prevent the extra refcount drop in amdgpu when
drm_sched_fence_release_scheduled is called on job release.
Also align the behaviour of drm_sched_resubmit_jobs_ext with that of
drm_sched_main when submitting jobs: take a refcount for the
new parent fence pointer and drop the refcount of the original kref_init
for the new HW fence creation (or the fake new HW fence in amdgpu - see the next patch).
[1] - https://lore.kernel.org/all/[email protected]/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yiqing Yao <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
This change adds the dma_resv_usage enum and allows us to specify why a
dma_resv object is queried for its containing fences.
In addition to that, a dma_resv_usage_rw() helper function is added to aid
retrieving the fences for a read or write userspace submission.
This is then deployed to the different query functions of the dma_resv
object and all of their users. Where the write parameter was previously
true we now use DMA_RESV_USAGE_WRITE, and DMA_RESV_USAGE_READ otherwise.
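As a sketch of the helper in use (the surrounding function and handle_dependency() are hypothetical):

  static int collect_implicit_fences(struct drm_gem_object *obj, bool write)
  {
          struct dma_resv_iter cursor;
          struct dma_fence *fence;
          int ret;

          /* a new write must wait for existing readers and writers, a new read
           * only for existing writers; dma_resv_usage_rw() encodes exactly that */
          dma_resv_for_each_fence(&cursor, obj->resv,
                                  dma_resv_usage_rw(write), fence) {
                  ret = handle_dependency(fence); /* hypothetical driver hook */
                  if (ret)
                          return ret;
          }

          return 0;
  }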
v2: add KERNEL/OTHER in separate patch
v3: some kerneldoc suggestions by Daniel
v4: some more kerneldoc suggestions by Daniel, fix missing cases lost in
the rebase pointed out by Bas.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
You really need to hold the reservation here or all kinds of funny
things can happen between grabbing the dependencies and inserting the
new fences.
v2: Fix commit summary (Christian)
Acked-by: Melissa Wen <[email protected]>
Reviewed-by: "Christian König" <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Add a device pointer so the scheduler's printing can use
DRM_DEV_ERROR() instead, which makes life easier in multi-GPU
scenarios.
v2: amend all calls of drm_sched_init()
v3: fill dev pointer for all drm_sched_init() calls
Signed-off-by: Jiawei Gu <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
drm_sched_job_add_dependency() could drop the last ref, so we need to do
the dma_fence_get() first.
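The corrected loop then looks roughly like this (a condensed sketch; at the time the iterator still took a bool 'write' rather than an enum dma_resv_usage):

  int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
                                              struct drm_gem_object *obj,
                                              bool write)
  {
          struct dma_resv_iter cursor;
          struct dma_fence *fence;
          int ret;

          dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
                  /* take our own reference before handing the fence over,
                   * since the scheduler may drop the last reference */
                  dma_fence_get(fence);

                  ret = drm_sched_job_add_dependency(job, fence);
                  if (ret) {
                          dma_fence_put(fence);
                          return ret;
                  }
          }

          return 0;
  }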
Cc: Christian König <[email protected]>
Fixes: 9c2ba265352a ("drm/scheduler: use new iterator in drm_sched_job_add_implicit_dependencies v2")
Signed-off-by: Rob Clark <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Tested-by: Amit Pundir <[email protected]>
Signed-off-by: Christian König <[email protected]>
|
|
Trivial fix since we now need to grab a reference to the fence we have
added. Previously the dma_resv functions were doing that for us.
Signed-off-by: Christian König <[email protected]>
Fixes: 9c2ba265352a ("drm/scheduler: use new iterator in drm_sched_job_add_implicit_dependencies v2")
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Daniel Vetter <[email protected]>
Reported-by: Nicolas Frattaroli <[email protected]>
References: https://lore.kernel.org/dri-devel/2023306.UmlnhvANQh@archbook/
Tested-by: Nicolas Frattaroli <[email protected]>
Tested-by: Yassine Oudjana <[email protected]>
|
|
Simplifying the code a bit.
v2: use dma_resv_for_each_fence
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
issue:
in cleanup_job the cancel_delayed_work will cancel a TO timer
even if its corresponding job is still running.
fix:
do not cancel the timer in cleanup_job; instead do the cancelling
only when the heading job is signaled, and if there is a "next" job
we start the timeout again.
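A hedged sketch of the described flow (field names follow struct drm_gpu_scheduler, but this is not the exact upstream code):

  static struct drm_sched_job *
  sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
  {
          struct drm_sched_job *job;

          spin_lock(&sched->job_list_lock);
          job = list_first_entry_or_null(&sched->pending_list,
                                         struct drm_sched_job, list);
          if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                  /* cancel the TO timer only once the heading job signaled */
                  cancel_delayed_work(&sched->work_tdr);
                  list_del_init(&job->list);
                  /* start the timeout again if there is a next job */
                  if (!list_empty(&sched->pending_list))
                          mod_delayed_work(system_wq, &sched->work_tdr,
                                           sched->timeout);
          } else {
                  job = NULL;
          }
          spin_unlock(&sched->job_list_lock);

          return job;
  }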
v2:
further cleanup the logic, and do the TDR timer cancelling if the signaled job
is the last one in its scheduler.
v3:
change the issue description
remove the cancel_delayed_work in the begining of the cleanup_job
recover the implement of drm_sched_job_begin.
v4:
remove the kthread_should_park() checking in cleanup_job routine,
we should cleanup the signaled job asap
TODO:
1) introduce pause/resume scheduler in job_timeout to serialize the handling
of the scheduler and job_timeout.
2) drop the bad job's del and insert in the scheduler due to the above serialization
(no race issue anymore with the serialization)
Tested-by: jingwen <jingwen.chen@@amd.com>
Signed-off-by: Monk Liu <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
drm_sched_job_cleanup() will pass an uninitialized fence to
drm_sched_fence_free(), which will cause to_drm_sched_fence() to return
a NULL fence object, causing a NULL pointer deref when this NULL object
is passed to kmem_cache_free().
Let's create a new drm_sched_fence_free() function that takes a
drm_sched_fence pointer and suffix the old function with _rcu. While at
it, complain if drm_sched_fence_free() is passed an initialized fence
or if drm_sched_fence_free_rcu() is passed an uninitialized fence.
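The resulting split looks roughly like this (a sketch; sched_fence_slab is the scheduler's fence slab cache):

  void drm_sched_fence_free(struct drm_sched_fence *fence)
  {
          /* only legal for fences that were never initialized */
          if (!WARN_ON_ONCE(fence->finished.ops))
                  kmem_cache_free(sched_fence_slab, fence);
  }

  static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
  {
          struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
          struct drm_sched_fence *fence = to_drm_sched_fence(f);

          /* initialized fences only; to_drm_sched_fence() returns NULL otherwise */
          if (!WARN_ON_ONCE(!fence))
                  kmem_cache_free(sched_fence_slab, fence);
  }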
Fixes: dbe48d030b28 ("drm/sched: Split drm_sched_job_init")
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Instead of just a callback we can just glue in the gem helpers that
panfrost, v3d and lima currently use. There really aren't that many
ways to skin this cat.
v2/3: Rebased.
v4: Repaint this shed. The functions are now called _add_dependency()
and _add_implicit_dependency()
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]> (v3)
Reviewed-by: Steven Price <[email protected]> (v1)
Acked-by: Melissa Wen <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sumit Semwal <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Lee Jones <[email protected]>
Cc: Nirmoy Das <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
This is a very confusingly named function, because not only does it
init an object, it arms it and provides a point of no return for
pushing a job into the scheduler. It would be nice if that were a bit
clearer in the interface.
But the real reason is that I want to push the dependency tracking
helpers into the scheduler code, and that means drm_sched_job_init
must be called a lot earlier, without arming the job.
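A hedged sketch of the driver-side submit flow this split enables; the scheduler entry points are real, 'struct my_job' and my_add_dependencies() are illustrative, and signatures are as of this series (drm_sched_entity_push_job() later lost its entity parameter):

  static int my_submit(struct my_job *job, struct drm_sched_entity *entity)
  {
          int ret;

          ret = drm_sched_job_init(&job->base, entity, job);
          if (ret)
                  return ret;

          /* dependencies are gathered here, before the point of no return */
          ret = my_add_dependencies(job);
          if (ret)
                  goto err_cleanup;

          drm_sched_job_arm(&job->base);                  /* point of no return */
          drm_sched_entity_push_job(&job->base, entity);
          return 0;

  err_cleanup:
          /* handles being called both before and after drm_sched_job_arm() */
          drm_sched_job_cleanup(&job->base);
          return ret;
  }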
v2:
- don't change .gitignore (Steven)
- don't forget v3d (Emma)
v3: Emma noticed that I leak the memory allocated in
drm_sched_job_init if we bail out before the point of no return in
subsequent driver patches. To be able to fix this change
drm_sched_job_cleanup() so it can handle being called both before and
after drm_sched_job_arm().
Also improve the kerneldoc for this.
v4:
- Fix the drm_sched_job_cleanup logic, I inverted the booleans, as
usual (Melissa)
- Christian pointed out that drm_sched_entity_select_rq() also needs
to be moved into drm_sched_job_arm, which made me realize that the
job->id definitely needs to be moved too.
Shuffle things to fit between job_init and job_arm.
v5:
Reshuffle the split between init/arm once more, amdgpu abuses
drm_sched.ready to signal gpu reset failures. Also document this
somewhat. (Christian)
v6:
Rebase on top of the msm drm/sched support. Note that the
drm_sched_job_init() call is completely misplaced, and hence also the
split-out drm_sched_entity_push_job(). I've put in a FIXME which the next
patch will address.
v7: Drop the FIXME in msm, after discussions with Rob I agree it shouldn't
be a problem where it is now.
Acked-by: Christian König <[email protected]>
Acked-by: Melissa Wen <[email protected]>
Cc: Melissa Wen <[email protected]>
Acked-by: Emma Anholt <[email protected]>
Acked-by: Steven Price <[email protected]> (v2)
Reviewed-by: Boris Brezillon <[email protected]> (v5)
Signed-off-by: Daniel Vetter <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christian Gmeiner <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Tomeu Vizoso <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Alyssa Rosenzweig <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sumit Semwal <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Adam Borowski <[email protected]>
Cc: Nick Terrell <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Paul Menzel <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Nirmoy Das <[email protected]>
Cc: Deepak R Varma <[email protected]>
Cc: Lee Jones <[email protected]>
Cc: Kevin Wang <[email protected]>
Cc: Chen Li <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: "Marek Olšák" <[email protected]>
Cc: Dennis Li <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Sonny Jiang <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Tian Tao <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Emma Anholt <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.
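A minimal sketch, assuming a panfrost-like driver structure (names illustrative; the drm_sched_init() parameters are elided since the exact signature has changed over time):

  static int my_gpu_init_reset_wq(struct my_gpu *gpu)
  {
          /* one ordered workqueue per device, shared by all schedulers */
          gpu->reset_wq = alloc_ordered_workqueue("my-gpu-reset", 0);
          if (!gpu->reset_wq)
                  return -ENOMEM;

          /* pass gpu->reset_wq as the new timeout_wq argument of
           * drm_sched_init() for every hardware queue's scheduler, so that
           * all timeout handlers run strictly one at a time */
          return 0;
  }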
v5:
* Add a new paragraph to the timedout_job() method
v3:
* New patch
v4:
* Actually use the timeout_wq to queue the timeout work
Suggested-by: Daniel Vetter <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Lucas Stach <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Acked-by: Christian König <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Emma Anholt <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: "Christian König" <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
The panfrost driver tries to kill in-flight jobs on FD close after
destroying the FD scheduler entities. For this to work properly, we
need to make sure the jobs popped from the scheduler entities have
been queued at the HW level before declaring the entity idle, otherwise
we might iterate over a list that doesn't contain those jobs.
Suggested-by: Lucas Stach <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Cc: Lucas Stach <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Lucas Stach <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Problem: If the scheduler is already stopped by the time the sched_entity
is released and the entity's job_queue is not empty, I encountered
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes true.
Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
as the scheduler main thread which wakes them up is stopped by now.
v2:
Reverse the order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinsertion back into the rq due
to a race.
v3:
Drop drm_sched_rq_remove_entity, only modify entity->stopped
and check for it in drm_sched_entity_is_idle
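A hedged sketch of the v3 approach (close to, but not exactly, the upstream code):

  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
          struct drm_sched_entity *s_entity;
          int i;

          kthread_stop(sched->thread);

          for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
                  struct drm_sched_rq *rq = &sched->sched_rq[i];

                  spin_lock(&rq->lock);
                  list_for_each_entry(s_entity, &rq->entities, list)
                          /* lets drm_sched_entity_is_idle() return true */
                          s_entity->stopped = true;
                  spin_unlock(&rq->lock);
          }

          /* wake up processes stuck in drm_sched_entity_flush() */
          wake_up_all(&sched->job_scheduled);
  }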
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
We don't want to rearm the timer if the driver hook reports
that the device is gone.
v5: Update the drm_gpu_sched_stat values in the code.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Update the timestamp of the scheduled fence on HW
completion of the previous fences.
This allows more accurate tracking of the fence
execution in HW.
v2 (chk): drop the flag check and improve the comment
Signed-off-by: David M Nieto <[email protected]>
Signed-off-by: Roy Sun <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
[Why]
The previous TDR design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because the gfx and compute rings share
internal GC hardware.
[How]
This patch implements an advanced TDR mode. It involves an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
step in order to find the real bad job.
1. At the Step0 Resubmit stage, it synchronously submits and waits for the
first job to be signaled. If it times out, we identify it as guilty
and do a HW reset. After that, we do the normal resubmit step to
resubmit the remaining jobs.
2. For a whole GPU reset (VRAM lost), resubmit the old way.
v2: squash in build fix (Alex)
Signed-off-by: Jack Zhang <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allow multiple schedulers to share the load balancing score.
This is useful when one engine has different hw rings.
Signed-off-by: Christian König <[email protected]>
Reviewed-and-Tested-by: Leo Liu <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
This patch does not change current behaviour.
The driver's job timeout handler now returns
a status indicating back to the DRM layer
whether the device (GPU) is no longer available,
such as after it's been unplugged, or whether all
is normal, i.e. current behaviour.
All drivers which make use of the
drm_sched_backend_ops' .timedout_job() callback
have been updated accordingly and return the
would-have-been default value of
DRM_GPU_SCHED_STAT_NOMINAL to restart the task's
timeout timer; this is the old behaviour, and is
preserved by this patch.
v2: Use enum as the status of a driver's job
timeout callback method.
v3: Return scheduler/device information, rather
than task information.
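A hedged sketch of a driver's .timedout_job() callback under the new contract ('my_*' names are illustrative):

  static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
  {
          struct my_job *job = to_my_job(sched_job);

          /* device unplugged or otherwise gone: tell the scheduler */
          if (!my_device_is_present(job->dev))
                  return DRM_GPU_SCHED_STAT_ENODEV;

          my_reset_hw(job->dev);

          /* all is normal: restart the task's timeout timer as before */
          return DRM_GPU_SCHED_STAT_NOMINAL;
  }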
Cc: Alexander Deucher <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Lucas Stach <[email protected]>
Cc: Russell King <[email protected]>
Cc: Christian Gmeiner <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Tomeu Vizoso <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Alyssa Rosenzweig <[email protected]>
Cc: Eric Anholt <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Alyssa Rosenzweig <[email protected]>
Acked-by: Christian König <[email protected]>
Acked-by: Steven Price <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/415095/
|
|
To avoid any possible use after free.
Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/414814/
CC: [email protected]
Signed-off-by: Christian König <[email protected]>
|
|
Required backmerge since we will be based on top of v5.11, and there
has been a request to backmerge already to upstream some features.
Signed-off-by: Maarten Lankhorst <[email protected]>
|
|
git://people.freedesktop.org/~agd5f/linux into drm-next
amd-drm-next-5.11-2020-12-09:
amdgpu:
- SR-IOV fixes
- Navy Flounder updates
- Sienna Cichlid updates
- Dimgrey Cavefish updates
- Vangogh updates
- Misc SMU fixes
- Misc display fixes
- Last big hunk of W=1 warning fixes
- Cursor validation fixes
- CI BACO updates
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Dave Airlie <[email protected]>
|
|
The job done callback is called from various
places, in two roles: as the job-done handler,
and as a fence callback.
Reduce the callback to a core function
which just completes the job, and add a
second function with the fence-callback
prototype which calls the former to complete
the job.
This is used in later patches by the completion
code.
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/405574/
Cc: Alexander Deucher <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Christian König <[email protected]>
|
|
Rename "ring_mirror_list" to "pending_list",
to describe what something is, not what it does,
how it's used, or how the hardware implements it.
This also abstracts away the actual hardware
implementation, i.e. how the low-level driver
communicates with the device it drives (ring, CAM,
etc.) shouldn't be exposed to DRM.
The pending_list keeps jobs that have been
submitted and are thus out of our control.
Usually this means they are pending execution
in hardware, but "out of our control" is the
more general (inclusive) definition.
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/405573/
Cc: Alexander Deucher <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Christian König <[email protected]>
|
|
Rename "node" to "list" in struct drm_sched_job,
in order to make it consistent with what we see
being used throughout gpu_scheduler.h, for
instance in struct drm_sched_entity, as well as
the rest of DRM and the kernel.
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/403515/
Cc: Alexander Deucher <[email protected]>
Cc: Andrey Grodzovsky <[email protected]>
Cc: Christian König <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Christian König <[email protected]>
|
|
Some identifiers have different names between their prototypes
and the kernel-doc markup.
Others need to be fixed, as kernel-doc markups should use this format:
identifier - description
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Acked-by: Jani Nikula <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/12d4ca26f6843618200529ce5445063734d38c04.1605521731.git.mchehab+huawei@kernel.org
|
|
parameter
Fixes the following W=1 kernel build warning(s):
drivers/gpu/drm/scheduler/sched_main.c:74: warning: Function parameter or member 'sched' not described in 'drm_sched_rq_init'
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sumit Semwal <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Lee Jones <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
git://people.freedesktop.org/~agd5f/linux into drm-next
amd-drm-next-5.10-2020-09-03:
amdgpu:
- RAS fixes
- Sienna Cichlid updates
- Navy Flounder updates
- DCE6 (SI) support in DC
- Enable plane rotation
- Rework pre-OS vram reservation handling during driver init
- Add standard interface to dump GPU metrics table from SMU
- Rework tiling and tmz state handling in atomic commits
- Pstate fixes
- Add voltage and power hwmon interfaces for renoir
- SW CTF fixes
- S/G display fix for Raven
- Print client strings for vmfaults for vega and newer
- Manual fan control fixes
- Display updates
- Reorg power management directory structure
- Misc bug fixes
- Misc code cleanups
amdkfd:
- Topology fixes
- Add SMI events for thermal throttling and GPU resets
radeon:
- switch from pci_* to dma_* for dma allocations
- PLL fix
Scheduler:
- Clean up priority levels
UAPI:
- amdgpu INFO IOCTL query update for TMZ state
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6049
- amdkfd SMI event interface updates
https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/therm_thrott
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Remove DRM_SCHED_PRIORITY_LOW, as it was used
in only one place.
Rename DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT
and separate it by a blank line, as it represents
the (total) count of said priorities and is used
as such in loops throughout the code. (With
0-based indexing, the count is one past the
highest index.)
Remove the redundant word HIGH from the priority
names, and rename *KERNEL* to *HIGH*, as that is
what it really means: high.
v2: Add back KERNEL and remove SW and HW,
in lieu of a single HIGH between NORMAL and KERNEL.
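For reference, a sketch of the resulting enum after this change (v2); the authoritative definition lives in include/drm/gpu_scheduler.h:

  enum drm_sched_priority {
          DRM_SCHED_PRIORITY_MIN,
          DRM_SCHED_PRIORITY_NORMAL,
          DRM_SCHED_PRIORITY_HIGH,
          DRM_SCHED_PRIORITY_KERNEL,

          DRM_SCHED_PRIORITY_COUNT,
          DRM_SCHED_PRIORITY_UNSET = -2
  };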
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|