Age | Commit message (Collapse) | Author | Files | Lines |
|
add amdgpu_amdkfd_pre_reset and amdgpu_amdkfd_post_reset inside amdgpu_device_reset_sriov.
Signed-off-by: Wentao Lou <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When page table are updated by the CPU, synchronize with the
allocation and initialization of newly allocated page tables.
Signed-off-by: Felix Kuehling <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
If using old kernel config file, CONFIG_ZONE_DEVICE is not selected,
so CONFIG_HMM and CONFIG_HMM_MIRROR is not enabled, the current driver
error message "Failed to register MMU notifier" is not clear. Inform
user with more descriptive message on how to fix the missing kernel
config option.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109808
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
userptr may cross two VMAs if the forked child process (not call exec
after fork) malloc buffer, then free it, and then malloc larger size
buf, kerenl will create new VMA adjacent to old VMA which was cloned
from parent process, some pages of userptr are in the first VMA, the
rest pages are in the second VMA.
HMM expects range only have one VMA, loop over all VMAs in the address
range, create multiple ranges to handle this case. See
is_mergeable_anon_vma in mm/mmap.c for details.
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Userptr restore may have concurrent userptr invalidation after
hmm_vma_fault adds the range to the hmm->ranges list, needs call
hmm_vma_range_done to remove the range from hmm->ranges list first,
then reschedule the restore worker. Otherwise hmm_vma_fault will add
same range to the list, this will cause loop in the list because
range->next point to range itself.
Add function untrack_invalid_user_pages to reduce code duplication.
Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Otherwise we won't be able to cleanly handle page faults.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Make sure that not only the entities are flush, but that
we also wait for the HW to finish all processing.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
It's a bug having a dead pointer in the IDR, silently returning
is the worst we can do.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Remove the chash implementation for now since it isn't used any more.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Further testing showed that the idea with the chash doesn't work as expected.
Especially we can't predict when we can remove the entries from the hash again.
So replace the chash with a ring buffer/hash mix where entries in the container
age automatically based on their timestamp.
v2: use ring buffer / hash mix
v3: check the timeout to make sure all entries age
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]> (v2)
Signed-off-by: Alex Deucher <[email protected]>
|
|
Only process a maximum of 32 IVs before writing back the RPTR. This improves
hw handling when we get close to an overflow in the ring buffer.
Signed-off-by: Christian König <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
That doesn't seem to have any negative effects.
Signed-off-by: Christian König <[email protected]>
Acked-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The doorbells should already be reserved, just enable them.
Signed-off-by: Christian König <[email protected]>
Acked-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Disable overflow and enable full drain. This makes fault handling on ring 1
much more reliable since we don't generate back pressure any more.
Signed-off-by: Christian König <[email protected]>
Acked-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The buffers should be cleared when possible but we also don't want
buffer creation to fail in the rare case where the ring isn't ready
during the call. This could happen during some suspend/resume sequences.
Cc: Christian König <[email protected]>
Signed-off-by: Nicholas Kazlauskas <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The dumb_create API isn't intended for high performance rendering
and it's more useful for userspace (ie. IGT) to have them precleared.
The bonus here is that we also won't needlessly leak whatever was
previously in VRAM, but it also probably wasn't sensitive if it was
going through this API.
Signed-off-by: Nicholas Kazlauskas <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:405:2-3: Unneeded semicolon
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:435:2-3: Unneeded semicolon
Remove unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
CC: xinhui pan <[email protected]>
Signed-off-by: kbuild test robot <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add ras post init function.
Do some initialization after all IP have finished their late init.
Add new member flags which will control the ras work flow.
For now, vbios enable ras for us on boot. That might change in the
future.
So there should be a flag from vbios to tell us if ras is enabled or not
on boot. Looks like there is no such info now.
Other bits of the flags are reserved to control other parts of ras.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add drm info output if ras initialized successfully.
add ras atomfirmware sanity check.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
lockdep need a static key.
Previously we set ignore bit to avoid the warning.
Now call sysfs_attr_init to initialize the static key.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-and-Tested-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Unzero char is accepted by sscanf, so when data is structure but
unexpectedly return error invalid;
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Currently, it is not clear how ras is supported. Both software and
hardware can set the supported. That is confusing.
Fix it by adding new member hw_supported.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Set ignore bit to satisfy locpdep.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Ta is optional, so check if ta firmware is loaded or not.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
The descriptions of modinfo wrongly show two parameters
for each feature(see below). This patch can fix this
incorrect outputs.
parm: amdgpu_ras_enable:Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))
parm: ras_enable:int
parm: amdgpu_ras_mask:Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
parm: ras_mask:uint
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Reviewed-by: xinhui pan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
query sram ecc capability via amdgpu_atomfirmware_ecc_default_enabled
query ecc availability via amdgpu_atomfirmware_sram_ecc_supported
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
sram ecc capability could be get from firmware_capability field in firmwareinfo table
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
ecc default status (enabled or disabled) could be get from umc_config field in umc_info table
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Suspend will put irq, so resume need get irq back.
And in the same time, skip other ras initialization.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
RAS ECC event will combine with GPU reset event, due to
ECC interrupts are caused by uncorrectable error that triggers
GPU reset.
v2: Fix misleading-indentation warning
v3: fix build with CONFIG_HSA_AMD disabled
Signed-off-by: Eric Huang <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Currently, the debugfs control node can't parse bash-like commands.
Now add such support for any tester that uses scripts.
v2: squash in fixes for input validation
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
gpu reset is not stable on vega20 A1.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add a query for userspace to check which RAS features
are enabled.
v2: squash in warning fix
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add AMDGPU_CTX_QUERY2_FLAGS_RAS_CE/UE which indicate if any error happened
between previous query and this query.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Register ecc interrupts and ecc interrupt handler on gfx9.
Add ras support on gfx9
v2: squash in warning fix
Signed-off-by: Feifei Xu <[email protected]>
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
register IH, enable ras features on sdma.
create sysfs debugfs file for sdma.
Signed-off-by: xinhui pan <[email protected]>
Signed-off-by: Feifei Xu <[email protected]>
Signed-off-by: Eric Huang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Mark vram pages with errors as bad and prevent the driver
from using them.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
allow userspace enable/disable ras
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
add obj management.
add feature control.
add debugfs infrastructure.
add sysfs infrastructure.
add IH infrastructure.
add recovery infrastructure.
It is a framework. Other IPs need call amdgpu_ras_xxx function instead of
psp_ras_xxx functions.
v2: squash in warning fixes
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add trigger_error and cure_posion.
Acked-by: Hawking Zhang <[email protected]>
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add ras fw loading, init, terminate.
Add ras cmd submit helper.
Add ras feature enable/disable common function.
v2: squash in unused variable warning fix
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Define the driver side interface for ras ta.
Acked-by: Hawking Zhang <[email protected]>
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Allow RAS feature enable/disable via boot parameter.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Output the ta fw, aka xgmi/ras, via debugfs.
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Add ras fw part, xgmi and ras fw are combined together in ta binary.
Reading the data from the info is not implemented yet.
v2: squash in "drm/amdgpu: fix NULL pointer when ta is missing"
Signed-off-by: xinhui pan <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|