aboutsummaryrefslogtreecommitdiff
path: root/drivers/accel
diff options
context:
space:
mode:
authorKoby Elbaz <kelbaz@habana.ai>2023-05-29 11:41:04 +0300
committerOded Gabbay <ogabbay@kernel.org>2023-10-09 12:37:18 +0300
commite4a97d6b62599cc6c4bd37674de378b2a86225c3 (patch)
tree04f7eb537d451918876013b96b3d7ca390378111 /drivers/accel
parente7b2902a330eba354024581edead0002521bcedf (diff)
accel/habanalabs: set device status 'malfunction' while in rmmod
hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Diffstat (limited to 'drivers/accel')
-rw-r--r--drivers/accel/habanalabs/common/device.c6
1 files changed, 4 insertions, 2 deletions
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 993305871292..5973e4d64e19 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -315,7 +315,9 @@ enum hl_device_status hl_device_status(struct hl_device *hdev)
{
enum hl_device_status status;
- if (hdev->reset_info.in_reset) {
+ if (hdev->device_fini_pending) {
+ status = HL_DEVICE_STATUS_MALFUNCTION;
+ } else if (hdev->reset_info.in_reset) {
if (hdev->reset_info.in_compute_reset)
status = HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE;
else
@@ -343,9 +345,9 @@ bool hl_device_operational(struct hl_device *hdev,
*status = current_status;
switch (current_status) {
+ case HL_DEVICE_STATUS_MALFUNCTION:
case HL_DEVICE_STATUS_IN_RESET:
case HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE:
- case HL_DEVICE_STATUS_MALFUNCTION:
case HL_DEVICE_STATUS_NEEDS_RESET:
return false;
case HL_DEVICE_STATUS_OPERATIONAL: