path: root/drivers/misc
Age Commit message Author Files Lines
2021-12-30misc: lattice-ecp3-config: Fix task hung when firmware load failedWei Yongjun1-6/+6
When firmware load fails, the kernel reports a hung task as follows:

INFO: task xrun:5191 blocked for more than 147 seconds.
Tainted: G W 5.16.0-rc5-next-20211220+ #11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:xrun state:D stack: 0 pid: 5191 ppid: 270 flags:0x00000004
Call Trace:
 __schedule+0xc12/0x4b50 kernel/sched/core.c:4986
 schedule+0xd7/0x260 kernel/sched/core.c:6369 (discriminator 1)
 schedule_timeout+0x7aa/0xa80 kernel/time/timer.c:1857
 wait_for_completion+0x181/0x290 kernel/sched/completion.c:85
 lattice_ecp3_remove+0x32/0x40 drivers/misc/lattice-ecp3-config.c:221
 spi_remove+0x72/0xb0 drivers/spi/spi.c:409

lattice_ecp3_remove() waits for a signal from firmware loading, but when the load fails, firmware_load() does not send this signal. This causes device removal to hang. Fix it by sending the signal even if the load failed.

Fixes: 781551df57c7 ("misc: Add Lattice ECP3 FPGA configuration via SPI") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wei Yongjun <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
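The shape of this fix can be illustrated with a minimal userspace sketch. It is not the driver's code: `struct completion_model`, `complete_model()` and `firmware_load_model()` are hypothetical stand-ins for the kernel completion and the firmware callback, and a NULL image is assumed to represent a failed load.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_model {
	bool done;
};

static void complete_model(struct completion_model *c)
{
	c->done = true;
}

/*
 * Models a firmware-load callback: fw == NULL means the load failed.
 * Every exit path goes through the "out" label, so a waiter in the
 * remove path is always released.
 */
static int firmware_load_model(const void *fw, struct completion_model *c)
{
	int ret = 0;

	if (fw == NULL) {
		ret = -1;	/* load failed, nothing to program */
		goto out;
	}

	/* ... program the device with the firmware image here ... */

out:
	complete_model(c);	/* signal even when the load failed */
	return ret;
}
```

Routing the early return through a single label keeps the signal from being skipped if more error paths are added later.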
2021-12-29cxl: use default_groups in kobj_typeGreg Kroah-Hartman1-1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the cxl code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Frederic Barrat <[email protected]> Cc: Andrew Donnellan <[email protected]> Cc: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-12-27mei: cleanup status before client dma setup callAlexander Usyskin1-0/+4
The upper layer may retry call to mei_cl_dma_alloc_and_map(), in that case the client status may be non-zero after the previous call and the wait condition will be true immediately. Set cl->status to zero to allow waiting for an actual result from the firmware. Signed-off-by: Alexander Usyskin <[email protected]> Signed-off-by: Tomas Winkler <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-12-27mei: add POWERING_DOWN into device state printAlexander Usyskin1-0/+1
The POWERING_DOWN state string was missing from the device states list, add it. Signed-off-by: Alexander Usyskin <[email protected]> Signed-off-by: Tomas Winkler <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-12-26habanalabs: support hard-reset scheduling during soft-resetOfir Bitton2-3/+31
As hard-reset can be requested during soft-reset, driver must allow it or else critical events received during soft-reset will be ignored. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add a lock to protect multiple reset variablesOfir Bitton4-26/+49
Atomic operations during reset are replaced by a spinlock in order to have the ability to protect more than a single variable. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: refactor reset information variablesOfir Bitton13-106/+119
Unify variables related to device reset, which will help us to add some new reset functionality in future patches. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: handle skip multi-CS if handling not doneOhad Sharabi1-1/+11
This patch fixes an issue in which we get a timeout for multi-CS although the CS in the list actually completed.

Example scenario (the two threads are marked WAIT for the thread that handles wait_for_multi_cs and CMPL for the thread that signals completion for both CS and multi-CS):
1. Submit CS with sequence X
2. [WAIT]: call wait_for_multi_cs with single CS X
3. [CMPL]: CS X invokes complete_all for both CS and multi-CS (multi_cs_completion_done still false)
4. [WAIT]: enter poll_fences, reinit the completion and find the CS as completed when asking on the fence, but because multi_cs_done is still false it returns that no CS actually completed
5. [CMPL]: set multi_cs_handling_done as true
6. [WAIT]: wait for completion, but there is no CS left to wake the wait context, hence wait until timeout

Solution: if a CS is detected as completed in poll_fences but multi_cs_done is still false, invoke complete_all on the multi-CS completion, so that it will not go to sleep in wait_for_completion but rather will have a "second chance" to wait for multi_cs_completion_done.

Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add CPU-CP packet for engine core ASID cfgTomer Tayar3-0/+26
In some cases the driver cannot configure ASID of some engines due to the security level of the relevant registers. For this a new CPU-CP packet is introduced, which will allow the driver to ask the F/W to do this configuration instead. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: replace some -ENOTTY with -EINVALOded Gabbay3-5/+5
-ENOTTY is returned in case of an error in the ioctl arguments themselves, such as a function that doesn't exist. In all other cases, where the error is in the arguments of the custom data structures that we define and that are passed in the various ioctls, we need to return -EINVAL. Signed-off-by: Oded Gabbay <[email protected]>
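The distinction can be sketched in a small userspace dispatcher. The command numbers, `struct demo_wait_args` and `demo_ioctl()` below are hypothetical and exist only to show which error belongs to which kind of mistake.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers, for illustration only. */
enum demo_cmd { DEMO_CMD_INFO = 0, DEMO_CMD_WAIT = 1, DEMO_CMD_MAX };

/* Hypothetical payload of the WAIT command. */
struct demo_wait_args {
	unsigned int timeout_us;
	unsigned int flags;	/* only bit 0 is defined in this sketch */
};

static int demo_ioctl(unsigned int cmd, struct demo_wait_args *args)
{
	/* The ioctl itself is wrong: no such command exists. */
	if (cmd >= DEMO_CMD_MAX)
		return -ENOTTY;

	if (cmd == DEMO_CMD_WAIT) {
		/* The command exists, but its payload is malformed. */
		if (args == NULL || (args->flags & ~1u))
			return -EINVAL;
	}

	return 0;
}
```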
2021-12-26habanalabs: fix comments according to kernel-docOfir Bitton1-7/+17
Fix missing fields, descriptions not according to kernel-doc style. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: fix endianness when reading cpld versionOfir Bitton1-1/+1
Current sysfs implementation does not take endianness into consideration when dumping the cpld version. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
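As a rough illustration of the class of bug, the sketch below decodes a 32-bit value byte by byte so the result does not depend on host byte order. It assumes the value arrives in little-endian order, which the commit message does not state, and `demo_le32_to_cpu()` is a hypothetical userspace helper; in kernel code the usual tool for a little-endian field is `le32_to_cpu()`.

```c
#include <stdint.h>

/* Decode a 32-bit little-endian value regardless of host byte order. */
static uint32_t demo_le32_to_cpu(const uint8_t raw[4])
{
	return (uint32_t)raw[0] |
	       ((uint32_t)raw[1] << 8) |
	       ((uint32_t)raw[2] << 16) |
	       ((uint32_t)raw[3] << 24);
}
```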
2021-12-26habanalabs: change wait_for_interrupt implementationfarah kassabri4-10/+145
Currently the cq counters are allocated in userspace memory and mapped by the driver to the device address space. A new requirement, part of a future API related to this one, is that cq counters be allocated in kernel memory. We leverage the existing cb_create API with the KERNEL_MAPPED flag set to allocate this memory. That way we gain two things:
1. The memory cannot be freed while in use, since it is protected by a refcount in the driver.
2. There is no need to wake up the user thread upon each interrupt from the CQ, because the kernel has direct access to the counter. Therefore, it can compare the counter with the target value in the interrupt handler and wake up the user thread only if the counter reaches the target value. This replaces waking the thread up to copy the counter value from user memory and then going back to sleep if the target value wasn't reached.

Signed-off-by: farah kassabri <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
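The second point can be sketched as follows. This is a simplified userspace model, not the driver's handler: `struct demo_cq_wait` and `demo_cq_irq()` are hypothetical, and setting a flag stands in for waking the user thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one pending user wait. */
struct demo_cq_wait {
	const volatile uint64_t *counter;	/* kernel-mapped CQ counter */
	uint64_t target;			/* value the user waits for */
	bool woken;				/* stands in for a wake-up */
};

/*
 * Called from the (modelled) interrupt handler.
 * Returns true only when it actually wakes the waiter.
 */
static bool demo_cq_irq(struct demo_cq_wait *w)
{
	if (*w->counter < w->target)
		return false;	/* not there yet: leave the user thread asleep */

	w->woken = true;
	return true;
}
```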
2021-12-26habanalabs: prevent wait if CS in multi-CS list completedOhad Sharabi2-34/+54
By the original design we assumed that if we "miss" a multi-CS completion it is of no severe consequence, as we'll just call wait_for_multi_cs again.

Sequence of events for such a scenario:
1. user submits CS with sequence N
2. user calls wait for multi-CS with only CS #N in the list
3. the multi CS call starts with a poll of the CSs but finds that none completed (as CS #N has not completed yet)
4. now, multi CS #N completes, but the multi CS CTX was not yet created for the above multi-CS, so the attempt to complete multi-CS fails (as no multi CS CTX exists)
5. the wait_for_multi_cs call now does init_wait_multi_cs_completion (and for this creates the multi-CS CTX)
6. wait_for_multi_cs waits on completion but will not get one, as CS #N already completed

To fix the issue we initialize the multi-CS CTX prior to polling the fences.

Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: modify cpu boot status error printOfir Bitton1-1/+1
As BTL can be replaced by ROM we should modify relevant error print. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: clean MMU headers definitionsOhad Sharabi6-49/+64
During the MMU development the MMU header files were left with unclean definitions: - MMU "version specific" definitions that were left in the mmu_general file - unused definitions This patch attempts, where possible, to keep definitions that can serve multiple MMU versions (but that are not tightly bound with specific MMU arch) in the mmu_general header file (e.g. different definitions for number of HOPs). Otherwise, move MMU version specific definitions (e.g. HOPs masks and shifts) to the specific MMU version file. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: expose soft reset sysfs nodes for inference ASICOfir Bitton1-2/+30
As we allow soft-reset to be performed only on inference devices, having the sysfs nodes may cause a confusion. Hence, we remove those nodes on training ASICs. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: sysfs support for two infineon versionsOfir Bitton2-5/+17
Currently sysfs supports dumping a single infineon version; in future ASICs we will have two infineon versions. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: keep control device alive during hard resetDani Liberman4-30/+50
The user needs to be able to retrieve data during reset and afterwards without reopening the device. This is done by separating the user processes list into two lists:
1. fpriv_list, which contains the list of user processes that opened the device (currently only one).
2. fpriv_ctrl_list, which contains the list of user processes that opened the control device. The processes in this list shall not be killed during reset, only when the device is suddenly removed from the PCI chain.

Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: fix hwmon handling for legacy f/wOded Gabbay1-32/+169
In legacy f/w that use old hwmon.h file, the values of the hwmon enums are different than the values that are in newer kernels (5.6 and above). Therefore, to support working with those f/w, we need to do some fixup before registering with the hwmon subsystem and also when calling the functions that communicate with the f/w to retrieve sensors information. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add current PI value to cpu packetsOfir Bitton1-2/+12
In order to increase cpucp messaging reliability we will add the current PI value to the descriptor sent to F/W. F/W will wait for the PI value as an indication of a valid packet. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: remove in_debug check in device openOded Gabbay2-10/+3
The driver supports only a single user anyway, so there is no point in checking whether we are in_debug state when a user tries to open the device, because if we are in_debug, it means a user is already using the device. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: return correct clock throttling periodOfir Bitton1-2/+2
Current clock throttling period returned from driver was wrong due to wrong time comparison. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: wait again for multi-CS if no CS completedOhad Sharabi2-51/+50
The original multi-CS design assumption that stream masters are used exclusively (i.e. multi-CS with set of stream master QIDs will not get completed by CS not from the multi-CS set) is inaccurate. Thus multi-CS behavior is now modified not to treat such case as an error. Instead, if we have multi-CS completion but we detect that no CS from the list is actually completed we will do another multi-CS wait (with modified timeout). Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: remove compute context pointerOded Gabbay6-14/+13
It was an error to save the compute context's pointer in the device structure, as it allowed its use without proper ref-cnt. Change the variable to a flag that only indicates whether there is an active compute context. Code that needs the pointer will now be forced to use proper internal APIs to get the pointer. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add helper to get compute contextOded Gabbay4-15/+36
There are multiple places where the code needs to get the context's pointer and increment its ref cnt. This is the proper way instead of using the compute context pointer in the device structure. Signed-off-by: Oded Gabbay <[email protected]>
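A minimal sketch of such a helper, under stated assumptions: the struct layout and the names `demo_get_compute_ctx()` and `demo_put_ctx()` are hypothetical, and a plain integer replaces the kernel's kref. Real driver code would also need locking around the lookup, which this single-threaded model omits.

```c
#include <stddef.h>

/* Hypothetical context with a plain reference counter. */
struct demo_ctx {
	int refcount;
};

/* Hypothetical per-open record that owns the context. */
struct demo_user {
	struct demo_ctx *ctx;
};

struct demo_dev {
	struct demo_user *user;	/* NULL when nobody has the device open */
};

/* Return the compute context with its refcount raised, or NULL if none. */
static struct demo_ctx *demo_get_compute_ctx(struct demo_dev *dev)
{
	if (dev->user == NULL || dev->user->ctx == NULL)
		return NULL;

	dev->user->ctx->refcount++;
	return dev->user->ctx;
}

/* Drop a reference taken by demo_get_compute_ctx(). */
static void demo_put_ctx(struct demo_ctx *ctx)
{
	ctx->refcount--;
}
```

Callers pair each successful get with a put, so the pointer is never used without holding a reference.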
2021-12-26habanalabs: fix etr asid configurationOded Gabbay8-20/+21
Pass the user's context pointer into the etr configuration function to extract its ASID. Using the compute_ctx pointer is an error as it is just an indication of whether a user has opened the compute device. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: save ctx inside encaps signalOded Gabbay4-9/+16
Compute context pointer in hdev shouldn't be used for fetching the context's pointer. If an object needs the context's pointer, it should get it while incrementing its kref, and when the object is released, put it. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: remove redundant check on ctx_finiOded Gabbay1-3/+1
The driver supports only a single context. Therefore, no need to check if the user context that is closed is the compute context. The user context, if exists, is always the compute context. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: free signal handle on failureOded Gabbay1-1/+3
Fix a bug where in case of failure to allocate idr, the handle's memory wasn't freed as part of the error handling code. Signed-off-by: Oded Gabbay <[email protected]>
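The error-handling pattern can be sketched in userspace C. Everything here is hypothetical (`demo_handle_create()`, `demo_alloc_id()` and the live-handle counter); `malloc()`/`free()` stand in for the kernel allocator and a stub stands in for the idr allocation.

```c
#include <stdlib.h>

/* Counts handles that are allocated and not yet freed (for the sketch only). */
static int demo_live_handles;

struct demo_handle {
	int id;
};

/* Hypothetical ID allocator: fails on request, otherwise returns 1. */
static int demo_alloc_id(int should_fail)
{
	return should_fail ? -1 : 1;
}

static struct demo_handle *demo_handle_create(int id_alloc_fails)
{
	struct demo_handle *h = malloc(sizeof(*h));
	int id;

	if (h == NULL)
		return NULL;
	demo_live_handles++;

	id = demo_alloc_id(id_alloc_fails);
	if (id < 0)
		goto free_handle;	/* do not leak the handle */

	h->id = id;
	return h;

free_handle:
	free(h);
	demo_live_handles--;
	return NULL;
}
```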
2021-12-26habanalabs: add missing kernel-doc comments for hl_device fieldsTomer Tayar1-0/+2
Add missing kernel-doc comments for the "last_error" and "stream_master_qid_arr" fields of the "hl_device" structure. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: pass reset flags to reset threadTomer Tayar2-9/+5
The reset flags used by the reset thread are currently a mix of hard-coded values and a specific flag which is passed from the context that initiates the reset. To make it easier to pass more flags in future from this context to the reset thread, modify it to pass all the original reset flags to the thread. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: enable access to info ioctl during hard resetDani Liberman2-11/+1
Because info ioctl is used to retrieve data, some of its opcodes may be used during hard reset. Other ioctls should be blocked while device is not operational. Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add SOB information to signal submission uAPIDani Liberman3-7/+38
For debug purposes, add the SOB address and the SOB initial counter value before the current submission to the uAPI output. Using the SOB address and initial counter, the user can calculate how much of the submission has been completed. Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: skip read fw errors if dynamic descriptor invalidOhad Sharabi2-2/+17
Reporting FW errors involves reading the error registers. If we have a corrupted FW descriptor we cannot do that, since the dynamic scratchpad is potentially corrupted as well and may cause a kernel crash when attempting to access a corrupted register offset. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: handle events during soft-resetOfir Bitton3-1/+7
Driver should handle events during soft-reset as F/W is not going through reset and it keeps sending events towards host. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: change misleading IRQ warning during resetOfir Bitton1-3/+1
Currently we dump the physical IRQ line index in host if an event is received during reset. This ID is confusing as it means nothing to the user. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add power information type to POWER_GET packetTomer Tayar1-0/+1
In new f/w versions, it is required to explicitly indicate the power information type when querying the F/W for power info. When getting the current power level it should be set to power_input. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add more info ioctls support during resetOfir Bitton1-28/+27
Some info ioctls can be served even if the device is disabled or in reset. Hence, we enable more info ioctls during reset, as these ioctls do not require any H/W nor F/W communication. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: fix race condition in multi CS completionDani Liberman1-0/+7
Race example scenario:
1. User has 2 threads that wait on multi CS:
   - thread_0 waits on QID 0 and uses multi CS context 0.
   - thread_1 waits on QID 1 and uses multi CS context 1.
2. thread_1 gets completion and releases multi CS context 1.
3. The CS related to the multi CS of thread_0 starts executing the complete_multi_cs function; the first iteration of the loop completes the multi CS of thread_0, hence multi CS context 0 is released.
4. thread_1 waits on QID 1 and uses multi CS context 0.
5. thread_0 waits on QID 0 and uses multi CS context 1.
6. The second iteration of the loop (from step 3) starts, which means it starts checking multi CS context 1:
   - The multi CS context is being used by thread_0 waiting on QID 0.
   - The fence of the CS (still the CS from step 3) has the same QID map as multi CS context 1.
   - multi CS context 1 (thread_0) gets completion on a CS that already triggered thread_0 (with multi CS context 0) and is no longer being waited on.

Fixed by exiting the loop in complete_multi_cs after getting completion.

Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
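The fix described above, leaving the loop after the first completion, can be sketched as below. The data layout and `demo_complete_multi_cs()` are hypothetical simplifications; locking and the actual completion primitive are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_MCS 4

/* Hypothetical multi-CS wait context. */
struct demo_mcs_ctx {
	bool used;		/* a thread is currently waiting on it */
	uint32_t qid_map;	/* queues this wait is interested in */
	bool completed;
};

/*
 * Complete at most one multi-CS context for a finished CS.
 * Returns the number of contexts completed (0 or 1).
 */
static int demo_complete_multi_cs(struct demo_mcs_ctx *ctxs, uint32_t cs_qid_map)
{
	int completed = 0;

	for (int i = 0; i < DEMO_MAX_MCS; i++) {
		struct demo_mcs_ctx *m = &ctxs[i];

		if (!m->used || !(m->qid_map & cs_qid_map))
			continue;

		m->completed = true;
		completed++;
		break;	/* one CS must not complete a second, reused context */
	}

	return completed;
}
```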
2021-12-26habanalabs: move device boot warnings to the correct locationOfir Bitton1-22/+23
As device boot warnings clears the indication from the error mask, they must be located together before the unknown error validation. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs/gaudi: return EPERM on non hard-resetOded Gabbay1-6/+2
GAUDI supports only hard-reset. Therefore, this function should return an error of operation not permitted. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: rename late init after reset functionOded Gabbay4-7/+7
The ASIC-specific soft_reset_late_init() is now called after either soft-reset or reset-upon-device-release. Therefore, it needs a more appropriate name. No need to split it to two functions, as an ASIC either supports soft-reset or reset-upon-device-release. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: fix soft reset accountingOded Gabbay1-25/+25
Reset upon device release is not a soft-reset from user/system point of view. As such, we shouldn't count that reset in the statistics we gather and expose to the monitoring applications. We also shouldn't print soft-reset when doing the reset upon device release. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: Move frequency change thread to goya_late_initRajaravi Krishna Katta8-89/+100
Changing the frequency automatically is only done in Goya. In future ASICs this is done inside the firmware. Therefore, move the common code into the Goya specific files.

Main changes as part of the commit are:
1. The thread for setting frequency is moved from device_late_init to goya_late_init
2. hl_device_set_frequency is removed from hl_device_open as it is not relevant for other ASICs and for Goya it is taken care of by the thread
3. hl_device_set_frequency is renamed as goya_set_frequency

Signed-off-by: Rajaravi Krishna Katta <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: abort reset on invalid requestOded Gabbay1-7/+9
Hard-reset is mutually exclusive with reset-on-device-release. Therefore, if such a request arrives at the reset function, abort the reset and return an error to the caller. Signed-off-by: Oded Gabbay <[email protected]>
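A sketch of this kind of check, with hypothetical flag names and a hypothetical `demo_device_reset()`; the commit message only says that an error is returned, so the choice of -EINVAL here is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical reset flags, for illustration only. */
#define DEMO_RESET_HARD		(1u << 0)
#define DEMO_RESET_DEV_RELEASE	(1u << 1)

static int demo_device_reset(uint32_t flags)
{
	/* The two reset kinds are mutually exclusive: reject the request. */
	if ((flags & DEMO_RESET_HARD) && (flags & DEMO_RESET_DEV_RELEASE))
		return -EINVAL;	/* assumed error code */

	/* ... perform the requested reset here ... */
	return 0;
}
```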
2021-12-26habanalabs: fix possible deadlock in cache invl failureOfir Bitton6-35/+39
Currently there is a deadlock in driver in scenarios where MMU cache invalidation fails. The issue is basically device reset being performed without releasing the MMU mutex. The solution is to skip device reset as it is not necessary. In addition we introduce a slight code refactor that prints the invalidation error from a single location. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: skip PLL freq fetchOhad Sharabi2-0/+10
Getting the used PLL index with which to send the CPUCP packet relies on the CPUCP info packet. If the CPU queues are not enabled, getting the PLL index will issue an error and in some ASICs will also fail the driver load. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: prevent false heartbeat messageOded Gabbay1-1/+3
If a device reset has started, there is a chance that the heartbeat function will fail because the device is disabled at the beginning of the reset function. In that case, we don't want the error message to appear in the log. Signed-off-by: Oded Gabbay <[email protected]>
2021-12-26habanalabs: add support for fetching historic errorsDani Liberman5-43/+233
A new uAPI is added for debug purposes, allowing user-space to retrieve error-related data from the previous session (before device reset was performed). The information is filled in when a razwi or CS timeout happens and can contain one of the following:
1. Retrieve the timestamp of the last time the device was opened and a razwi or CS timeout happened.
2. Retrieve information about the last CS timeout.
3. Retrieve information about the last razwi error.

This information doesn't contain user data, so there is no danger of data leakage between users.

Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>