path: root/drivers/misc
Age | Commit message | Author | Files | Lines
2022-07-12 | habanalabs/gaudi2: add gaudi2 security module | Ofir Bitton | 4 | -2/+3867
Use the generic security module to block all registers in the ASIC and then open only those that need to be accessed by the user. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: add generic security module | Ofir Bitton | 3 | -2/+671
As the ASICs become more complex and have many more registers, we need a better way to configure the security properties. As a reminder, we have two dedicated mechanisms for security: Range Registers and Protection bits. Those mechanisms protect sensitive memory and configuration areas inside the device. The generic module handles the low-level part of the configuration, because the configuration mechanism is identical in all ASICs. The difference is the address ranges and register names. Any ASIC that uses this module should first block all the register blocks in the ASIC. Then, it should open only the registers that need to be accessed by the user (this is opposed to Goya and Gaudi, where we blocked only what should not be accessed by the user). The module contains several functions to unblock a single register, multiple registers, entire blocks, ranges, and ranges with a mask. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
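The block-everything-then-open policy described above can be sketched in plain C. This is only an illustration, not the driver's code: the bitmap model (one protection bit per 32-bit register, bit set meaning blocked), the block size, and every name below (`reg_block`, `sec_block_all`, `sec_unblock_reg`, `sec_unblock_range`, `sec_is_blocked`) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one protection bit per 32-bit register in a block. */
#define REGS_PER_BLOCK  1024u
#define WORDS_PER_BLOCK (REGS_PER_BLOCK / 32u)

struct reg_block {
	uint32_t base;                     /* block base address (illustrative only) */
	uint32_t blocked[WORDS_PER_BLOCK]; /* bit set => user access is blocked */
};

/* Step 1: block every register in every block. */
static void sec_block_all(struct reg_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		memset(blocks[i].blocked, 0xff, sizeof(blocks[i].blocked));
}

/* Step 2: open a single register, given its byte offset inside the block. */
static int sec_unblock_reg(struct reg_block *blk, uint32_t offset)
{
	uint32_t idx = offset / 4u;

	/* Reject unaligned offsets and offsets outside the block. */
	if ((offset & 3u) || idx >= REGS_PER_BLOCK)
		return -1;

	blk->blocked[idx / 32u] &= ~(UINT32_C(1) << (idx % 32u));
	return 0;
}

/* Open a contiguous range of registers [first, last], both byte offsets. */
static int sec_unblock_range(struct reg_block *blk, uint32_t first, uint32_t last)
{
	if (first > last)
		return -1;

	for (uint32_t off = first; off <= last; off += 4u) {
		if (sec_unblock_reg(blk, off))
			return -1;
	}
	return 0;
}

/* Query helper: anything out of range is reported as blocked. */
static bool sec_is_blocked(const struct reg_block *blk, uint32_t offset)
{
	uint32_t idx = offset / 4u;

	if (idx >= REGS_PER_BLOCK)
		return true;
	return (blk->blocked[idx / 32u] >> (idx % 32u)) & 1u;
}
```

The default-deny ordering is the point of the sketch: a register that nobody remembered to list stays blocked, whereas under the older Goya/Gaudi approach a forgotten register stayed open.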
2022-07-12 | habanalabs: remove obsolete device variables used for testing | Oded Gabbay | 5 | -173/+24
There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: initialize new asic properties | Oded Gabbay | 3 | -14/+28
New asic properties were added for Gaudi2. We want to initialize and use them, when relevant, also for Goya and Gaudi. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: add unsupported functions | Oded Gabbay | 2 | -0/+42
There are a number of new ASIC-specific functions that were added for Gaudi2. To make the common code work, we need to define empty implementations of those functions for Goya and Gaudi. Some functions will return an error if called on Goya/Gaudi. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: add gaudi2 asic-specific code | Oded Gabbay | 22 | -97/+11155
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code needs to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains a new debugfs entry to dump razwi events. A razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi2: add asic registers header files | Oded Gabbay | 168 | -2/+136492
Add the relevant GAUDI2 ASIC registers header files. These files are generated automatically from a tool maintained by the VLSI engineers. There are more files which are not upstreamed because only very few defines from those files are used in the driver. For those files, I copied the relevant defines into gaudi2_regs.h and gaudi2_masks.h, to reduce the size of this patch. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: remove redundant argument in access_dev_mem APIs | Ofir Bitton | 3 | -9/+7
Region structure is derived from region type, hence no need to pass it as an argument. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: use %pa to print pci bar size | Oded Gabbay | 2 | -28/+22
PCI bar size is resource_size_t so we should use %pa to make it work correctly on all architectures. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: replace hl_poll_timeout with while loop | Dafna Hirschfeld | 1 | -12/+11
In gaudi_scrub_device_mem, replace the call to hl_poll_timeout with a while loop to avoid using dummy variables. Reported-by: kernel test robot <[email protected]> Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: communicate supported page sizes to user | Ohad Sharabi | 5 | -19/+6
Because in future ASICs the driver will allow the user to set the page size, we need to make sure this data is propagated in all APIs. In addition, since this is already an ASIC property, we no longer need an ASIC function for it. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: remove dead code from free_device_memory() | Tomer Tayar | 1 | -28/+22
free_device_memory() ends with if and else, each has a return statement, followed by another return statement that can never be reached. Restructure the function and remove this dead code. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: enable error interrupt on ARB WDT | Oded Gabbay | 1 | -0/+1
We want to receive an error interrupt in case the watchdog timer expires on arbitration event in the queues. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: page size can only be a power of 2 | Ohad Sharabi | 4 | -7/+2
We dropped support for page sizes that are not a power of 2. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
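One practical consequence of restricting page sizes to powers of 2 is that validation and alignment reduce to bit operations. The userspace sketch below illustrates that; the names `demo_is_pow2` and `demo_align_up` are hypothetical and not taken from the driver (the kernel has its own `is_power_of_2()` helper in `linux/log2.h`).

```c
#include <stdbool.h>
#include <stdint.h>

/* A power of 2 has exactly one bit set; zero is not a power of 2. */
static bool demo_is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Round addr up to the next multiple of page_size.
 * Only valid when page_size is a power of 2, which is why callers
 * should check demo_is_pow2(page_size) first.
 */
static uint64_t demo_align_up(uint64_t addr, uint64_t page_size)
{
	return (addr + page_size - 1) & ~(page_size - 1);
}
```

With arbitrary page sizes the same rounding would need a division and a multiplication, and a bit mask could not be used to extract the offset within a page.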
2022-07-12 | habanalabs: refactor dma asic-specific functions | Ohad Sharabi | 8 | -152/+162
This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether a memory violation occurred, for example freeing a DMA allocation before unmapping it from device memory. To achieve this, the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: remove unused enum | Oded Gabbay | 1 | -22/+9
Also beautify code by preferring single line wherever possible. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: mask constant value before cast | Oded Gabbay | 1 | -4/+4
This fixes a sparse warning of "cast truncates bits from constant value" Signed-off-by: Oded Gabbay <[email protected]>
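The general pattern behind this kind of fix can be shown with a small sketch: mask the constant down to the target width first, and only then cast, so the cast never discards set bits. The constant and macro names below are hypothetical and are not the ones touched by the patch.

```c
#include <stdint.h>

/* Hypothetical 40-bit constant, used only for illustration. */
#define DEMO_ADDR 0x7FFC123456ull

/* Mask first, then cast: the value being narrowed already fits in 32 bits. */
#define DEMO_LOWER_32(c) ((uint32_t)((c) & 0xFFFFFFFFull))
#define DEMO_UPPER_32(c) ((uint32_t)(((c) >> 32) & 0xFFFFFFFFull))

static const uint32_t demo_addr_lo = DEMO_LOWER_32(DEMO_ADDR);
static const uint32_t demo_addr_hi = DEMO_UPPER_32(DEMO_ADDR);
```

Writing `(uint32_t)DEMO_ADDR` directly would produce the same low word, but the truncation would be implicit, which is what the checker complains about.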
2022-07-12 | habanalabs/gaudi: use correct type in assignment | Oded Gabbay | 1 | -1/+1
packets are defined as LE so we need to convert before assigning values to them. Signed-off-by: Oded Gabbay <[email protected]>
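In the kernel this conversion is done with helpers such as `cpu_to_le32()`. As a portable userspace illustration of the same idea, the sketch below stores a CPU-order value into a little-endian field byte by byte. The packet layout and the names `demo_packet`, `put_le32` and `get_le32` are assumptions for the example only.

```c
#include <stdint.h>

/* Hypothetical packet layout: fields are stored as little-endian bytes. */
struct demo_packet {
	uint8_t value[4];
	uint8_t ctl[4];
};

/* Store a CPU-order 32-bit value into a little-endian field. */
static void put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xffu);
	out[1] = (uint8_t)((v >> 8) & 0xffu);
	out[2] = (uint8_t)((v >> 16) & 0xffu);
	out[3] = (uint8_t)((v >> 24) & 0xffu);
}

/* Read a little-endian field back into CPU order. */
static uint32_t get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}
```

Assigning the raw CPU-order value would happen to work on a little-endian host and silently produce a byte-swapped packet on a big-endian one.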
2022-07-12 | habanalabs/gaudi: fix function name in comment | Oded Gabbay | 1 | -1/+1
function name in comment didn't match actual function name. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/goya: move dma direction enum to uapi file | Oded Gabbay | 2 | -26/+14
The values in this enum are not used by h/w but are a contract between userspace and the kernel driver so they must be defined in the uapi file. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: set default value for memory_scrub | Dafna Hirschfeld | 1 | -0/+3
Set a default value for memory scrubbing Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: move call to scrub_device_mem after ctx_fini | Dafna Hirschfeld | 2 | -5/+14
In future ASICs, it would be possible to have a non-idle device when context is released. We thus need to postpone the scrubbing. Postpone it to hpriv release if reset is not executed or to device late init if reset is executed. Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: use memory_scrub_val from debugfs | Dafna Hirschfeld | 1 | -3/+2
In the callback scrub_device_mem, use 'memory_scrub_val' from debugfs for the scrubbing value. Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: don't send addr and size to scrub_device_mem cb | Dafna Hirschfeld | 5 | -38/+36
We use scrub_device_mem only to scrub the entire SRAM and entire DRAM. Therefore there is no need to send addr and size args to the callback. Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: don't do memory scrubbing when unmapping | Dafna Hirschfeld | 1 | -30/+6
There is no need to do memory scrub when unmapping anymore as it is an overhead as long as we have a single user at any given time. Remove that code and change return value of free_phys_pg_pack to void Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: print if firmware is secured during load | Ofir Bitton | 1 | -1/+2
For easier debug, it is desirable to have a simple way to know whether the device is secured or not, hence we dump this indication during boot. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: fix a race condition causing DMAR error | Yuri Nudelman | 5 | -16/+40
There is a rare race condition in the CB completion mechanism that can occur under very high pressure of command submissions.

The preconditions for this to happen are:
1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive.
2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end.

The flow:
1. Two command buffers are being completed on different streams at the same time. Denote those CB1 and CB2.
2. Each command buffer is injected with two messages, 16B each: one for a HBW update of the completion queue, another to raise an interrupt.
3. Assume CB1 updated the completion queue and raised the interrupt.
4. Assume CB2 updated the completion queue but did not raise the interrupt yet.
5. The host receives the interrupt. It goes over the completion queue, sees two completions (CB1 and CB2), and releases them both.
6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines, so to read the last 8B of the last command, it has to access the host again. But CB2 is already released, which causes a DMAR error.

The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding.

Signed-off-by: Yuri Nudelman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
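The padding rule can be worked through with a short calculation. The sketch below is an illustration under stated assumptions, not the driver's implementation: the 128B cache line and the two 16B messages come from the commit message, while the function names and the assumption that NOP padding of any needed byte count is available are invented for the example.

```c
#include <stdint.h>

#define DEMO_CACHE_LINE 128u /* cache line size implied by (128*n + 104)B */
#define DEMO_MSG_SIZE   16u  /* each injected message is 16B */
#define DEMO_NUM_MSGS   2u   /* completion queue update + interrupt */

/*
 * Bytes of NOP padding to append after the user commands so that the
 * two trailing messages start and end inside one cache line.
 */
static uint32_t demo_cb_nop_padding(uint32_t user_cb_size)
{
	uint32_t offset = user_cb_size % DEMO_CACHE_LINE;
	uint32_t tail = DEMO_NUM_MSGS * DEMO_MSG_SIZE;

	/* The tail would cross a cache line boundary: pad to the next line. */
	if (offset + tail > DEMO_CACHE_LINE)
		return DEMO_CACHE_LINE - offset;

	return 0;
}

/* Total size of the patched CB: user commands, padding, trailing messages. */
static uint32_t demo_cb_patched_size(uint32_t user_cb_size)
{
	return user_cb_size + demo_cb_nop_padding(user_cb_size) +
	       DEMO_NUM_MSGS * DEMO_MSG_SIZE;
}
```

For the problematic size of 104B the sketch yields 24B of padding, so both messages occupy bytes 128 to 159 and the engine never has to fetch a second cache line after the completion queue update has been written.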
2022-07-12 | habanalabs/gaudi: fix warning: var might be used uninitialized | Koby Elbaz | 1 | -1/+1
kernel test robot: "warning: variable 'index' is used uninitialized whenever 'if' condition is false" Reported-by: kernel test robot <[email protected]> Signed-off-by: Koby Elbaz <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: move memory_scrub_val to hdev struct | Dafna Hirschfeld | 2 | -4/+4
move the field memory_scrub_val from struct hl_dbg_device_entry to struct hl_device. This is because we want to use this field also if debugfs is off. Signed-off-by: Dafna Hirschfeld <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: fix comment style | Oded Gabbay | 1 | -1/+1
function name should not be preceded with @ Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: use kvcalloc when possible | Oded Gabbay | 1 | -3/+2
kvcalloc is the same as kvmalloc_array with __GFP_ZERO. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: print pointer with correct modifier | Oded Gabbay | 1 | -2/+2
Use %p instead of %llx for printing pointers. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: check fence pointer before use | Oded Gabbay | 1 | -1/+1
fence pointer can be NULL in this path, as shown by an earlier check. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: add critical indication in sram ecc | ran shalit | 1 | -1/+2
Multiple SRAM SERR events are treated as critical events, and host should be notified about it. Thus, adding is_critical indication as part of SRAM ECC failure packet. Signed-off-by: ran shalit <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: notify user process on device unavailable | Tal Cohen | 1 | -1/+4
When a device error occurs, user process would like to get some indication on the error by reading some device HW info. If the device is unavailable, user process can't perform any HW device reading. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: remove unused get_dma_desc_list_size | Oded Gabbay | 3 | -5/+0
This asic callback function is not called anymore from the common code. The asic-specific function itself is called but from within the asic-specific code. Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: fix NULL dereference on cs timeout | Yuri Nudelman | 1 | -2/+2
Device descriptor is accessed before an assignment Signed-off-by: Yuri Nudelman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: fix shift out of bounds | Ofir Bitton | 1 | -7/+9
When validating NIC queues, queue offset calculation must be performed only for NIC queues. Signed-off-by: Ofir Bitton <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: add validity check for cq counter offset | farah kassabri | 1 | -1/+8
Driver performs no validity check for the user cq counter offset used in both wait_for_interrupt and register_for_timestamp APIs. Signed-off-by: farah kassabri <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: fix incorrect MME offset calculation | Koby Elbaz | 1 | -3/+8
Once FW raised an event following a MME2 QMAN error, the driver should have gone to the corresponding status registers, trying to gather more info on the error, yet it was accidentally accessing MME1 QMAN address space. Generally, we have x4 MMEs, while 0 & 2 are marked MASTER, and 1 & 3 are marked SLAVE. The former can be addressed, yet addressing the latter is considered an access violation, and will result in a hung system, which is what unintentionally happened above. Note that this cannot happen in a secured system, since these registers are protected with range registers. Signed-off-by: Koby Elbaz <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: avoid unnecessary error print | Dani Liberman | 1 | -1/+8
When sending a packet to the FW right after it has been reset, we will get a packet timeout. Since this is expected behavior, we don't need to print an error in such a case. Hence, when the driver is in hard reset, it will avoid printing error messages about packet timeouts. Signed-off-by: Dani Liberman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: send an event notification when CS timeout occurs | Tal Cohen | 1 | -9/+17
The driver needs to inform the user process whenever one of its CSs times out. The driver shall recognize the CS timeout and send an eventfd notification to user space whenever a timeout expires on a CS. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: send device reset notification | Tal Cohen | 1 | -3/+10
A device reset event indicates that the device shall be reset after a short delay. In such a case, the driver sends a notification to the user process. This allows the user process to take several debug actions for system diagnostic purposes. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: invoke device reset from one code block | Tal Cohen | 1 | -9/+16
In order to prepare the driver code for device reset event notification, change the event handler function flow to call device reset from one code block. In addition, the commit fixes an issue that reset was performed w/o checking the 'hard_reset_on_fw_event' state and w/o setting the HL_DRV_RESET_DELAY flag. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: expose undefined opcode status via info ioctl | Tal Cohen | 1 | -0/+25
The info ioctl retrieves information on the last undefined opcode that occurred. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs/gaudi: collect undefined opcode error info | Tal Cohen | 4 | -30/+132
When an undefined opcode error occurs, the driver collects the relevant information from the Qman and stores it inside the hdev data structure. An eventfd indication is sent to user space. Note: another commit will follow which adds support for reading the error info by an ioctl. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: fix race between hl_get_compute_ctx() and hl_ctx_put() | Tomer Tayar | 4 | -25/+47
hl_get_compute_ctx() is used to get the pointer to the compute context from the hpriv object. The function is called in code paths that are not necessarily initiated by the user, so it is possible that a context release process will happen in parallel. This can lead to a race condition in which hl_get_compute_ctx() retrieves the context pointer, and just before it increments the context refcount, the context object is released and freed memory is accessed. To avoid this race, add a mutex to protect the context pointer in hpriv. With this lock, hl_get_compute_ctx() will be able to detect if the context has been released or is about to be released. struct hl_ctx_mgr has a mutex for the contexts IDR with a similar "ctx_lock" name, so rename it to just "lock" to avoid confusion with the new lock. Signed-off-by: Tomer Tayar <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
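The locking idea can be sketched in userspace C with a pthread mutex. This is a simplified model and not the driver's code: the struct and function names (`demo_ctx`, `demo_hpriv`, `demo_get_compute_ctx`, `demo_ctx_put`) are hypothetical, and the plain integer refcount guarded by the same lock stands in for the kernel's reference counting.

```c
#include <pthread.h>
#include <stddef.h>

struct demo_ctx {
	int refcount; /* protected by demo_hpriv.ctx_lock in this sketch */
};

struct demo_hpriv {
	pthread_mutex_t ctx_lock; /* protects the ctx pointer below */
	struct demo_ctx *ctx;     /* NULL once the context is released */
};

/*
 * Read the pointer and take a reference under the same lock, so a
 * parallel release cannot free the context between the two steps.
 * Returns NULL if the context is already gone.
 */
static struct demo_ctx *demo_get_compute_ctx(struct demo_hpriv *hp)
{
	struct demo_ctx *c;

	pthread_mutex_lock(&hp->ctx_lock);
	c = hp->ctx;
	if (c)
		c->refcount++;
	pthread_mutex_unlock(&hp->ctx_lock);

	return c;
}

/*
 * Drop a reference. When the last one goes away, clear the pointer in
 * hpriv under the lock. Returns 1 if the caller must free the context.
 */
static int demo_ctx_put(struct demo_hpriv *hp, struct demo_ctx *c)
{
	int last;

	pthread_mutex_lock(&hp->ctx_lock);
	last = (--c->refcount == 0);
	if (last && hp->ctx == c)
		hp->ctx = NULL;
	pthread_mutex_unlock(&hp->ctx_lock);

	return last;
}
```

Without the lock, a reader could load a non-NULL pointer, be preempted while the last reference is dropped and the object freed, and then increment a counter in freed memory, which is the window described above.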
2022-07-12 | habanalabs: keep a record of completed CS outcomes | Yuri Nudelman | 3 | -12/+147
Often, the user is not interested in the completion timestamp of all command submissions. A common situation is, for example, when the user submits a burst of possibly several thousand commands, then requests the completion timestamp of only a couple of specific key commands from the whole burst. The problem is that currently, the outcome of the early commands may be lost due to a large number of later commands that the user does not really care about. This patch creates a separate store with the outcomes of commands the user has explicitly marked as interesting. This store does not mix the marked commands with the unmarked ones, hence the data there will survive for much longer. Signed-off-by: Yuri Nudelman <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
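A minimal sketch of such a separate store follows, assuming a small fixed-size ring that only marked commands are written into. The data structure, its capacity, and all names (`demo_cs_outcome`, `demo_outcome_store`, `demo_outcome_record`, `demo_outcome_find`) are assumptions for illustration; the commit message does not say how the driver organizes its store.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OUTCOME_SLOTS 8u /* hypothetical capacity */

struct demo_cs_outcome {
	uint64_t seq;         /* command submission sequence number */
	int64_t timestamp_ns; /* completion timestamp */
	int error;            /* completion status */
	bool valid;
};

struct demo_outcome_store {
	struct demo_cs_outcome slots[DEMO_OUTCOME_SLOTS];
	size_t next; /* slot to overwrite next */
};

/* Record a completion, but only for commands the user marked. */
static void demo_outcome_record(struct demo_outcome_store *s, uint64_t seq,
				int64_t timestamp_ns, int error, bool marked)
{
	struct demo_cs_outcome *o;

	if (!marked)
		return; /* unmarked commands never evict marked ones */

	o = &s->slots[s->next];
	o->seq = seq;
	o->timestamp_ns = timestamp_ns;
	o->error = error;
	o->valid = true;
	s->next = (s->next + 1u) % DEMO_OUTCOME_SLOTS;
}

/* Look up the stored outcome of a marked command, or NULL if absent. */
static const struct demo_cs_outcome *
demo_outcome_find(const struct demo_outcome_store *s, uint64_t seq)
{
	for (size_t i = 0; i < DEMO_OUTCOME_SLOTS; i++) {
		if (s->slots[i].valid && s->slots[i].seq == seq)
			return &s->slots[i];
	}
	return NULL;
}
```

Because only marked commands consume slots, an entry is evicted only after `DEMO_OUTCOME_SLOTS` further marked completions, regardless of how many unmarked commands complete in between.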
2022-07-12 | habanalabs/gaudi: fix comment to reflect current code | Oded Gabbay | 1 | -2/+8
Due to code changes in the past few years, the original comment of how parser->user_cb_size is checked was not correct anymore. Fix it to reflect current code and add more explanation as the code is more complex now. Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2022-07-12 | habanalabs: change the write flag name of error info structs | Tal Cohen | 4 | -12/+12
Positive flag naming will make the code clearer when adding more 'error info' structures. Signed-off-by: Tal Cohen <[email protected]> Reviewed-by: Oded Gabbay <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>