Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
|
|
Update copyrights to 2022 for files modified in the 14.2.0.0 patch set.
Link: https://lore.kernel.org/r/20220225022308.16486-18-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the remaining ELS paths to use SLI-4 as the primary
interface. Paths include RRQ, RSCN, unsolicited ELS RQST and RSP paths, ELS
timeouts, etc.:
- Remove unused routines lpfc_sli4_bpl2sgl and lpfc_sli4_iocb2wqe
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-12-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the VMID paths to use SLI-4 as the primary interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-11-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the FDISC paths to use SLI-4 as the primary interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the LS_RJT paths to use SLI-4 as the primary
interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-9-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the LS_ACC paths to use SLI-4 as the primary
interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-8-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the SLI3/SLI4 RSCN/SCR/RDF/EDC/FARPR paths to use
SLI-4 as the primary interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-7-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This patch refactors the PLOGI/PRLI/ADISC/LOGO paths to use SLI-4 as
the primary interface:
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
Link: https://lore.kernel.org/r/20220225022308.16486-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The patch refactors the general ELS handling paths to migrate to SLI-4
structures or common element abstractions. The fabric login paths are
revised as part of this patch:
- New generic lpfc_sli_prep_els_req_rsp jump table routine
- Introduce ls_rjt_error_be and ulp_bde64_le unions to correct legacy
endianness assignments
- Conversion away from using SLI-3 iocb structures to set/access fields in
common routines. Use the new generic get/set routines that were added.
This move changes code from indirect structure references to using local
variables with the generic routines.
- Refactor routines when setting non-generic fields, to have both SLI3 and
SLI4 specific sections. This replaces the set-as-SLI3 then translate to
SLI4 behavior of the past.
- Clean up poor indentation on some of the ELS paths
Link: https://lore.kernel.org/r/20220225022308.16486-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Introduce lpfc_prep_wqe routine.
The lpfc_prep_wqe() routine is used with lpfc_sli_issue_iocb() and
lpfc_sli_issue_iocb_wait(). The routine performs additional SLI-4 wqe field
setting that the generic routines did not perform as they kept their
actions compatible with both SLI3 and SLI4.
Link: https://lore.kernel.org/r/20220225022308.16486-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Currently, SLI3 and SLI4 data paths use the same lpfc_iocbq structure.
This is a "common" structure but many of the components refer to sli-rev
specific entities which can lead the developer astray as to what they
actually mean, should be set to, or when they should be used.
This first patch prepares the lpfc_iocbq structure so that elements common
to both SLI3 and SLI4 data paths are more appropriately named, making it
clear they apply generically.
Fieldnames based on 'iocb' (sli3) or 'wqe' (sli4) which are actually
generic to the paths are renamed to 'cmd':
- iocb_flag is renamed to cmd_flag
- lpfc_vmid_iocb_tag is renamed to lpfc_vmid_tag
- fabric_iocb_cmpl is renamed to fabric_cmd_cmpl
- wait_iocb_cmpl is renamed to wait_cmd_cmpl
- iocb_cmpl and wqe_cmpl are combined and renamed to cmd_cmpl
- rsvd2 member is renamed to num_bdes due to pre-existing usage
The structure name itself will retain the iocb reference as changing to a
more relevant "job" or "cmd" title induces many hundreds of line changes
for only a name change.
lpfc_post_buffer is also renamed to lpfc_sli3_post_buffer to indicate use
in the SLI3 path only.
Link: https://lore.kernel.org/r/20220225022308.16486-2-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
When connected point to point, the driver does not know the FC4's supported
by the other end. In Fabrics, it can query the nameserver. Thus the driver
must send PRLIs for the FC4s it supports and enable support based on the
acc(ept) or rej(ect) of the respective FC4 PRLI. Currently the driver
supports SCSI and NVMe PRLIs.
Unfortunately, although the behavior is per standard, many devices have
come to expect only SCSI PRLIs. In this particular example, the NVMe PRLI
is properly RJT'd but the target decided that it must LOGO after seeing the
unexpected NVMe PRLI. The LOGO causes the sequence to restart and login is
now in an infinite failure loop.
Fix the problem by having the driver, on a pt2pt link, remember NVMe PRLI
accept or reject status across logout as long as the link stays "up". When
retrying login, if the prior NVMe PRLI was rejected, it will not be sent on
the next login.
Link: https://lore.kernel.org/r/20220212163120.15385-1-jsmart2021@gmail.com
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Pull in the 5.16 fixes branch to resolve a conflict in the UFS driver
core.
Conflicts:
drivers/scsi/ufs/ufshcd.c
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The driver is calling schedule_timeout after the DA_ID nameserver request
and LOGO commands are issued to the fabric by the initiator virtual
endport. These fixed delay functions are causing long delays in the
driver's worker thread when processing discovery I/Os in a serialized
fashion, which is then triggering mailbox timeout errors artificially.
To fix this, don't wait on the DA_ID request to complete and call
wait_event_timeout to allow the vport delete thread to make progress on an
event driven basis rather than fixing the wait time.
Link: https://lore.kernel.org/r/20211204002644.116455-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Issuing lpfc_force_rscn twice results in an ndlp kref use-after-free call
trace.
A prior patch reworked the get/put handling by ensuring nlp_get was done
before WQE submission and a put was done in the completion path.
Unfortunately, the issue_els_rscn path had a piece of legacy code that did
a nlp_put, causing an imbalance on the ref counts.
Fixed by removing the unnecessary legacy code snippet.
Link: https://lore.kernel.org/r/20211204002644.116455-4-jsmart2021@gmail.com
Fixes: 4430f7fd09ec ("scsi: lpfc: Rework locations of ndlp reference taking")
Cc: <stable@vger.kernel.org> # v5.11+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During rmmod testing, messages appeared indicating lpfc_mbuf_pool entries
were still busy. This situation was only seen doing rmmod after at least 1
vport (NPIV) instance was created and destroyed. The number of messages
scaled with the number of vports created.
When a vport is created, it can receive a PLOGI from another initiator
Nport. When this happens, the driver prepares to ack the PLOGI and
prepares an RPI for registration (via mbx cmd) which includes an mbuf
allocation. During the unsolicited PLOGI processing and after the RPI
preparation, the driver recognizes it is one of the vport instances and
decides to reject the PLOGI. During the LS_RJT preparation for the PLOGI,
the mailbox struct allocated for RPI registration is freed, but the mbuf
that was also allocated is not released.
Fix by freeing the mbuf with the mailbox struct in the LS_RJT path.
As part of the code review to figure the issue out a couple of other areas
where found that also would not have released the mbuf. Those are cleaned
up as well.
Link: https://lore.kernel.org/r/20211204002644.116455-2-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
A commit introduced formal regstration of all Fabric nodes to the SCSI
transport as well as REG/UNREG RPI mailbox requests. The commit introduced
the NLP_RELEASE_RPI flag for rports set in the lpfc_cmpl_els_logo_acc()
routine to help clean up the RPIs. This new code caused the driver to
release the RPI value used for the remote port and marked the RPI invalid.
When the driver later attempted to re-login, it would use the invalid RPI
and the adapter rejected the PLOGI request. As no login occurred, the
devloss timer on the rport expired and connectivity was lost.
This patch corrects the code by removing the snippet that requests the rpi
to be unregistered. This change only occurs on a node that is already
marked to be rediscovered. This puts the code back to its original
behavior, preserving the already-assigned rpi value (registered or not)
which can be used on the re-login attempts.
Link: https://lore.kernel.org/r/20211123165646.62740-1-jsmart2021@gmail.com
Fixes: fe83e3b9b422 ("scsi: lpfc: Fix node handling for Fabric Controller and Domain Controller")
Cc: <stable@vger.kernel.org> # v5.14+
Co-developed-by: Paul Ely <paul.ely@broadcom.com>
Signed-off-by: Paul Ely <paul.ely@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
A link bounce to a slow fabric may observe FDISC response delays lasting
longer than devloss tmo. Current logic decrements the final fabric node
kref during a devloss tmo event. This results in a NULL ptr dereference
crash if the FDISC completes for that fabric node after devloss tmo.
Fix by adding the NLP_IN_RECOV_POST_DEV_LOSS flag, which is set when
devloss tmo triggers and we've noticed that fabric node recovery has
already started or finished in between the time lpfc_dev_loss_tmo_callbk
queues lpfc_dev_loss_tmo_handler. If fabric node recovery succeeds, then
the driver reverses the devloss tmo marked kref put with a kref get. If
fabric node recovery fails, then the final kref put relies on the ELS
timing out or the REG_LOGIN cmpl routine.
Link: https://lore.kernel.org/r/20211020211417.88754-8-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
A remote nport can stop responding to PLOGI beyond the ELS I/O timeout
under some fault conditions. When this happens, the non-response triggers
a dev_loss_tmo event from the transport which causes the driver to abort
the PLOGI and stop any retries. This was due to a policy in the ELS
completion handler whenever an ELS was terminated due to driver request.
Revise the ELS completion path to detect PLOGIs that were aborted and
allow retries.
Link: https://lore.kernel.org/r/20211020211417.88754-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Merge the 5.15/scsi-fixes branch into the staging tree to resolve UFS
conflict reported by sfr.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This scnprintf() uses the wrong limit. It should be
"LPFC_FPIN_WWPN_LINE_SZ - len" instead of LPFC_FPIN_WWPN_LINE_SZ.
Link: https://lore.kernel.org/r/20210916132251.GD25094@kili
Fixes: 428569e66fa7 ("scsi: lpfc: Expand FPIN and RDF receive logging")
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Currently, we hold off unregistering with NVMe transport layer until GID_FT
or ADISC completes upon receipt of RSCN. In the ADISC discovery routine,
for nodes not found in the GID_FT response, the nodes are unregistered from
the SCSI transport but not UNREG_RPI'd. Meaning outstanding WQEs continue
to be outstanding and were not failed back to the OS. If an NVMe device,
this mean there wasn't initial termination of the I/Os so they could be
issued on a different NVMe path.
Fix by unregistering the RPI so that I/O is cancelled.
Link: https://lore.kernel.org/r/20210910233159.115896-8-jsmart2021@gmail.com
Fixes: 0614568361b0 ("scsi: lpfc: Delay unregistering from transport until GIDFT or ADISC completes")
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
In pt-2-pt mode, the initiator does not log into the target after a PRLI
error. In pt-2-pt mode, the target responded to the PRLI by sending a
LOGO. The LOGO causes all ELS and I/Os to be aborted. This caused the PRLI
to fail. The PRLI completion path caused the discovery node to be dropped
to avoid being stick in an UNUSED (not logged in) state. As the node was
dropped there is no retry of the login and as it is pt-2-pt, there is no
RSCN to retrigger discovery. Thus the other end is not seen by the OS.
Fix by ensuring the discovery node is not dropped if connecting pt-2-pt.
This will cause PLOGI to be retried.
Link: https://lore.kernel.org/r/20210910233159.115896-7-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
A test scenario encountered an unload hang while an FLOGI ELS was in flight
when a link down condition occurred. The driver fails unload as it never
releases the fport node.
For most nodes, when the link drops, devloss tmo is started and the timeout
will cause the final node release. For the Fport, as it has not yet
registered with the SCSI transport, there is no devloss timer to be
started, so there is no final release. Additionally, the link down
sequence causes ABORTS to be issued for pending ELS's. The completions from
the ABORTS perform the release of node references. However, as the adapter
is being reset to be unloaded, those completions will never occur.
Fix by the following:
- In the ELS cleanup, recognize when unloading and place the ELS's on a
different list that immediately cleans up/completes the ELS's. It's
recognized that this condition primarily affects only the fport, with
other ports having normal clean up logic that handles things.
- Resolve the devloss issue by, when cleaning up nodes on after link down,
recognizing when the fabric node does not have a completed state (its
state is UNUSED) and removing a reference so the node can delete after
the ELS reference is released.
Link: https://lore.kernel.org/r/20210910233159.115896-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
A test scenario has a target issuing a TPLS after accepting the driver's
PRLI. TPLS is not supported by the driver so it rejects the ELS. However,
the reject was only happening on the primary N_Port. If the TPLS was to a
NPIV vport, not only would it reject the ELS, but it would act on the TPLS,
starting devloss, then unregister from the SCSI transport and release the
node. When devloss expired, it would access the node again and cause a page
faul.
Fix by altering the NPIV code to recognize that a correctly registered node
can reject unsolicited ELS I/O and to not unregister with the SCSI
transport and tear the node down. Add a check of the fc4_xpt_flags so that
only a zero value allows the unreg and teardown.
Link: https://lore.kernel.org/r/20210910233159.115896-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
In a rarely executed path, FLOGI failure, there is a refcounting error. If
FLOGI completed with an error, typically a timeout, the initial completion
handler would remove the job reference. However, the job completion isn't
the actual end of the job/exchange as the timeout usually initiates an
ABTS, and upon that ABTS completion, a final completion is sent. The driver
removes the reference again in the final completion. Thus the imbalance.
In the buggy cases, if there was a link bounce while the delayed response
is outstanding, the fport node may be referenced again but there was no
additional reference as it is already present. The delayed completion then
occurs and removes the last reference freeing the node and causing issues
in the link up processed that is using the node.
Fix this scenario by removing the snippet that removed the reference in the
initial FLOGI completion. The bad snippet was poorly trying to identify the
FLOGI as OK to do so by realizing the node was not registered with either
SCSI or NVMe transport.
Link: https://lore.kernel.org/r/20210910233159.115896-3-jsmart2021@gmail.com
Fixes: 618e2ee146d4 ("scsi: lpfc: Fix FLOGI failure due to accessing a freed node")
Cc: <stable@vger.kernel.org> # v5.13+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The kernel test robot reported the following sparse warning:
".../lpfc_els.c:3984:25: sparse: sparse: cast from restricted __be16"
For the error being flagged, using be32_to_cpu() on a be16 data type, it
was simple enough. But a review of other elements and warnings were also
evaluated.
This patch corrected several items in the original patch:
- Using be32_to_cpu() on a be16 data type
- cpu_to_le32() used on a std uint32_t (CPU) data type.
Note: This is a byte array, but stored in LE layout by hardware at
32-bit boundaries. So it possibly needed conversion.
- Using cpu_to_le32() on a std uint16_t and assigned to a char typeA
- Using le32_to_cpu() on a le16 type
- Missing cpu_to_le16() on an assignment
Link: https://lore.kernel.org/r/20210830231243.6227-1-jsmart2021@gmail.com
Fixes: 9064aeb2df8e ("scsi: lpfc: Add EDC ELS support")
Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Allow abbreviated cm framework status information to be obtained via sysfs.
Link: https://lore.kernel.org/r/20210816162901.121235-14-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Add the logic to move the congestion management and event information into
the cmd statistics buffer maintained for the adapter. The update includes
rolling up values for the last minute, hour, and day information.
Link: https://lore.kernel.org/r/20210816162901.121235-12-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Complete the enablement of the cm framework feature in the adapter. Perform
the following:
- Detect the presence of the congestion management framework feature.
When the cm framework is present:
- Issue the SET_FEATURE command to enable the feature.
- Register the cm statistics buffer with the adapter.
- Read the cm enablement buffer to determine the cm framework state for cm
management.
When cm management is enabled:
- Monitor all FPIN and congestion signalling events, incrementing
counters.
- Regularly sync with the adapter to communicate congestion events and to
receive an rx request limit.
- Monitor requests for rx data and ensure that no more than the
adapter prescribed limit is issued on the link. If the limit is
exceeded, SCSI and/or NVMe traffic is temporarily suspended.
- Maintain the minute, hourly, daily statistics buffer.
- Monitor for congestion enablement change events, causing a reread of the
enablement buffer and acting on any change in enablement.
And:
- Add teardown logic, including buffer deregistration, on adapter
detachment or reset.
Link: https://lore.kernel.org/r/20210816162901.121235-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
When congestion management is enabled, issue EDC ELS to register congestion
signaling capabilities with the fabric. The response handling will process
the fabric parameters and set the reporting parameters.
Similarly, add support for receiving an EDC request from the fabric
generating a corresponding response.
Implement handlers for congestion signals from the fabric and maintain
statistics for them.
Link: https://lore.kernel.org/r/20210816162901.121235-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Expand FPIN logging:
- Display Attached Port Names for Link Integrity and Peer Congestion
events
- Log Delivery, Peer Congestion, and Congestion events
- Sanity check FPIN descriptor lengths when processing FPIN descriptors.
Log RDF events when congestion logging is enabled.
Link: https://lore.kernel.org/r/20210816162901.121235-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Update routines to support 256 Gb link speed for LPe37000/LPe38000
adapters. 256 Gb speeds can be seen on trunk links.
Link: https://lore.kernel.org/r/20210722221721.74388-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During RSCN storms, some instances of LIP on SLI-3 adapters lead to a
situation where FLOGIs keep failing with firmware indicating an illegal
command error code. This situation was preceded by an ADISC completion
that was processed while the link was down. This path on SLI-3 performs a
CLEAR_LA and attempts to activate a VPI with REG_VPI. Later, as the FLOGI
completes, it's no longer in sync with the VPI state. In SLI-3 it is
illegal to have an active VPI during FLOGI.
Resolve by circumventing the SLI-3 path that performs the CLEAR_LA and
REG_VPI. The path will be taken after the FLOGI after the next Link Up.
Link: https://lore.kernel.org/r/20210707184351.67872-18-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
In the PLOGI and ADISC completion handling, the device removal event could
be skipped during some link errors. This could leave a stale node in UNUSED
state. Driver unload would hang for a long time waiting for this node to
be freed.
Resolve by taking the following steps:
- Always post ADISC completion events to discovery state machine upon
ADISC completion.
- In case of a completion error for PLOGI/ADISC, ensure that init refcount
is dropped if not registered with transport.
Link: https://lore.kernel.org/r/20210707184351.67872-17-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
On an RSCN event, the nodes specified in RSCN payload and in MAPPED state
are moved to NPR state in order to revalidate the login. This triggers an
immediate unregister from SCSI/NVMe backend. The assumption is that the
node may be missing. The re-registration with the backend happens after
either relogin (PLOGI/PRLI; if ADISC is disabled or login truly lost) or
when ADISC completes successfully (rediscover with ADISC enabled).
However, the NVMe-FC standard provides for an RSCN to be triggered when
the remote port supports a discovery controller and there was a change
of discovery log content. As the remote port typically also supports
storage subsystems, this unregister causes all storage controller
connections to fail and require reconnect.
Correct by reworking the code to ensure that the unregistration only occurs
when a login state is truly terminated, thereby leaving the NVMe storage
controllers in place.
The changes made are:
- Retain node state in ADISC_ISSUE when scheduling ADISC ELS retry.
- Do not clear wwpn/wwnn values upon ADISC failure.
- Move MAPPED nodes to NPR during RSCN processing, but do not unregister
with transport. On GIDFT completion, identify missing nodes (not marked
NLP_NPR_2B_DISC) and unregister them.
- Perform unregistration for nodes that will go through ADISC processing
if ADISC completion fails.
- Successful ADISC completion will move node back to MAPPED state.
Link: https://lore.kernel.org/r/20210707184351.67872-16-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Since the REG_LOGIN to the fabric controller happens in parallel with SCR,
it may or may not be completed by the time RDF is sent. RDF and SCR are
sent to the fabric in parallel, so checking for a completed REG_LOGIN in
the RDF submit path is not needed.
Remove the REG_LOGI check from the RDF submission path.
Link: https://lore.kernel.org/r/20210707184351.67872-11-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The ELS job request structure, that is allocated while issuing ELS RDF/SCR
request path, is not being released in an error path causing a memory leak
message on driver unload.
Free the ELS job structure in the error paths.
Link: https://lore.kernel.org/r/20210707184351.67872-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
RDF ELS handling for NPIV ports may result in an incorrect NDLP reference
count. In the event of a persistent link down, this may lead to premature
release of an NDLP structure and subsequent NULL ptr dereference panic.
Remove extraneous lpfc_nlp_put() call in NPIV port RDF processing.
Link: https://lore.kernel.org/r/20210707184351.67872-9-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
If a LOGO is received for a node that is in the NPR state, post a DEVICE_RM
event to allow clean up of the logged out node.
Clearing the NLP_NPR_2B_DISC flag upon receipt of a LOGO ACC may cause
skipping of processing outstanding PLOGIs triggered by parallel RSCN
events. If an outstanding PLOGI is being retried and receipt of a LOGO ACC
occurs, then allow the discovery state machine's PLOGI completion to clean
up the node.
Link: https://lore.kernel.org/r/20210707184351.67872-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Implement ELS commands QFPA and UVEM for the priority tagging appid
support.
Link: https://lore.kernel.org/r/20210608043556.274139-8-muneendra.kumar@broadcom.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Gaurav Srivastava <gaurav.srivastava@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
FC-LS-5 specifies that a received RDF implies a possible change to fabric
supported diagnostic functions. Endpoints are to re-perform the RDF
exchange with the fabric to enable possible new features or adapt to
changes in values.
This patch adds the logic to RDF receive to re-perform the RDF exchange
with the switch.
Link: https://lore.kernel.org/r/20210514195559.119853-11-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During link bounce testing, RPI counts were seen to differ from the number
of nodes. For fabric and domain controllers, a temporary RPI is assigned,
but the code isn't registering it. If the nodes do go away, such as on link
down, the temporary RPI isn't being released.
Change the way these two fabric services are managed, make them behave like
any other remote port. Register the RPI and register with the transport.
Never leave the nodes in a NPR or UNUSED state where their RPI is in limbo.
This allows them to follow normal dev_loss_tmo handling, RPI refcounting,
and normal removal rules. It also allows fabric I/Os to use the RPI for
traffic requests.
Note: There is some logic that still has a couple of exceptions when the
Domain controller (0xfffcXX). There are cases where the fabric won't have a
valid login but will send RDP. Other times, it will it send a LOGO then an
RDP. It makes for ad-hoc behavior to manage the node. Exceptions are
documented in the code.
Link: https://lore.kernel.org/r/20210514195559.119853-7-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
When lpfc is handling a solicited and unsolicited PLOGI with another
initiator, the remote initiator is never recovered. The node for the
initiator is erroneouosly removed and all resources released.
In lpfc_cmpl_els_plogi(), when lpfc_els_retry() returns a failure code, the
driver is calling the state machine with a device remove event because the
remote port is not currently registered with the SCSI or NVMe
transports. The issue is that on a PLOGI "collision" the driver correctly
aborts the solicited PLOGI and allows the unsolicited PLOGI to complete the
process, but this process is interrupted with a device_rm event.
Introduce logic in the PLOGI completion to capture the PLOGI collision
event and jump out of the routine. This will avoid removal of the node.
If there is no collision, the normal node removal will occur.
Fixes: 52edb2caf675 ("scsi: lpfc: Remove ndlp when a PLOGI/ADISC/PRLI/REG_RPI ultimately fails")
Cc: <stable@vger.kernel.org> # v5.11+
Link: https://lore.kernel.org/r/20210514195559.119853-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
An 'unexpected timeout' message may be seen in a point-2-point topology.
The message occurs when a PLOGI is received before the driver is notified
of FLOGI completion. The FLOGI completion failure causes discovery to be
triggered for a second time. The discovery timer is restarted but no new
discovery activity is initiated, thus the timeout message eventually
appears.
In point-2-point, when discovery has progressed before the FLOGI completion
is processed, it is not a failure. Add code to FLOGI completion to detect
that discovery has progressed and exit the FLOGI handling (noop'ing it).
Link: https://lore.kernel.org/r/20210514195559.119853-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
While testing NPIV and watching logins and used RPI levels, it was seen the
used RPI count was much higher than the number of remote ports discovered.
Code inspection showed that remote port removals on any NPIV instance are
releasing the RPI, but not performing an UNREG_RPI with the adapter thus
the reference counting never fully drops and the RPI is never fully
released. This was happening on NPIV nodes due to a log of fabric ELS's to
fabric addresses. This lack of UNREG_RPI was introduced by a prior node
rework patch that performed the UNREG_RPI as part of node cleanup.
To resolve the issue, do the following:
- Restore the RPI release code, but move the location to so that it is in
line with the new node cleanup design.
- NPIV ports now release the RPI and drop the node when the caller sets
the NLP_RELEASE_RPI flag.
- Set the NLP_RELEASE_RPI flag in node cleanup which will trigger a
release of RPI to free pool.
- Ensure there's an UNREG_RPI at LOGO completion so that RPI release is
completed.
- Stop offline_prep from skipping nodes that are UNUSED. The RPI may
not have been released.
- Stop the default RPI handling in lpfc_cmpl_els_rsp() for SLI4.
- Fixed up debugfs RPI displays for better debugging.
Fixes: a70e63eee1c1 ("scsi: lpfc: Fix NPIV Fabric Node reference counting")
Link: https://lore.kernel.org/r/20210514195559.119853-2-jsmart2021@gmail.com
Cc: <stable@vger.kernel.org> # v5.11+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Clean up minor issues spotted by tools and code review:
- Spelling Errors
- Spurious characters and errors in function headers
- nvme_info wqerr and err fields source data reversed
- Extraneous new line in log message 0466
- Spacing error in log message 0109
- Messages 0140 and 0141 have portname and nodename reversed
- Incorrect function labelling in comment
Link: https://lore.kernel.org/r/20210412013127.2387-13-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During target port swap, the swap logic ignores the DROPPED flag in the
nodes. As a node then moves into the UNUSED state, the reference count will
be dropped. If a node is later reused and moved out of the UNUSED state, an
access can result in a use-after-free assert.
Fix by having the port swap logic propagate the DROPPED flag when switching
nodes. This will avoid reference from being dropped.
Link: https://lore.kernel.org/r/20210412013127.2387-8-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
During target port-swap testing with link flips, the initiator could
encounter PRLI errors. If the target node disappears permanently, the ndlp
is found stuck in UNUSED state with ref count of 1. The rmmod of the driver
will hang waiting for this node to be freed.
While handling a link error in PRLI completion path, the code intends to
skip triggering the discovery state machine. However this is causing the
final reference release path to be skipped. This causes the node to be
stuck with ref count of 1
Fix by ensuring the code path triggers the device removal event on the node
state machine.
Link: https://lore.kernel.org/r/20210412013127.2387-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|