Age | Commit message (Collapse) | Author | Files | Lines |
|
Disable the VFIO_UPDATE_VADDR capability if mediated devices are present.
Their kernel threads could be blocked indefinitely by a misbehaving
userland while trying to pin/unpin pages while vaddrs are being updated.
Do not allow groups to be added to the container while vaddr's are invalid,
so we never need to block user threads from pinning, and can delete the
vaddr-waiting code in a subsequent patch.
Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1675184289-267876-2-git-send-email-steven.sistare@oracle.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The optional PRE_COPY states open the saving data transfer FD before
reaching STOP_COPY and allows the device to dirty track internal state
changes with the general idea to reduce the volume of data transferred
in the STOP_COPY stage.
While in PRE_COPY the device remains RUNNING, but the saving FD is open.
Only if the device also supports RUNNING_P2P can it support PRE_COPY_P2P,
which halts P2P transfers while continuing the saving FD.
PRE_COPY, with P2P support, requires the driver to implement 7 new arcs
and exists as an optional FSM branch between RUNNING and STOP_COPY:
RUNNING -> PRE_COPY -> PRE_COPY_P2P -> STOP_COPY
A new ioctl VFIO_MIG_GET_PRECOPY_INFO is provided to allow userspace to
query the progress of the precopy operation in the driver with the idea it
will judge to move to STOP_COPY at least once the initial data set is
transferred, and possibly after the dirty size has shrunk appropriately.
This ioctl is valid only in PRE_COPY states and kernel driver should
return -EINVAL from any other migration state.
Compared to the v1 clarification, STOP_COPY -> PRE_COPY is blocked
and to be defined in future.
We also split the pending_bytes report into the initial and sustaining
values, e.g.: initial_bytes and dirty_bytes.
initial_bytes: Amount of initial precopy data.
dirty_bytes: Device state changes relative to data previously retrieved.
These fields are not required to have any bearing to STOP_COPY phase.
It is recommended to leave PRE_COPY for STOP_COPY only after the
initial_bytes field reaches zero. Leaving PRE_COPY earlier might make
things slower.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Add an option to get migration data size by introducing a new migration
feature named VFIO_DEVICE_FEATURE_MIG_DATA_SIZE.
Upon VFIO_DEVICE_FEATURE_GET the estimated data length that will be
required to complete STOP_COPY is returned.
This option may better enable user space to consider before moving to
STOP_COPY whether it can meet the downtime SLA based on the returned
data.
The patch also includes the implementation for mlx5 and hisi for this
new option to make it feature complete for the existing drivers in this
area.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20221106174630.25909-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
This patch introduces the DMA logging involved uAPIs.
It uses the FEATURE ioctl with its GET/SET/PROBE options as of below.
It exposes a PROBE option to detect if the device supports DMA logging.
It exposes a SET option to start device DMA logging in given IOVAs
ranges.
It exposes a SET option to stop device DMA logging that was previously
started.
It exposes a GET option to read back and clear the device DMA log.
Extra details exist as part of vfio.h per a specific option.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
This patch adds the following new device features for the low
power entry and exit in the header file. The implementation for the
same will be added in the subsequent patches.
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
- VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
For vfio-pci based devices, with the standard PCI PM registers,
all power states cannot be achieved. The platform-based power management
needs to be involved to go into the lowest power state. For doing low
power entry and exit with platform-based power management,
these device features can be used.
The entry device feature has two variants. These two variants are mainly
to support the different behaviour for the low power entry.
If there is any access for the VFIO device on the host side, then the
device will be moved out of the low power state without the user's
guest driver involvement. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can return the
device to low power automatically. The device will continue to
attempt to reach low power until the low power exit feature is called.
In the second variant, if the device exits low power due to an access,
the host kernel will signal the user via the provided eventfd and will
not return the device to low power without a subsequent call to one of
the low power entry features. A call to the low power exit feature is
optional if the user provided eventfd is signaled.
These device features only support VFIO_DEVICE_FEATURE_SET and
VFIO_DEVICE_FEATURE_PROBE operations.
Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220829114850.4341-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
There is no macro called _IORW, so use _IOWR in the comment instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220516101202.88373-1-thuth@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
v1 was never implemented and is replaced by v2.
The old uAPI documentation is removed from the header file.
The old uAPI definitions are still kept in the header file to ease
transition for userspace copying these headers. They will be fully
removed down the road.
Link: https://lore.kernel.org/all/20220224142024.147653-12-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
The RUNNING_P2P state is designed to support multiple devices in the same
VM that are doing P2P transactions between themselves. When in RUNNING_P2P
the device must be able to accept incoming P2P transactions but should not
generate outgoing P2P transactions.
As an optional extension to the mandatory states it is defined as
in between STOP and RUNNING:
STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
For drivers that are unable to support RUNNING_P2P the core code
silently merges RUNNING_P2P and RUNNING together. Unless driver support
is present, the new state cannot be used in SET_STATE.
Drivers that support this will be required to implement 4 FSM arcs
beyond the basic FSM. 2 of the basic FSM arcs become combination
transitions.
Compared to the v1 clarification, NDMA is redefined into FSM states and is
described in terms of the desired P2P quiescent behavior, noting that
halting all DMA is an acceptable implementation.
Link: https://lore.kernel.org/all/20220224142024.147653-11-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.
This is the STOP_COPY portion of the new protocol, it defines the 5 states
for basic stop and copy migration and the protocol to move the migration
data in/out of the kernel.
Compared to the clarification of the v1 protocol Alex proposed:
https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
This has a few deliberate functional differences:
- ERROR arcs allow the device function to remain unchanged.
- The protocol is not required to return to the original state on
transition failure. Instead userspace can execute an unwind back to
the original state, reset, or do something else without needing kernel
support. This simplifies the kernel design and should userspace choose
a policy like always reset, avoids doing useless work in the kernel
on error handling paths.
- PRE_COPY is made optional, userspace must discover it before using it.
This reflects the fact that the majority of drivers we are aware of
right now will not implement PRE_COPY.
- segmentation is not part of the data stream protocol, the receiver
does not have to reproduce the framing boundaries.
The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. Defining
the remaining set of old/new device_state transitions as 'combination
transitions' which are naturally defined as taking multiple FSM arcs along
the shortest path within the FSM's digraph allows a complete matrix of
transitions.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
defined to replace writing to the device_state field in the region. This
allows returning a brand new FD whenever the requested transition opens
a data transfer session.
The VFIO core code implements the new feature and provides a helper
function to the driver. Using the helper the driver only has to
implement 6 of the FSM arcs and the other combination transitions are
elaborated consistently from those arcs.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.
The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
of the VFIO device.
Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.
The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.
Link: https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Revert the uAPI changes from the below commit with notice that these
regions and capabilities are no longer provided.
Fixes: b392a1989170 ("vfio/pci: remove vfio_pci_nvlink2")
Reported-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <162014341432.3807030.11054087109120670135.stgit@omen>
|
|
This driver never had any open userspace (which for VFIO would include
VM kernel drivers) that use it, and thus should never have been added
by our normal userspace ABI rules.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <20210326061311.1497642-2-hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Define interfaces that allow the underlying memory object of an iova
range to be mapped to a new host virtual address in the host process:
- VFIO_DMA_UNMAP_FLAG_VADDR for VFIO_IOMMU_UNMAP_DMA
- VFIO_DMA_MAP_FLAG_VADDR flag for VFIO_IOMMU_MAP_DMA
- VFIO_UPDATE_VADDR extension for VFIO_CHECK_EXTENSION
Unmap vaddr invalidates the host virtual address in an iova range, and
blocks vfio translation of host virtual addresses. DMA to already-mapped
pages continues. Map vaddr updates the base VA and resumes translation.
See comments in uapi/linux/vfio.h for more details.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
For the UNMAP_DMA ioctl, delete all mappings if VFIO_DMA_UNMAP_FLAG_ALL
is set. Define the corresponding VFIO_UNMAP_ALL extension.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The device is being unplugged, so pass the request to userspace to
ask for a graceful cleanup. This should free up the thread that
would otherwise loop waiting for the device to be fully released.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
v5.10/vfio/next
|
|
Allow the VFIO_DEVICE_GET_INFO ioctl to include a capability chain.
Add a flag indicating capability chain support, and introduce the
definitions for the first set of capabilities which are specified to
s390 zPCI devices.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
DPAA2 (Data Path Acceleration Architecture) consists in
mechanisms for processing Ethernet packets, queue management,
accelerators, etc.
The Management Complex (mc) is a hardware entity that manages the DPAA2
hardware resources. It provides an object-based abstraction for software
drivers to use the DPAA2 hardware. The MC mediates operations such as
create, discover, destroy of DPAA2 objects.
The MC provides memory-mapped I/O command interfaces (MC portals) which
DPAA2 software drivers use to operate on DPAA2 objects.
A DPRC is a container object that holds other types of DPAA2 objects.
Each object in the DPRC is a Linux device and bound to a driver.
The MC-bus driver is a platform driver (different from PCI or platform
bus). The DPRC driver does runtime management of a bus instance. It
performs the initial scan of the DPRC and handles changes in the DPRC
configuration (adding/removing objects).
All objects inside a container share the same hardware isolation
context, meaning that only an entire DPRC can be assigned to
a virtual machine.
When a container is assigned to a virtual machine, all the objects
within that container are assigned to that virtual machine.
The DPRC container assigned to the virtual machine is not allowed
to change contents (add/remove objects) by the guest. The restriction
is set by the host and enforced by the mc hardware.
The DPAA2 objects can be directly assigned to the guest. However
the MC portals (the memory mapped command interface to the MC) need
to be emulated because there are commands that configure the
interrupts and the isolation IDs which are virtual in the guest.
Example:
echo vfio-fsl-mc > /sys/bus/fsl-mc/devices/dprc.2/driver_override
echo dprc.2 > /sys/bus/fsl-mc/drivers/vfio-fsl-mc/bind
The dprc.2 is bound to the VFIO driver and all the objects within
dprc.2 are going to be bound to the VFIO driver.
This patch adds the infrastructure for VFIO support for fsl-mc
devices. Subsequent patches will add support for binding and secure
assigning these devices using VFIO.
More details about the DPAA2 objects can be found here:
Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@nxp.com>
Signed-off-by: Diana Craciun <diana.craciun@oss.nxp.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
'v5.10/vfio/misc', 'v5.10/vfio/no-cmd-mem' and 'v5.10/vfio/yan_zhao_fixes' into v5.10/vfio/next
|
|
Commit 492855939bdb ("vfio/type1: Limit DMA mappings per container")
added the ability to limit the number of memory backed DMA mappings.
However on s390x, when lazy mapping is in use, we use a very large
number of concurrent mappings. Let's provide the current allowable
number of DMA mappings to userspace via the IOMMU info chain so that
userspace can take appropriate mitigation.
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
A typo fix ("_RUNNNG" => "_RUNNING") in comment block of the uapi header.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
ID 1 is already used by the IOVA range capability, use ID 2.
Reported-by: Liu Yi L <yi.l.liu@intel.com>
Fixes: ad721705d09c ("vfio iommu: Add migration capability to report supported features")
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Add support for multi-function devices in pci code.
- Enable PF-VF linking for architectures using the pdev->no_vf_scan
flag (currently just s390).
- Add reipl from NVMe support.
- Get rid of critical section cleanup in entry.S.
- Refactor PNSO CHSC (perform network subchannel operation) in cio and
qeth.
- QDIO interrupts and error handling fixes and improvements, more
refactoring changes.
- Align ioremap() with generic code.
- Accept requests without the prefetch bit set in vfio-ccw.
- Enable path handling via two new regions in vfio-ccw.
- Other small fixes and improvements all over the code.
* tag 's390-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits)
vfio-ccw: make vfio_ccw_regops variables declarations static
vfio-ccw: Add trace for CRW event
vfio-ccw: Wire up the CRW irq and CRW region
vfio-ccw: Introduce a new CRW region
vfio-ccw: Refactor IRQ handlers
vfio-ccw: Introduce a new schib region
vfio-ccw: Refactor the unregister of the async regions
vfio-ccw: Register a chp_event callback for vfio-ccw
vfio-ccw: Introduce new helper functions to free/destroy regions
vfio-ccw: document possible errors
vfio-ccw: Enable transparent CCW IPL from DASD
s390/pci: Log new handle in clp_disable_fh()
s390/cio, s390/qeth: cleanup PNSO CHSC
s390/qdio: remove q->first_to_kick
s390/qdio: fix up qdio_start_irq() kerneldoc
s390: remove critical section cleanup from entry.S
s390: add machine check SIGP
s390/pci: ioremap() align with generic code
s390/ap: introduce new ap function ap_get_qdev()
Documentation/s390: Update / remove developerWorks web links
...
|
|
This region provides a mechanism to pass a Channel Report Word
that affect vfio-ccw devices, and needs to be passed to the guest
for its awareness and/or processing.
The base driver (see crw_collect_info()) provides space for two
CRWs, as a subchannel event may have two CRWs chained together
(one for the ssid, one for the subchannel). As vfio-ccw will
deal with everything at the subchannel level, provide space
for a single CRW to be transferred in one shot.
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20200505122745.53208-7-farman@linux.ibm.com>
[CH: added padding to ccw_crw_region]
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
|
|
The schib region can be used by userspace to get the subchannel-
information block (SCHIB) for the passthrough subchannel.
This can be useful to get information such as channel path
information via the SCHIB.PMCW fields.
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20200505122745.53208-5-farman@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
|
|
Added migration capability in IOMMU info chain.
User application should check IOMMU info chain for migration capability
to use dirty page tracking feature provided by kernel module.
User application must check page sizes supported and maximum dirty
bitmap size returned by this capability structure for ioctls used to get
dirty bitmap.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
DMA mapped pages, including those pinned by mdev vendor drivers, might
get unpinned and unmapped while migration is active and device is still
running. For example, in pre-copy phase while guest driver could access
those pages, host device or vendor driver can dirty these mapped pages.
Such pages should be marked dirty so as to maintain memory consistency
for a user making use of dirty page tracking.
To get bitmap during unmap, user should allocate memory for bitmap, set
it all zeros, set size of allocated memory, set page size to be
considered for bitmap and set flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
IOMMU container maintains a list of all pages pinned by vfio_pin_pages API.
All pages pinned by vendor driver through this API should be considered as
dirty during migration. When container consists of IOMMU capable device and
all pages are pinned and mapped, then all pages are marked dirty.
Added support to start/stop dirtied pages tracking and to get bitmap of all
dirtied pages for requested IO virtual address range.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
- Defined MIGRATION region type and sub-type.
- Defined vfio_device_migration_info structure which will be placed at the
0th offset of migration region to get/set VFIO device related
information. Defined members of structure and usage on read/write access.
- Defined device states and state transition details.
- Defined sequence to be followed while saving and resuming VFIO device.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The VFIO_DEVICE_FEATURE ioctl is meant to be a general purpose, device
agnostic ioctl for setting, retrieving, and probing device features.
This implementation provides a 16-bit field for specifying a feature
index, where the data porition of the ioctl is determined by the
semantics for the given feature. Additional flag bits indicate the
direction and nature of the operation; SET indicates user data is
provided into the device feature, GET indicates the device feature is
written out into user data. The PROBE flag augments determining
whether the given feature is supported, and if provided, whether the
given operation on the feature is supported.
The first user of this ioctl is for setting the vfio-pci VF token,
where the user provides a shared secret key (UUID) on a SR-IOV PF
device, which users must provide when opening associated VF devices.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
'v5.4/vfio/connie-re-arrange-v2', 'v5.4/vfio/hexin-pci-reset-v3', 'v5.4/vfio/parav-mtty-uuid-v2' and 'v5.4/vfio/shameer-iova-list-v8' into v5.4/vfio/next
|
|
This allows the user-space to retrieve the supported IOVA
range(s), excluding any non-relaxable reserved regions. The
implementation is based on capability chains, added to
VFIO_IOMMU_GET_INFO ioctl.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
It is easy to miss already defined region types. Let's re-arrange
the definitions a bit and add more comments to make it hopefully
a bit clearer.
No functional change.
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Add a region to the vfio-ccw device that can be used to submit
asynchronous I/O instructions. ssch continues to be handled by the
existing I/O region; the new region handles hsch and csch.
Interrupt status continues to be reported through the same channels
as for ssch.
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
|
|
Allow to extend the regions used by vfio-ccw. The first user will be
handling of halt and clear subchannel.
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
|
|
POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
pluggable PCIe devices but still have PCIe links which are used
for config space and MMIO. In addition to that the GPUs have 6 NVLinks
which are connected to other GPUs and the POWER9 CPU. POWER9 chips
have a special unit on a die called an NPU which is an NVLink2 host bus
adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
These systems also support ATS (address translation services) which is
a part of the NVLink2 protocol. Such GPUs also share on-board RAM
(16GB or 32GB) to the system via the same NVLink2 so a CPU has
cache-coherent access to a GPU RAM.
This exports GPU RAM to the userspace as a new VFIO device region. This
preregisters the new memory as device memory as it might be used for DMA.
This inserts pfns from the fault handler as the GPU memory is not onlined
until the vendor driver is loaded and trained the NVLinks so doing this
earlier causes low level errors which we fence in the firmware so
it does not hurt the host system but still better be avoided; for the same
reason this does not map GPU RAM into the host kernel (usual thing for
emulated access otherwise).
This exports an ATSD (Address Translation Shootdown) register of NPU which
allows TLB invalidations inside GPU for an operating system. The register
conveniently occupies a single 64k page. It is also presented to
the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
each of them can be used for TLB invalidation in a GPU linked to this NPU.
This allocates one ATSD register per an NVLink bridge allowing passing
up to 6 registers. Due to the host firmware bug (just recently fixed),
only 1 ATSD register per NPU was actually advertised to the host system
so this passes that alone register via the first NVLink bridge device in
the group which is still enough as QEMU collects them all back and
presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
In order to provide the userspace with the information about GPU-to-NVLink
connections, this exports an additional capability called "tgt"
(which is an abbreviated host system bus address). The "tgt" property
tells the GPU its own system address and allows the guest driver to
conglomerate the routing information so each GPU knows how to get directly
to the other GPUs.
For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
know LPID (a logical partition ID or a KVM guest hardware ID in other
words) and PID (a memory context ID of a userspace process, not to be
confused with a linux pid). This assigns a GPU to LPID in the NPU and
this is why this adds a listener for KVM on an IOMMU group. A PID comes
via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
This requires coherent memory and ATSD to be available on the host as
the GPU vendor only supports configurations with both features enabled
and other configurations are known not to work. Because of this and
because of the ways the features are advertised to the host system
(which is a device tree with very platform specific properties),
this requires enabled POWERNV platform.
The V100 GPUs do not advertise any of these capabilities via the config
space and there are more than just one device ID so this relies on
the platform to tell whether these GPUs have special abilities such as
NVLinks.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Pull VFIO updates from Alex Williamson:
- EDID interfaces for vfio devices supporting display extensions (Gerd
Hoffmann)
- Generically select Type-1 IOMMU model support on ARM/ARM64 (Geert
Uytterhoeven)
- Quirk for VFs reporting INTx pin (Alex Williamson)
- Fix error path memory leak in MSI support (Li Qiang)
* tag 'vfio-v4.20-rc1.v2' of git://github.com/awilliam/linux-vfio:
vfio: add edid support to mbochs sample driver
vfio: add edid api for display (vgpu) devices.
drivers/vfio: Allow type-1 IOMMU instantiation with all ARM/ARM64 IOMMUs
vfio/pci: Mask buggy SR-IOV VF INTx support
vfio/pci: Fix potential memory leak in vfio_msi_cap_len
|
|
This allows to set EDID monitor information for the vgpu display, for a
more flexible display configuration, using a special vfio region. Check
the comment describing struct vfio_region_gfx_edid for more details.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Adds support for the VFIO_DEVICE_GET_INFO ioctl to the VFIO
AP Matrix device driver. This is a minimal implementation,
as vfio-ap does not use I/O regions.
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Tested-by: Michael Mueller <mimu@linux.ibm.com>
Tested-by: Farhan Ali <alifm@linux.ibm.com>
Tested-by: Pierre Morel <pmorel@linux.ibm.com>
Message-Id: <20180925231641.4954-13-akrowiak@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
Registers the matrix device created by the VFIO AP device
driver with the VFIO mediated device framework.
Registering the matrix device will create the sysfs
structures needed to create mediated matrix devices
each of which will be used to configure the AP matrix
for a guest and connect it to the VFIO AP device driver.
Registering the matrix device with the VFIO mediated device
framework will create the following sysfs structures:
/sys/devices/vfio_ap/matrix/
...... [mdev_supported_types]
......... [vfio_ap-passthrough]
............ create
To create a mediated device for the AP matrix device, write a UUID
to the create file:
uuidgen > create
A symbolic link to the mediated device's directory will be created in the
devices subdirectory named after the generated $uuid:
/sys/devices/vfio_ap/matrix/
...... [mdev_supported_types]
......... [vfio_ap-passthrough]
............ [devices]
............... [$uuid]
A symbolic link to the mediated device will also be created
in the vfio_ap matrix's directory:
/sys/devices/vfio_ap/matrix/[$uuid]
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Tested-by: Michael Mueller <mimu@linux.ibm.com>
Tested-by: Farhan Ali <alifm@linux.ibm.com>
Message-Id: <20180925231641.4954-6-akrowiak@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
The ioeventfd here is actually irqfd handling of an ioeventfd such as
supported in KVM. A user is able to pre-program a device write to
occur when the eventfd triggers. This is yet another instance of
eventfd-irqfd triggering between KVM and vfio. The impetus for this
is high frequency writes to pages which are virtualized in QEMU.
Enabling this near-direct write path for selected registers within
the virtualized page can improve performance and reduce overhead.
Specifically this is initially targeted at NVIDIA graphics cards where
the driver issues a write to an MMIO register within a virtualized
region in order to allow the MSI interrupt to re-trigger.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Pull drm updates from Dave Airlie:
"This seems to have been a comparatively quieter merge window, I assume
due to holidays etc. The "biggest" change is AMD header cleanups, which
merge/remove a bunch of them. The AMD gpu scheduler is now being made generic
with the etnaviv driver wanting to reuse the code, hopefully other drivers
can go in the same direction.
Otherwise it's the usual lots of stuff in i915/amdgpu, not so much stuff
elsewhere.
Core:
- Add .last_close and .output_poll_changed helpers to reduce driver footprints
- Fix plane clipping
- Improved debug printing support
- Add panel orientation property
- Update edid derived properties at edid setting
- Reduction in fbdev driver footprint
- Move amdgpu scheduler into core for other drivers to use.
i915:
- Selftest and IGT improvements
- Fast boot prep work on IPS, pipe config
- HW workarounds for Cannonlake, Geminilake
- Cannonlake clock and HDMI2.0 fixes
- GPU cache invalidation and context switch improvements
- Display planes cleanup
- New PMU interface for perf queries
- New firmware support for KBL/SKL
- Geminilake HW workaround for perforamce
- Coffeelake stolen memory improvements
- GPU reset robustness work
- Cannonlake horizontal plane flipping
- GVT work
amdgpu/radeon:
- RV and Vega header file cleanups (lots of lines gone!)
- TTM operation context support
- 48-bit GPUVM support for Vega/RV
- ECC support for Vega
- Resizeable BAR support
- Multi-display sync support
- Enable swapout for reserved BOs during allocation
- S3 fixes on Raven
- GPU reset cleanup and fixes
- 2+1 level GPU page table
amdkfd:
- GFX7/8 SDMA user queues support
- Hardware scheduling for multiple processes
- dGPU prep work
rcar:
- Added R8A7743/5 support
- System suspend/resume support
sun4i:
- Multi-plane support for YUV formats
- A83T and LVDS support
msm:
- Devfreq support for GPU
tegra:
- Prep work for adding Tegra186 support
- Tegra186 HDMI support
- HDMI2.0 and zpos support by using generic helpers
tilcdc:
- Misc fixes
omapdrm:
- Support memory bandwidth limits
- DSI command mode panel cleanups
- DMM error handling
exynos:
- drop the old IPP subdriver.
etnaviv:
- Occlusion query fixes
- Job handling fixes
- Prep work for hooking in gpu scheduler
armada:
- Move closer to atomic modesetting
- Allow disabling primary plane if overlay is full screen
imx:
- Format modifier support
- Add tile prefetch to PRE
- Runtime PM support for PRG
ast:
- fix LUT loading"
* tag 'drm-for-v4.16' of git://people.freedesktop.org/~airlied/linux: (1471 commits)
drm/ast: Load lut in crtc_commit
drm: Check for lessee in DROP_MASTER ioctl
drm: fix gpu scheduler link order
drm/amd/display: Demote error print to debug print when ATOM impl missing
dma-buf: fix reservation_object_wait_timeout_rcu once more v2
drm/amdgpu: Avoid leaking PM domain on driver unbind (v2)
drm/amd/amdgpu: Add Polaris version check
drm/amdgpu: Reenable manual GPU reset from sysfs
drm/amdgpu: disable MMHUB power gating on raven
drm/ttm: Don't unreserve swapped BOs that were previously reserved
drm/ttm: Don't add swapped BOs to swap-LRU list
drm/amdgpu: only check for ECC on Vega10
drm/amd/powerplay: Fix smu_table_entry.handle type
drm/ttm: add VADDR_FLAG_UPDATED_COUNT to correctly update dma_page global count
drm: Fix PANEL_ORIENTATION_QUIRKS breaking the Kconfig DRM menuconfig
drm/radeon: fill in rb backend map on evergreen/ni.
drm/amdgpu/gfx9: fix ngg enablement to clear gds reserved memory (v2)
drm/ttm: only free pages rather than update global memory count together
drm/amdgpu: fix CPU based VM updates
drm/amdgpu: fix typo in amdgpu_vce_validate_bo
...
|
|
By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.
This allows mmap of the entire BAR containing MSIX vector table.
This removes the sparse capability for PCI devices as it becomes useless.
As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
that the entire BAR can be mmapped.
This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Chris requested this backmerge for a reconciliation on
drm_print.h between drm-misc-next and drm-intel-next-queued
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Add VFIO_DEVICE_QUERY_GFX_PLANE ioctl command to let user query and get
a plane and its information. So far, two types of buffers are supported:
buffers based on dma-buf and buffers based on region.
This ioctl can be invoked with:
1) Either DMABUF or REGION flag. Vendor driver returns a plane_info
successfully only when the specific kind of buffer is supported.
2) Flag PROBE. And at the same time either DMABUF or REGION must be set,
so that vendor driver returns success only when the specific kind of
buffer is supported.
Add VFIO_DEVICE_GET_GFX_DMABUF ioctl command to let user get a specific
dma-buf fd of an exposed MDEV buffer provided by dmabuf_id which was
returned in VFIO_DEVICE_QUERY_GFX_PLANE ioctl command.
The life cycle of an exposed MDEV buffer is handled by userspace and
tracked by kernel space. The returned dmabuf_id in struct vfio_device_
query_gfx_plane can be a new id of a new exposed buffer or an old id of
a re-exported buffer. Host user can check the value of dmabuf_id to see
if it needs to create new resources according to the new exposed buffer
or just re-use the existing resource related to the old buffer.
v18:
- update comments for VFIO_DEVICE_GET_GFX_DMABUF. (Alex)
v17:
- modify VFIO_DEVICE_GET_GFX_DMABUF interface. (Alex)
v16:
- add x_hot and y_hot fields. (Gerd)
- add comments for VFIO_DEVICE_GET_GFX_DMABUF. (Alex)
- rebase to 4.14.0-rc6.
v15:
- add a ioctl to get a dmabuf for a given dmabuf id. (Gerd)
v14:
- add PROBE, DMABUF and REGION flags. (Alex)
v12:
- add drm_format_mod back. (Gerd and Zhenyu)
- add region_index. (Gerd)
v11:
- rename plane_type to drm_plane_type. (Gerd)
- move fields of vfio_device_query_gfx_plane to vfio_device_gfx_plane_info.
(Gerd)
- remove drm_format_mod, start fields. (Daniel)
- remove plane_id.
v10:
- refine the ABI API VFIO_DEVICE_QUERY_GFX_PLANE. (Alex) (Gerd)
v3:
- add a field gvt_plane_info in the drm_i915_gem_obj structure to save
the decoded plane information to avoid look up while need the plane
info. (Gerd)
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
|
|
Many user space API headers have licensing information, which is either
incomplete, badly formatted or just a shorthand for referring to the
license under which the file is supposed to be. This makes it hard for
compliance tools to determine the correct license.
Update these files with an SPDX license identifier. The identifier was
chosen based on the license information in the file.
GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license
identifier with the added 'WITH Linux-syscall-note' exception, which is
the officially assigned exception identifier for the kernel syscall
exception:
NOTE! This copyright does *not* cover user programs that use kernel
services by normal system calls - this is merely considered normal use
of the kernel, and does *not* fall under the heading of "derived work".
This exception makes it possible to include GPL headers into non GPL
code, without confusing license compliance tools.
Headers which have either explicit dual licensing or are just licensed
under a non GPL license are updated with the corresponding SPDX
identifier and the GPLv2 with syscall exception identifier. The format
is:
((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE)
SPDX license identifiers are a legally binding shorthand, which can be
used instead of the full boiler plate text. The update does not remove
existing license information as this has to be done on a case by case
basis and the copyright holders might have to be consulted. This will
happen in a separate step.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne. See the previous patch in this series for the
methodology of how this patch was researched.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Realize VFIO_DEVICE_GET_IRQ_INFO ioctl to retrieve
VFIO_CCW_IO_IRQ information.
Realize VFIO_DEVICE_SET_IRQS ioctl to set an eventfd fd for
VFIO_CCW_IO_IRQ. Once a write operation to the ccw_io_region
was performed, trigger a signal on this fd.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-12-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
|
|
Introduce device information about vfio-ccw: VFIO_DEVICE_FLAGS_CCW.
Realize VFIO_DEVICE_GET_REGION_INFO ioctl for vfio-ccw.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-10-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
|
|
Define vfio-ccw device API strings. CCW vendor driver using mediated
device framework should use this string for device_api attribute.
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170317031743.40128-4-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
|
|
Defined device API strings. Vendor driver using mediated device
framework should use corresponding string for device_api attribute.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Provide read-only access to PCI config space of the PCI host bridge
and LPC bridge through device specific regions. This may be used to
configure a VM with matching register contents to satisfy driver
requirements. Providing this through the vfio file descriptor removes
an additional userspace requirement for access through pci-sysfs and
removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to
the specific devices we're accessing.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|