Diffstat (limited to 'Documentation/virt')
-rw-r--r--  Documentation/virt/hyperv/clocks.rst                  73
-rw-r--r--  Documentation/virt/hyperv/index.rst                   12
-rw-r--r--  Documentation/virt/hyperv/overview.rst               207
-rw-r--r--  Documentation/virt/hyperv/vmbus.rst                  303
-rw-r--r--  Documentation/virt/index.rst                           1
-rw-r--r--  Documentation/virt/kvm/api.rst                       367
-rw-r--r--  Documentation/virt/kvm/arm/hyp-abi.rst                11
-rw-r--r--  Documentation/virt/kvm/s390/index.rst                  1
-rw-r--r--  Documentation/virt/kvm/s390/s390-pv-boot.rst           2
-rw-r--r--  Documentation/virt/kvm/s390/s390-pv-dump.rst          64
-rw-r--r--  Documentation/virt/kvm/x86/hypercalls.rst              2
-rw-r--r--  Documentation/virt/uml/user_mode_linux_howto_v2.rst    2
12 files changed, 1027 insertions, 18 deletions
diff --git a/Documentation/virt/hyperv/clocks.rst b/Documentation/virt/hyperv/clocks.rst
new file mode 100644
index 000000000000..2da2879fad52
--- /dev/null
+++ b/Documentation/virt/hyperv/clocks.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Clocks and Timers
+=================
+
+arm64
+-----
+On arm64, Hyper-V virtualizes the ARMv8 architectural system counter
+and timer. Guest VMs use this virtualized hardware as the Linux
+clocksource and clockevents via the standard arm_arch_timer.c
+driver, just as they would on bare metal. Linux vDSO support for the
+architectural system counter is functional in guest VMs on Hyper-V.
+While Hyper-V also provides a synthetic system clock and four synthetic
+per-CPU timers as described in the TLFS, they are not used by the
+Linux kernel in a Hyper-V guest on arm64. However, older versions
+of Hyper-V for arm64 only partially virtualize the ARMv8
+architectural timer, such that the timer does not generate
+interrupts in the VM. Because of this limitation, running current
+Linux kernel versions on these older Hyper-V versions requires an
+out-of-tree patch to use the Hyper-V synthetic clocks/timers instead.
+
+x86/x64
+-------
+On x86/x64, Hyper-V provides guest VMs with a synthetic system clock
+and four synthetic per-CPU timers as described in the TLFS. Hyper-V
+also provides access to the virtualized TSC via the RDTSC and
+related instructions. These TSC instructions do not trap to
+the hypervisor and so provide excellent performance in a VM.
+Hyper-V performs TSC calibration, and provides the TSC frequency
+to the guest VM via a synthetic MSR. Hyper-V initialization code
+in Linux reads this MSR to get the frequency, so it skips TSC
+calibration and sets tsc_reliable. Hyper-V provides virtualized
+versions of the PIT (in Hyper-V Generation 1 VMs only), local
+APIC timer, and RTC. Hyper-V does not provide a virtualized HPET in
+guest VMs.
+
+The Hyper-V synthetic system clock can be read via a synthetic MSR,
+but this access traps to the hypervisor. As a faster alternative,
+the guest can configure a memory page to be shared between the guest
+and the hypervisor. Hyper-V populates this memory page with a
+64-bit scale value and offset value. To read the synthetic clock
+value, the guest reads the TSC and then applies the scale and offset
+as described in the Hyper-V TLFS. The resulting value advances
+at a constant 10 MHz frequency. In the case of a live migration
+to a host with a different TSC frequency, Hyper-V adjusts the
+scale and offset values in the shared page so that the 10 MHz
+frequency is maintained.
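+
+As an illustration, the computation is roughly the following (a
+hedged sketch; the structure layout and names below are illustrative
+rather than the exact kernel definitions)::
+
+   /* Shared "TSC reference page" populated by Hyper-V */
+   struct tsc_ref_page {
+       u32 tsc_sequence;   /* bumped when scale/offset change */
+       u32 reserved;
+       u64 tsc_scale;      /* 64.64 fixed-point multiplier */
+       s64 tsc_offset;
+   };
+
+   /* Read the synthetic clock: 100 ns units, i.e. a 10 MHz rate */
+   static u64 read_hv_ref_clock(const struct tsc_ref_page *p)
+   {
+       u64 tsc = rdtsc();
+
+       /* time = ((tsc * scale) >> 64) + offset */
+       return mul_u64_u64_shr(tsc, p->tsc_scale, 64) + p->tsc_offset;
+   }
+
+A real implementation also re-reads tsc_sequence before and after the
+computation and retries if it changed, so that a concurrent update of
+the scale and offset (for example during live migration) is never
+observed half-way.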
+
+Starting with Windows Server 2022 Hyper-V, Hyper-V uses hardware
+support for TSC frequency scaling to enable live migration of VMs
+across Hyper-V hosts where the TSC frequency may be different.
+When a Linux guest detects that this Hyper-V functionality is
+available, it prefers to use Linux's standard TSC-based clocksource.
+Otherwise, it uses the clocksource for the Hyper-V synthetic system
+clock implemented via the shared page (identified as
+"hyperv_clocksource_tsc_page").
+
+The Hyper-V synthetic system clock is available to user space via
+vDSO, and gettimeofday() and related system calls can execute
+entirely in user space. The vDSO is implemented by mapping the
+shared page with scale and offset values into user space. User
+space code performs the same algorithm of reading the TSC and
+applying the scale and offset to get the constant 10 MHz clock.
+
+Linux clockevents are based on Hyper-V synthetic timer 0. While
+Hyper-V offers 4 synthetic timers for each CPU, Linux only uses
+timer 0. Interrupts from stimer0 are recorded on the "HVS" line in
+/proc/interrupts. Clockevents based on the virtualized PIT and
+local APIC timer also work, but the Hyper-V synthetic timer is
+preferred.
+
+The driver for the Hyper-V synthetic system clock and timers is
+drivers/clocksource/hyperv_timer.c.
diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst
new file mode 100644
index 000000000000..4a7a1b738bbe
--- /dev/null
+++ b/Documentation/virt/hyperv/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Hyper-V Enlightenments
+======================
+
+.. toctree::
+ :maxdepth: 1
+
+ overview
+ vmbus
+ clocks
diff --git a/Documentation/virt/hyperv/overview.rst b/Documentation/virt/hyperv/overview.rst
new file mode 100644
index 000000000000..cd493332c88a
--- /dev/null
+++ b/Documentation/virt/hyperv/overview.rst
@@ -0,0 +1,207 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Overview
+========
+The Linux kernel contains a variety of code for running as a fully
+enlightened guest on Microsoft's Hyper-V hypervisor. Hyper-V
+consists primarily of a bare-metal hypervisor plus a virtual machine
+management service running in the parent partition (roughly
+equivalent to KVM and QEMU, for example). Guest VMs run in child
+partitions. In this documentation, references to Hyper-V usually
+encompass both the hypervisor and the VMM service without making a
+distinction about which functionality is provided by which
+component.
+
+Hyper-V runs on x86/x64 and arm64 architectures, and Linux guests
+are supported on both. The functionality and behavior of Hyper-V is
+generally the same on both architectures unless noted otherwise.
+
+Linux Guest Communication with Hyper-V
+--------------------------------------
+Linux guests communicate with Hyper-V in four different ways:
+
+* Implicit traps: As defined by the x86/x64 or arm64 architecture,
+ some guest actions trap to Hyper-V. Hyper-V emulates the action and
+ returns control to the guest. This behavior is generally invisible
+ to the Linux kernel.
+
+* Explicit hypercalls: Linux makes an explicit function call to
+ Hyper-V, passing parameters. Hyper-V performs the requested action
+ and returns control to the caller. Parameters are passed in
+ processor registers or in memory shared between the Linux guest and
+ Hyper-V. On x86/x64, hypercalls use a Hyper-V specific calling
+ sequence. On arm64, hypercalls use the ARM standard SMCCC calling
+ sequence.
+
+* Synthetic register access: Hyper-V implements a variety of
+ synthetic registers. On x86/x64 these registers appear as MSRs in
+ the guest, and the Linux kernel can read or write these MSRs using
+ the normal mechanisms defined by the x86/x64 architecture. On
+ arm64, these synthetic registers must be accessed using explicit
+ hypercalls.
+
+* VMbus: VMbus is a higher-level software construct that is built on
+ the other 3 mechanisms. It is a message passing interface between
+ the Hyper-V host and the Linux guest. It uses memory that is shared
+ between Hyper-V and the guest, along with various signaling
+ mechanisms.
+
+The first three communication mechanisms are documented in the
+`Hyper-V Top Level Functional Spec (TLFS)`_. The TLFS describes
+general Hyper-V functionality and provides details on the hypercalls
+and synthetic registers. The TLFS is currently written for the
+x86/x64 architecture only.
+
+.. _Hyper-V Top Level Functional Spec (TLFS): https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs
+
+VMbus is not documented. This documentation provides a high-level
+overview of VMbus and how it works, but the details can be discerned
+only from the code.
+
+Sharing Memory
+--------------
+Many aspects of communication between Hyper-V and Linux are based
+on sharing memory. Such sharing is generally accomplished as
+follows:
+
+* Linux allocates memory from its physical address space using
+ standard Linux mechanisms.
+
+* Linux tells Hyper-V the guest physical address (GPA) of the
+ allocated memory. Many shared areas are kept to 1 page so that a
+ single GPA is sufficient. Larger shared areas require a list of
+ GPAs, which usually do not need to be contiguous in the guest
+ physical address space. How Hyper-V is told about the GPA or list
+ of GPAs varies. In some cases, a single GPA is written to a
+ synthetic register. In other cases, a GPA or list of GPAs is sent
+ in a VMbus message.
+
+* Hyper-V translates the GPAs into "real" physical memory addresses,
+ and creates a virtual mapping that it can use to access the memory.
+
+* Linux can later revoke sharing it has previously established by
+ telling Hyper-V to set the shared GPA to zero.
+
+Hyper-V operates with a page size of 4 Kbytes. GPAs communicated to
+Hyper-V may be in the form of page numbers, and always describe a
+range of 4 Kbytes. Since the Linux guest page size on x86/x64 is
+also 4 Kbytes, the mapping from guest page to Hyper-V page is 1-to-1.
+On arm64, Hyper-V supports guests with 4/16/64 Kbyte pages as
+defined by the arm64 architecture. If Linux is using 16 or 64
+Kbyte pages, Linux code must be careful to communicate with Hyper-V
+only in terms of 4 Kbyte pages. HV_HYP_PAGE_SIZE and related macros
+are used in code that communicates with Hyper-V so that it works
+correctly in all configurations.
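+
+For example, a helper that converts a guest buffer length into the
+4-Kbyte units Hyper-V expects might look like this (a sketch;
+HV_HYP_PAGE_SIZE and HV_HYP_PAGE_SHIFT are the real macros, while the
+helper name is made up for illustration)::
+
+   /* Number of 4-Kbyte Hyper-V pages needed to cover "len" bytes */
+   static inline unsigned long hv_page_count(unsigned long len)
+   {
+       return (len + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT;
+   }
+
+On arm64 with a 64-Kbyte guest page size, a single guest page is thus
+described to Hyper-V as sixteen consecutive 4-Kbyte page numbers
+rather than as one guest PFN.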
+
+As described in the TLFS, a few memory pages shared between Hyper-V
+and the Linux guest are "overlay" pages. With overlay pages, Linux
+uses the usual approach of allocating guest memory and telling
+Hyper-V the GPA of the allocated memory. But Hyper-V then replaces
+that physical memory page with a page it has allocated, and the
+original physical memory page is no longer accessible in the guest
+VM. Linux may access the memory normally as if it were the memory
+that it originally allocated. The "overlay" behavior is visible
+only because the contents of the page (as seen by Linux) change at
+the time that Linux originally establishes the sharing and the
+overlay page is inserted. Similarly, the contents change if Linux
+revokes the sharing, in which case Hyper-V removes the overlay page,
+and the guest page originally allocated by Linux becomes visible
+again.
+
+Before Linux does a kexec to a kdump kernel or any other kernel,
+memory shared with Hyper-V should be revoked. Hyper-V could modify
+a shared page or remove an overlay page after the new kernel is
+using the page for a different purpose, corrupting the new kernel.
+Hyper-V does not provide a single "set everything" operation to
+guest VMs, so Linux code must individually revoke all sharing before
+doing kexec. See hv_kexec_handler() and hv_crash_handler(). But
+the crash/panic path still has holes in cleanup because some shared
+pages are set using per-CPU synthetic registers and there's no
+mechanism to revoke the shared pages for CPUs other than the CPU
+running the panic path.
+
+CPU Management
+--------------
+Hyper-V does not have the ability to hot-add or hot-remove a CPU
+from a running VM. However, Windows Server 2019 Hyper-V and
+earlier versions may provide guests with ACPI tables that indicate
+more CPUs than are actually present in the VM. As is normal, Linux
+treats these additional CPUs as potential hot-add CPUs, and reports
+them as such even though Hyper-V will never actually hot-add them.
+Starting in Windows Server 2022 Hyper-V, the ACPI tables reflect
+only the CPUs actually present in the VM, so Linux does not report
+any hot-add CPUs.
+
+A Linux guest CPU may be taken offline using the normal Linux
+mechanisms, provided no VMbus channel interrupts are assigned to
+the CPU. See the section on VMbus Interrupts for more details
+on how VMbus channel interrupts can be re-assigned to permit
+taking a CPU offline.
+
+32-bit and 64-bit
+-----------------
+On x86/x64, Hyper-V supports 32-bit and 64-bit guests, and Linux
+will build and run in either version. While the 32-bit version is
+expected to work, it is used rarely and may suffer from undetected
+regressions.
+
+On arm64, Hyper-V supports only 64-bit guests.
+
+Endian-ness
+-----------
+All communication between Hyper-V and guest VMs uses Little-Endian
+format on both x86/x64 and arm64. Big-endian format on arm64 is not
+supported by Hyper-V, and Linux code does not use endian-ness macros
+when accessing data shared with Hyper-V.
+
+Versioning
+----------
+Current Linux kernels operate correctly with older versions of
+Hyper-V back to Windows Server 2012 Hyper-V. Support for running
+on the original Hyper-V release in Windows Server 2008/2008 R2
+has been removed.
+
+A Linux guest on Hyper-V outputs in dmesg the version of Hyper-V
+it is running on. This version is in the form of a Windows build
+number and is for display purposes only. Linux code does not
+test this version number at runtime to determine available features
+and functionality. Hyper-V indicates feature/function availability
+via flags in synthetic MSRs that Hyper-V provides to the guest,
+and the guest code tests these flags.
+
+VMbus has its own protocol version that is negotiated during the
+initial VMbus connection from the guest to Hyper-V. This version
+number is also output to dmesg during boot. This version number
+is checked in a few places in the code to determine if specific
+functionality is present.
+
+Furthermore, each synthetic device on VMbus also has a protocol
+version that is separate from the VMbus protocol version. Device
+drivers for these synthetic devices typically negotiate the device
+protocol version, and may test that protocol version to determine
+if specific device functionality is present.
+
+Code Packaging
+--------------
+Hyper-V related code appears in the Linux kernel code tree in three
+main areas:
+
+1. drivers/hv
+
+2. arch/x86/hyperv and arch/arm64/hyperv
+
+3. individual device driver areas such as drivers/scsi, drivers/net,
+ drivers/clocksource, etc.
+
+A few miscellaneous files appear elsewhere. See the full list under
+"Hyper-V/Azure CORE AND DRIVERS" and "DRM DRIVER FOR HYPERV
+SYNTHETIC VIDEO DEVICE" in the MAINTAINERS file.
+
+The code in #1 and #2 is built only when CONFIG_HYPERV is set.
+Similarly, the code for most Hyper-V related drivers is built only
+when CONFIG_HYPERV is set.
+
+Most Hyper-V related code in #1 and #3 can be built as a module.
+The architecture specific code in #2 must be built-in. Also,
+drivers/hv/hv_common.c is low-level code that is common across
+architectures and must be built-in.
diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst
new file mode 100644
index 000000000000..d2012d9022c5
--- /dev/null
+++ b/Documentation/virt/hyperv/vmbus.rst
@@ -0,0 +1,303 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+VMbus
+=====
+VMbus is a software construct provided by Hyper-V to guest VMs. It
+consists of a control path and common facilities used by synthetic
+devices that Hyper-V presents to guest VMs. The control path is
+used to offer synthetic devices to the guest VM and, in some cases,
+to rescind those devices. The common facilities include software
+channels for communicating between the device driver in the guest VM
+and the synthetic device implementation that is part of Hyper-V, and
+signaling primitives to allow Hyper-V and the guest to interrupt
+each other.
+
+VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
+entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c)
+establishes the VMbus control path with the Hyper-V host, then
+registers itself as a Linux bus driver. It implements the standard
+bus functions for adding and removing devices to/from the bus.
+
+Most synthetic devices offered by Hyper-V have a corresponding Linux
+device driver. These devices include:
+
+* SCSI controller
+* NIC
+* Graphics frame buffer
+* Keyboard
+* Mouse
+* PCI device pass-thru
+* Heartbeat
+* Time Sync
+* Shutdown
+* Memory balloon
+* Key/Value Pair (KVP) exchange with Hyper-V
+* Hyper-V online backup (a.k.a. VSS)
+
+Guest VMs may have multiple instances of the synthetic SCSI
+controller, synthetic NIC, and PCI pass-thru devices. Other
+synthetic devices are limited to a single instance per VM. Not
+listed above are a small number of synthetic devices offered by
+Hyper-V that are used only by Windows guests and for which Linux
+does not have a driver.
+
+Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
+devices. "VSP" refers to the Hyper-V code that implements a
+particular synthetic device, while "VSC" refers to the driver for
+the device in the guest VM. For example, the Linux driver for the
+synthetic NIC is referred to as "netvsc" and the Linux driver for
+the synthetic SCSI controller is "storvsc". These drivers contain
+functions with names like "storvsc_connect_to_vsp".
+
+VMbus channels
+--------------
+An instance of a synthetic device uses VMbus channels to communicate
+between the VSP and the VSC. Channels are bi-directional and used
+for passing messages. Most synthetic devices use a single channel,
+but the synthetic SCSI controller and synthetic NIC may use multiple
+channels to achieve higher performance and greater parallelism.
+
+Each channel consists of two ring buffers. These are classic ring
+buffers from a university data structures textbook. If the read
+and write pointers are equal, the ring buffer is considered to be
+empty, so a full ring buffer always has at least one byte unused.
+The "in" ring buffer is for messages from the Hyper-V host to the
+guest, and the "out" ring buffer is for messages from the guest to
+the Hyper-V host. In Linux, the "in" and "out" designations are as
+viewed by the guest side. The ring buffers are memory that is
+shared between the guest and the host, and they follow the standard
+paradigm where the memory is allocated by the guest, with the list
+of GPAs that make up the ring buffer communicated to the host. Each
+ring buffer consists of a header page (4 Kbytes) with the read and
+write indices and some control flags, followed by the memory for the
+actual ring. The size of the ring is determined by the VSC in the
+guest and is specific to each synthetic device. The list of GPAs
+making up the ring is communicated to the Hyper-V host over the
+VMbus control path as a GPA Descriptor List (GPADL). See function
+vmbus_establish_gpadl().
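+
+A hedged sketch of the classic index arithmetic (illustrative, not
+the exact kernel code)::
+
+   /* ring_size is the size of the ring data area in bytes */
+   static u32 bytes_avail_to_read(u32 read_idx, u32 write_idx, u32 ring_size)
+   {
+       if (write_idx >= read_idx)
+           return write_idx - read_idx;
+       return ring_size - (read_idx - write_idx);
+   }
+
+   static u32 bytes_avail_to_write(u32 read_idx, u32 write_idx, u32 ring_size)
+   {
+       /* Leave one byte unused so read_idx == write_idx always means "empty" */
+       return ring_size - bytes_avail_to_read(read_idx, write_idx, ring_size) - 1;
+   }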
+
+Each ring buffer is mapped into contiguous Linux kernel virtual
+space in three parts: 1) the 4 Kbyte header page, 2) the memory
+that makes up the ring itself, and 3) a second mapping of the memory
+that makes up the ring itself. Because (2) and (3) are contiguous
+in kernel virtual space, the code that copies data to and from the
+ring buffer need not be concerned with ring buffer wrap-around.
+Once a copy operation has completed, the read or write index may
+need to be reset to point back into the first mapping, but the
+actual data copy does not need to be broken into two parts. This
+approach also allows complex data structures to be easily accessed
+directly in the ring without handling wrap-around.
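+
+A hedged sketch of how such a double mapping can be constructed (the
+kernel's actual code is in hv_ringbuffer_init(); the variable names
+here are illustrative, and "pages" is assumed to hold the header page
+followed by the n pages of the ring)::
+
+   struct page **twice;
+   void *vaddr;
+   int i;
+
+   twice = kcalloc(1 + 2 * n, sizeof(*twice), GFP_KERNEL);
+   if (!twice)
+       return -ENOMEM;
+
+   twice[0] = pages[0];                     /* header page, mapped once */
+   for (i = 0; i < n; i++) {
+       twice[1 + i]     = pages[1 + i];     /* ring data, first mapping */
+       twice[1 + n + i] = pages[1 + i];     /* ring data, second mapping */
+   }
+
+   vaddr = vmap(twice, 1 + 2 * n, VM_MAP, PAGE_KERNEL);
+   kfree(twice);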
+
+On arm64 with page sizes > 4 Kbytes, the header page must still be
+passed to Hyper-V as a 4 Kbyte area. But the memory for the actual
+ring must be aligned to PAGE_SIZE and have a size that is a multiple
+of PAGE_SIZE so that the duplicate mapping trick can be done. Hence
+a portion of the header page is unused and not communicated to
+Hyper-V. This case is handled by vmbus_establish_gpadl().
+
+Hyper-V enforces a limit on the aggregate amount of guest memory
+that can be shared with the host via GPADLs. This limit ensures
+that a rogue guest can't force the consumption of excessive host
+resources. For Windows Server 2019 and later, this limit is
+approximately 1280 Mbytes. For versions prior to Windows Server
+2019, the limit is approximately 384 Mbytes.
+
+VMbus messages
+--------------
+All VMbus messages have a standard header that includes the message
+length, the offset of the message payload, some flags, and a
+transactionID. The portion of the message after the header is
+unique to each VSP/VSC pair.
+
+Messages follow one of two patterns:
+
+* Unidirectional: Either side sends a message and does not
+ expect a response message
+* Request/response: One side (usually the guest) sends a message
+ and expects a response
+
+The transactionID (a.k.a. "requestID") is for matching requests &
+responses. Some synthetic devices allow multiple requests to be in-
+flight simultaneously, so the guest specifies a transactionID when
+sending a request. Hyper-V sends back the same transactionID in the
+matching response.
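+
+The header has roughly the following shape (a sketch modeled on
+struct vmpacket_descriptor in include/linux/hyperv.h; offsets and
+lengths are expressed in 8-byte units)::
+
+   struct vmpacket_descriptor {
+       u16 type;       /* packet type */
+       u16 offset8;    /* offset of the payload from the header start */
+       u16 len8;       /* total packet length */
+       u16 flags;      /* e.g. "completion requested" */
+       u64 trans_id;   /* transactionID/requestID echoed in the response */
+   } __packed;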
+
+Messages passed between the VSP and VSC are control messages. For
+example, a message sent from the storvsc driver might be "execute
+this SCSI command". If a message also implies some data transfer
+between the guest and the Hyper-V host, the actual data to be
+transferred may be embedded with the control message, or it may be
+specified as a separate data buffer that the Hyper-V host will
+access as a DMA operation. The former case is used when the size of
+the data is small and the cost of copying the data to and from the
+ring buffer is minimal. For example, time sync messages from the
+Hyper-V host to the guest contain the actual time value. When the
+data is larger, a separate data buffer is used. In this case, the
+control message contains a list of GPAs that describe the data
+buffer. For example, the storvsc driver uses this approach to
+specify the data buffers to/from which disk I/O is done.
+
+Three functions exist to send VMbus messages:
+
+1. vmbus_sendpacket(): Control-only messages and messages with
+ embedded data -- no GPAs
+2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
+ identifying data to transfer. An offset and length is
+ associated with each GPA so that multiple discontinuous areas
+ of guest memory can be targeted.
+3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
+ identifying data to transfer. A single offset and length is
+ associated with a list of GPAs. The GPAs must describe a
+ single logical area of guest memory to be targeted.
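+
+As an illustration, a control message with an embedded payload might
+be sent with something like the following (a sketch; the request
+structure and the use of its address as the requestID are
+illustrative, and error handling is omitted)::
+
+   ret = vmbus_sendpacket(channel, &req, sizeof(req),
+                          (unsigned long)&req,    /* requestID */
+                          VM_PKT_DATA_INBAND,
+                          VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);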
+
+Historically, Linux guests have trusted Hyper-V to send well-formed
+and valid messages, and Linux drivers for synthetic devices did not
+fully validate messages. With the introduction of processor
+technologies that fully encrypt guest memory and that allow the
+guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
+the Hyper-V host is no longer a valid assumption. The drivers for
+VMbus synthetic devices are being updated to fully validate any
+values read from memory that is shared with Hyper-V, which includes
+messages from VMbus devices. To facilitate such validation,
+messages read by the guest from the "in" ring buffer are copied to a
+temporary buffer that is not shared with Hyper-V. Validation is
+performed in this temporary buffer without the risk of Hyper-V
+maliciously modifying the message after it is validated but before
+it is used.
+
+VMbus interrupts
+----------------
+VMbus provides a mechanism for the guest to interrupt the host when
+the guest has queued new messages in a ring buffer. The host
+expects that the guest will send an interrupt only when an "out"
+ring buffer transitions from empty to non-empty. If the guest sends
+interrupts at other times, the host deems such interrupts to be
+unnecessary. If a guest sends an excessive number of unnecessary
+interrupts, the host may throttle that guest by suspending its
+execution for a few seconds to prevent a denial-of-service attack.
+
+Similarly, the host will interrupt the guest when it sends a new
+message on the VMbus control path, or when a VMbus channel "in" ring
+buffer transitions from empty to non-empty. Each CPU in the guest
+may receive VMbus interrupts, so they are best modeled as per-CPU
+interrupts in Linux. This model works well on arm64 where a single
+per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for
+per-CPU IRQs, an x86 interrupt vector is statically allocated (see
+HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to
+call the VMbus interrupt service routine. These interrupts are
+visible in /proc/interrupts on the "HYP" line.
+
+The guest CPU that a VMbus channel will interrupt is selected by the
+guest when the channel is created, and the host is informed of that
+selection. VMbus devices are broadly grouped into two categories:
+
+1. "Slow" devices that need only one VMbus channel. The devices
+ (such as keyboard, mouse, heartbeat, and timesync) generate
+ relatively few interrupts. Their VMbus channels are all
+ assigned to interrupt the VMBUS_CONNECT_CPU, which is always
+ CPU 0.
+
+2. "High speed" devices that may use multiple VMbus channels for
+ higher parallelism and performance. These devices include the
+ synthetic SCSI controller and synthetic NIC. Their VMbus
+ channel interrupts are assigned to CPUs that are spread out
+ among the available CPUs in the VM so that interrupts on
+ multiple channels can be processed in parallel.
+
+The assignment of VMbus channel interrupts to CPUs is done in the
+function init_vp_index(). This assignment is done outside of the
+normal Linux interrupt affinity mechanism, so the interrupts are
+neither "unmanaged" nor "managed" interrupts.
+
+The CPU that a VMbus channel will interrupt can be seen in
+/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
+When running on later versions of Hyper-V, the CPU can be changed
+by writing a new value to this sysfs entry. Because the interrupt
+assignment is done outside of the normal Linux affinity mechanism,
+there are no entries in /proc/irq corresponding to individual
+VMbus channel interrupts.
+
+An online CPU in a Linux guest may not be taken offline if it has
+VMbus channel interrupts assigned to it. Any such channel
+interrupts must first be manually reassigned to another CPU as
+described above. When no channel interrupts are assigned to the
+CPU, it can be taken offline.
+
+When a guest CPU receives a VMbus interrupt from the host, the
+function vmbus_isr() handles the interrupt. It first checks for
+channel interrupts by calling vmbus_chan_sched(), which looks at a
+bitmap set up by the host to determine which channels have pending
+interrupts on this CPU. If multiple channels have pending
+interrupts for this CPU, they are processed sequentially. When all
+channel interrupts have been processed, vmbus_isr() checks for and
+processes any message received on the VMbus control path.
+
+The VMbus channel interrupt handling code is designed to work
+correctly even if an interrupt is received on a CPU other than the
+CPU assigned to the channel. Specifically, the code does not use
+CPU-based exclusion for correctness. In normal operation, Hyper-V
+will interrupt the assigned CPU. But when the CPU assigned to a
+channel is being changed via sysfs, the guest doesn't know exactly
+when Hyper-V will make the transition. The code must work correctly
+even if there is a time lag before Hyper-V starts interrupting the
+new CPU. See comments in target_cpu_store().
+
+VMbus device creation/deletion
+------------------------------
+Hyper-V and the Linux guest have a separate message-passing path
+that is used for synthetic device creation and deletion. This
+path does not use a VMbus channel. See vmbus_post_msg() and
+vmbus_on_msg_dpc().
+
+The first step is for the guest to connect to the generic
+Hyper-V VMbus mechanism. As part of establishing this connection,
+the guest and Hyper-V agree on a VMbus protocol version they will
+use. This negotiation allows newer Linux kernels to run on older
+Hyper-V versions, and vice versa.
+
+The guest then tells Hyper-V to "send offers". Hyper-V sends an
+offer message to the guest for each synthetic device that the VM
+is configured to have. Each VMbus device type has a fixed GUID
+known as the "class ID", and each VMbus device instance is also
+identified by a GUID. The offer message from Hyper-V contains
+both GUIDs to uniquely (within the VM) identify the device.
+There is one offer message for each device instance, so a VM with
+two synthetic NICs will get two offer messages with the NIC
+class ID. The ordering of offer messages can vary from boot-to-boot
+and must not be assumed to be consistent in Linux code. Offer
+messages may also arrive long after Linux has initially booted
+because Hyper-V supports adding devices, such as synthetic NICs,
+to running VMs. A new offer message is processed by
+vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().
+
+Upon receipt of an offer message, the guest identifies the device
+type based on the class ID, and invokes the correct driver to set up
+the device. Driver/device matching is performed using the standard
+Linux mechanism.
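+
+A hedged sketch of how a VSC driver participates in this matching
+(modeled on the existing netvsc/storvsc drivers; the "my_" names are
+placeholders)::
+
+   static const struct hv_vmbus_device_id id_table[] = {
+       { HV_NIC_GUID, },    /* class ID this driver binds to */
+       { },
+   };
+   MODULE_DEVICE_TABLE(vmbus, id_table);
+
+   static struct hv_driver my_vsc_drv = {
+       .name     = "my_vsc",
+       .id_table = id_table,
+       .probe    = my_vsc_probe,     /* opens the primary channel, etc. */
+       .remove   = my_vsc_remove,    /* handles rescind and removal */
+   };
+
+   /* registered at module init time with vmbus_driver_register() */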
+
+The device driver probe function opens the primary VMbus channel to
+the corresponding VSP. It allocates guest memory for the channel
+ring buffers and shares the ring buffer with the Hyper-V host by
+giving the host a list of GPAs for the ring buffer memory. See
+vmbus_establish_gpadl().
+
+Once the ring buffer is set up, the device driver and VSP exchange
+setup messages via the primary channel. These messages may include
+negotiating the device protocol version to be used between the Linux
+VSC and the VSP on the Hyper-V host. The setup messages may also
+include creating additional VMbus channels, which are somewhat
+mis-named as "sub-channels" since they are functionally
+equivalent to the primary channel once they are created.
+
+Finally, the device driver may create entries in /dev as with
+any device driver.
+
+The Hyper-V host can send a "rescind" message to the guest to
+remove a device that was previously offered. Linux drivers must
+handle such a rescind message at any time. Rescinding a device
+invokes the device driver "remove" function to cleanly shut
+down the device and remove it. Once a synthetic device is
+rescinded, neither Hyper-V nor Linux retains any state about
+its previous existence. Such a device might be re-added later,
+in which case it is treated as an entirely new device. See
+vmbus_onoffer_rescind().
diff --git a/Documentation/virt/index.rst b/Documentation/virt/index.rst
index 492f0920b988..2f1cffa87b1b 100644
--- a/Documentation/virt/index.rst
+++ b/Documentation/virt/index.rst
@@ -14,6 +14,7 @@ Linux Virtualization Support
ne_overview
acrn/index
coco/sev-guest
+ hyperv/index
.. only:: html and subproject
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 11e00a46c610..abd7c32126ce 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1150,6 +1150,10 @@ The following bits are defined in the flags field:
fields contain a valid state. This bit will be set whenever
KVM_CAP_EXCEPTION_PAYLOAD is enabled.
+- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
+ triple_fault_pending field contains a valid state. This bit will
+ be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.
+
ARM64:
^^^^^^
@@ -1245,6 +1249,10 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.
+If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
+can be set in the flags field to signal that the triple_fault field contains
+a valid state and shall be written into the VCPU.
+
ARM64:
^^^^^^
@@ -2998,7 +3006,9 @@ KVM_CREATE_PIT2. The state is returned in the following structure::
Valid flags are::
/* disable PIT in HPET legacy mode */
- #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001
+ /* speaker port data bit enabled */
+ #define KVM_PIT_FLAGS_SPEAKER_DATA_ON 0x00000002
This IOCTL replaces the obsolete KVM_GET_PIT.
@@ -4667,7 +4677,7 @@ encrypted VMs.
Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
-Documentation/virt/kvm/amd-memory-encryption.rst.
+Documentation/virt/kvm/x86/amd-memory-encryption.rst.
4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
@@ -5127,7 +5137,15 @@ into ESA mode. This reset is a superset of the initial reset.
__u32 reserved[3];
};
-cmd values:
+**Ultravisor return codes**
+The Ultravisor return (reason) codes are provided by the kernel if an
+Ultravisor call has been executed to achieve the results expected by
+the command. Therefore they are independent of the IOCTL return
+code. If KVM changes `rc`, its value will always be greater than 0,
+hence setting it to 0 before issuing a PV command is advised so that
+a change of `rc` can be detected.
+
+**cmd values:**
KVM_PV_ENABLE
Allocate memory and register the VM with the Ultravisor, thereby
@@ -5143,7 +5161,6 @@ KVM_PV_ENABLE
===== =============================
KVM_PV_DISABLE
-
Deregister the VM from the Ultravisor and reclaim the memory that
had been donated to the Ultravisor, making it usable by the kernel
again. All registered VCPUs are converted back to non-protected
@@ -5160,6 +5177,117 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
+KVM_PV_INFO
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
+
+ Presents an API that provides Ultravisor related data to userspace
+ via subcommands. len_max is the size of the user space buffer,
+ len_written is KVM's indication of how many bytes of that buffer
+ were actually written. len_written can be used to determine the
+ valid fields if more response fields are added in the future.
+
+ ::
+
+ enum pv_cmd_info_id {
+ KVM_PV_INFO_VM,
+ KVM_PV_INFO_DUMP,
+ };
+
+ struct kvm_s390_pv_info_header {
+ __u32 id;
+ __u32 len_max;
+ __u32 len_written;
+ __u32 reserved;
+ };
+
+ struct kvm_s390_pv_info {
+ struct kvm_s390_pv_info_header header;
+ struct kvm_s390_pv_info_dump dump;
+ struct kvm_s390_pv_info_vm vm;
+ };
+
+**subcommands:**
+
+ KVM_PV_INFO_VM
+ This subcommand provides basic Ultravisor information for PV
+ hosts. These values are likely also exported as files in the sysfs
+ firmware UV query interface but they are more easily available to
+ programs in this API.
+
+ The installed calls and feature_indication members provide the
+ installed UV calls and the UV's other feature indications.
+
+ The max_* members provide information about the maximum number of PV
+ vcpus, PV guests and PV guest memory size.
+
+ ::
+
+ struct kvm_s390_pv_info_vm {
+ __u64 inst_calls_list[4];
+ __u64 max_cpus;
+ __u64 max_guests;
+ __u64 max_guest_addr;
+ __u64 feature_indication;
+ };
+
+
+ KVM_PV_INFO_DUMP
+ This subcommand provides information related to dumping PV guests.
+
+ ::
+
+ struct kvm_s390_pv_info_dump {
+ __u64 dump_cpu_buffer_len;
+ __u64 dump_config_mem_buffer_per_1m;
+ __u64 dump_config_finalize_len;
+ };
+
+KVM_PV_DUMP
+ :Capability: KVM_CAP_S390_PROTECTED_DUMP
+
+ Presents an API that provides calls which facilitate dumping a
+ protected VM.
+
+ ::
+
+ struct kvm_s390_pv_dmp {
+ __u64 subcmd;
+ __u64 buff_addr;
+ __u64 buff_len;
+ __u64 gaddr; /* For dump storage state */
+ };
+
+ **subcommands:**
+
+ KVM_PV_DUMP_INIT
+ Initializes the dump process of a protected VM. If this call does
+ not succeed all other subcommands will fail with -EINVAL. This
+ subcommand will return -EINVAL if a previously started dump process
+ has not yet been completed.
+
+ Not all PV VMs can be dumped; the owner needs to set the `dump
+ allowed` PCF bit 34 in the SE header to allow dumping.
+
+ KVM_PV_DUMP_CONFIG_STOR_STATE
+ Stores `buff_len` bytes of tweak component values starting with
+ the 1MB block specified by the absolute guest address
+ (`gaddr`). `buff_len` needs to be `conf_dump_storage_state_len`
+ aligned and at least as large as the `conf_dump_storage_state_len`
+ value provided by the dump uv_info data. buff_user might be written
+ to even if an error rc is returned, for instance if we encounter a
+ fault after writing the first page of data.
+
+ KVM_PV_DUMP_COMPLETE
+ If the subcommand succeeds it completes the dump process and lets
+ KVM_PV_DUMP_INIT be called again.
+
+ On success `conf_dump_finalize_len` bytes of completion data will be
+ stored to the `buff_addr`. The completion data contains a key
+ derivation seed, IV, tweak nonce and encryption keys as well as an
+ authentication tag all of which are needed to decrypt the dump at a
+ later time.
+
+
4.126 KVM_X86_SET_MSR_FILTER
----------------------------
@@ -5657,7 +5785,8 @@ by a string of size ``name_size``.
#define KVM_STATS_UNIT_BYTES (0x1 << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_UNIT_SECONDS (0x2 << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_UNIT_CYCLES (0x3 << KVM_STATS_UNIT_SHIFT)
- #define KVM_STATS_UNIT_MAX KVM_STATS_UNIT_CYCLES
+ #define KVM_STATS_UNIT_BOOLEAN (0x4 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_MAX KVM_STATS_UNIT_BOOLEAN
#define KVM_STATS_BASE_SHIFT 8
#define KVM_STATS_BASE_MASK (0xF << KVM_STATS_BASE_SHIFT)
@@ -5702,14 +5831,13 @@ Bits 0-3 of ``flags`` encode the type:
by the ``hist_param`` field. The range of the Nth bucket (1 <= N < ``size``)
is [``hist_param``*(N-1), ``hist_param``*N), while the range of the last
bucket is [``hist_param``*(``size``-1), +INF). (+INF means positive infinity
- value.) The bucket value indicates how many samples fell in the bucket's range.
+ value.)
* ``KVM_STATS_TYPE_LOG_HIST``
The statistic is reported as a logarithmic histogram. The number of
buckets is specified by the ``size`` field. The range of the first bucket is
[0, 1), while the range of the last bucket is [pow(2, ``size``-2), +INF).
Otherwise, The Nth bucket (1 < N < ``size``) covers
- [pow(2, N-2), pow(2, N-1)). The bucket value indicates how many samples fell
- in the bucket's range.
+ [pow(2, N-2), pow(2, N-1)).
Bits 4-7 of ``flags`` encode the unit:
@@ -5724,6 +5852,15 @@ Bits 4-7 of ``flags`` encode the unit:
It indicates that the statistics data is used to measure time or latency.
* ``KVM_STATS_UNIT_CYCLES``
It indicates that the statistics data is used to measure CPU clock cycles.
+ * ``KVM_STATS_UNIT_BOOLEAN``
+ It indicates that the statistic will always be either 0 or 1. Boolean
+ statistics of "peak" type will never go back from 1 to 0. Boolean
+ statistics can be linear histograms (with two buckets) but not logarithmic
+ histograms.
+
+Note that, in the case of histograms, the unit applies to the bucket
+ranges, while the bucket value indicates how many samples fell in the
+bucket's range.
Bits 8-11 of ``flags``, together with ``exponent``, encode the scale of the
unit:
@@ -5746,7 +5883,7 @@ the corresponding statistics data.
The ``bucket_size`` field is used as a parameter for histogram statistics data.
It is only used by linear histogram statistics data, specifying the size of a
-bucket.
+bucket in the unit expressed by bits 4-11 of ``flags`` together with ``exponent``.
The ``name`` field is the name string of the statistics data. The name string
starts at the end of ``struct kvm_stats_desc``. The maximum length including
@@ -5802,6 +5939,78 @@ of CPUID leaf 0xD on the host.
This ioctl injects an event channel interrupt directly to the guest vCPU.
+4.136 KVM_S390_PV_CPU_COMMAND
+-----------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, < 0 on error
+
+This ioctl closely mirrors `KVM_S390_PV_COMMAND` but handles requests
+for vcpus. It re-uses the kvm_s390_pv_dmp struct and hence also shares
+the command ids.
+
+**command:**
+
+KVM_PV_DUMP
+ Presents an API that provides calls which facilitate dumping a vcpu
+ of a protected VM.
+
+**subcommand:**
+
+KVM_PV_DUMP_CPU
+ Provides encrypted dump data like register values.
+ The length of the returned data is provided by uv_info.guest_cpu_stor_len.
+
+4.137 KVM_S390_ZPCI_OP
+----------------------
+
+:Capability: KVM_CAP_S390_ZPCI_OP
+:Architectures: s390
+:Type: vm ioctl
+:Parameters: struct kvm_s390_zpci_op (in)
+:Returns: 0 on success, <0 on error
+
+Used to manage hardware-assisted virtualization features for zPCI devices.
+
+Parameters are specified via the following structure::
+
+ struct kvm_s390_zpci_op {
+ /* in */
+ __u32 fh; /* target device */
+ __u8 op; /* operation to perform */
+ __u8 pad[3];
+ union {
+ /* for KVM_S390_ZPCIOP_REG_AEN */
+ struct {
+ __u64 ibv; /* Guest addr of interrupt bit vector */
+ __u64 sb; /* Guest addr of summary bit */
+ __u32 flags;
+ __u32 noi; /* Number of interrupts */
+ __u8 isc; /* Guest interrupt subclass */
+ __u8 sbo; /* Offset of guest summary bit vector */
+ __u16 pad;
+ } reg_aen;
+ __u64 reserved[8];
+ } u;
+ };
+
+The type of operation is specified in the "op" field.
+KVM_S390_ZPCIOP_REG_AEN is used to register the VM for adapter event
+notification interpretation, which will allow firmware delivery of adapter
+events directly to the VM, with KVM providing a backup delivery mechanism;
+KVM_S390_ZPCIOP_DEREG_AEN is used to subsequently disable interpretation of
+adapter event notifications.
+
+The target zPCI function must also be specified via the "fh" field. For the
+KVM_S390_ZPCIOP_REG_AEN operation, additional information to establish firmware
+delivery must be provided via the "reg_aen" struct.
+
+The "pad" and "reserved" fields may be used for future extensions and should be
+set to 0s by userspace.
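+
+For example, registering a device for adapter event notification
+might look like the following from userspace (a sketch; the variable
+values are illustrative and error handling is omitted)::
+
+  struct kvm_s390_zpci_op args = {
+      .fh = fh,                         /* function handle of the target device */
+      .op = KVM_S390_ZPCIOP_REG_AEN,
+      .u.reg_aen = {
+          .ibv   = ibv_gaddr,           /* guest address of interrupt bit vector */
+          .sb    = sb_gaddr,            /* guest address of summary bit */
+          .noi   = nr_interrupts,
+          .isc   = isc,
+          .sbo   = summary_bit_offset,
+          .flags = 0,
+      },
+  };
+
+  ioctl(vm_fd, KVM_S390_ZPCI_OP, &args);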
+
5. The kvm_run structure
========================
@@ -6407,6 +6616,26 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
::
+ /* KVM_EXIT_NOTIFY */
+ struct {
+ #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
+ __u32 flags;
+ } notify;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_NOTIFY_VMEXIT is
+enabled, a VM exit is generated if no event window occurs in VM non-root
+mode for a specified amount of time. If KVM_X86_NOTIFY_VMEXIT_USER was set
+when enabling the cap, KVM exits to userspace with the exit reason
+KVM_EXIT_NOTIFY for further handling. The "flags" field contains more
+detailed info.
+
+The valid value for 'flags' is:
+
+ - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
+ in the VMCS. Resuming the target VM may produce unpredictable results.
+
+::
+
/* Fix the size of the union. */
char padding[256];
};
@@ -7348,8 +7577,71 @@ The valid bits in cap.args[0] are:
hypercall instructions. Executing the
incorrect hypercall instruction will
generate a #UD within the guest.
+
+KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
+ they are intercepted) as NOPs regardless of
+ whether or not MONITOR/MWAIT are supported
+ according to guest CPUID. When this quirk
+ is disabled and KVM_X86_DISABLE_EXITS_MWAIT
+ is not set (MONITOR/MWAIT are intercepted),
+ KVM will inject a #UD on MONITOR/MWAIT if
+ they're unsupported per guest CPUID. Note,
+ KVM will modify MONITOR/MWAIT support in
+ guest CPUID on writes to MISC_ENABLE if
+ KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
+ disabled.
=================================== ============================================
+7.32 KVM_CAP_MAX_VCPU_ID
+------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] - maximum APIC ID value set for current VM
+:Returns: 0 on success, -EINVAL if args[0] is beyond KVM_MAX_VCPU_IDS
+ supported in KVM or if it has been set.
+
+This capability allows userspace to specify the maximum possible APIC ID
+assigned for the current VM session prior to the creation of vCPUs, saving
+memory for data structures indexed by the APIC ID. Userspace is able
+to calculate the limit on APIC ID values from the designated
+CPU topology.
+
+The value can be changed only until KVM_ENABLE_CAP is set to a nonzero
+value or until a vCPU is created. Upon creation of the first vCPU,
+if the value was set to zero or KVM_ENABLE_CAP was not invoked, KVM
+uses the return value of KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPU_ID) as
+the maximum APIC ID.
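+
+A hedged userspace sketch of setting the limit before any vCPU is
+created (values are illustrative)::
+
+  struct kvm_enable_cap cap = {
+      .cap     = KVM_CAP_MAX_VCPU_ID,
+      .args[0] = max_apic_id,   /* derived from the intended CPU topology */
+  };
+
+  /* Must be issued on the VM fd before the first KVM_CREATE_VCPU */
+  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);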
+
+7.33 KVM_CAP_X86_NOTIFY_VMEXIT
+------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the value of notify window as well as some flags
+:Returns: 0 on success, -EINVAL if args[0] contains invalid flags or notify
+ VM exit is unsupported.
+
+Bits 63:32 of args[0] are used for notify window.
+Bits 31:0 of args[0] are for some flags. Valid bits are::
+
+ #define KVM_X86_NOTIFY_VMEXIT_ENABLED (1 << 0)
+ #define KVM_X86_NOTIFY_VMEXIT_USER (1 << 1)
+
+This capability allows userspace to configure the notify VM exit on/off
+in per-VM scope during VM creation. Notify VM exit is disabled by default.
+When userspace sets the KVM_X86_NOTIFY_VMEXIT_ENABLED bit in args[0], KVM
+will enable this feature with the notify window provided, which will
+generate a VM exit if no event window occurs in VM non-root mode for a
+specified amount of time (the notify window).
+
+If KVM_X86_NOTIFY_VMEXIT_USER is set in args[0], KVM exits to userspace
+for handling when a notify VM exit occurs.
+
+This capability is aimed at mitigating the threat that malicious VMs can
+cause a CPU to get stuck (because event windows never open up) and make
+the CPU unavailable to the host or other VMs.
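+
+A hedged userspace sketch of composing args[0] (the window value shown
+is illustrative)::
+
+  __u64 notify_window = 0x40000;   /* illustrative window value */
+  struct kvm_enable_cap cap = {
+      .cap     = KVM_CAP_X86_NOTIFY_VMEXIT,
+      .args[0] = (notify_window << 32) |
+                 KVM_X86_NOTIFY_VMEXIT_ENABLED |
+                 KVM_X86_NOTIFY_VMEXIT_USER,
+  };
+
+  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);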
+
8. Other capabilities.
======================
@@ -7670,7 +7962,7 @@ architecture-specific interfaces. This capability and the architecture-
specific interfaces must be consistent, i.e. if one says the feature
is supported, than the other should as well and vice versa. For arm64
see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL".
-For x86 see Documentation/virt/kvm/msr.rst "MSR_KVM_STEAL_TIME".
+For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME".
8.25 KVM_CAP_S390_DIAG318
-------------------------
@@ -7956,6 +8248,61 @@ should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
+8.37 KVM_CAP_S390_PROTECTED_DUMP
+--------------------------------
+
+:Capability: KVM_CAP_S390_PROTECTED_DUMP
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM and the Ultravisor support dumping
+PV guests. The `KVM_PV_DUMP` command is available for the
+`KVM_S390_PV_COMMAND` ioctl and the `KVM_PV_INFO` command provides
+dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
+available and supports the `KVM_PV_DUMP_CPU` subcommand.
+
+8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+-------------------------------------
+
+:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
+:Architectures: x86
+:Type: vm
+:Parameters: arg[0] must be 0.
+:Returns: 0 on success, -EPERM if the userspace process does not
+ have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
+ created.
+
+This capability disables the NX huge pages mitigation for iTLB MULTIHIT.
+
+The capability has no effect if the nx_huge_pages module parameter is not set.
+
+This capability may only be set before any vCPUs are created.
+
+8.39 KVM_CAP_S390_CPU_TOPOLOGY
+------------------------------
+
+:Capability: KVM_CAP_S390_CPU_TOPOLOGY
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM will provide the S390 CPU Topology
+facility which consists of the interpretation of the PTF instruction for
+the function code 2 along with interception and forwarding of both the
+PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
+instruction to the userland hypervisor.
+
+The stfle facility 11, CPU Topology facility, should not be indicated
+to the guest without this capability.
+
+When this capability is present, KVM provides a new attribute group
+on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
+This new attribute allows getting, setting or clearing the Modified Change
+Topology Report (MTCR) bit of the SCA through the kvm_device_attr
+structure.
+
+When getting the Modified Change Topology Report value, the attr->addr
+must point to a byte where the value will be stored or retrieved from.
+
9. Known KVM API problems
=========================
diff --git a/Documentation/virt/kvm/arm/hyp-abi.rst b/Documentation/virt/kvm/arm/hyp-abi.rst
index 4d43fbc25195..412b276449d3 100644
--- a/Documentation/virt/kvm/arm/hyp-abi.rst
+++ b/Documentation/virt/kvm/arm/hyp-abi.rst
@@ -60,12 +60,13 @@ these functions (see arch/arm{,64}/include/asm/virt.h):
* ::
- x0 = HVC_VHE_RESTART (arm64 only)
+ x0 = HVC_FINALISE_EL2 (arm64 only)
- Attempt to upgrade the kernel's exception level from EL1 to EL2 by enabling
- the VHE mode. This is conditioned by the CPU supporting VHE, the EL2 MMU
- being off, and VHE not being disabled by any other means (command line
- option, for example).
+ Finish configuring EL2 depending on the command-line options,
+ including an attempt to upgrade the kernel's exception level from
+ EL1 to EL2 by enabling the VHE mode. This is conditioned by the CPU
+ supporting VHE, the EL2 MMU being off, and VHE not being disabled by
+ any other means (command line option, for example).
Any other value of r0/x0 triggers a hypervisor-specific handling,
which is not documented here.
diff --git a/Documentation/virt/kvm/s390/index.rst b/Documentation/virt/kvm/s390/index.rst
index 605f488f0cc5..44ec9ab14b59 100644
--- a/Documentation/virt/kvm/s390/index.rst
+++ b/Documentation/virt/kvm/s390/index.rst
@@ -10,3 +10,4 @@ KVM for s390 systems
s390-diag
s390-pv
s390-pv-boot
+ s390-pv-dump
diff --git a/Documentation/virt/kvm/s390/s390-pv-boot.rst b/Documentation/virt/kvm/s390/s390-pv-boot.rst
index 73a6083cb5e7..96c48480a360 100644
--- a/Documentation/virt/kvm/s390/s390-pv-boot.rst
+++ b/Documentation/virt/kvm/s390/s390-pv-boot.rst
@@ -10,7 +10,7 @@ The memory of Protected Virtual Machines (PVMs) is not accessible to
I/O or the hypervisor. In those cases where the hypervisor needs to
access the memory of a PVM, that memory must be made accessible.
Memory made accessible to the hypervisor will be encrypted. See
-Documentation/virt/kvm/s390-pv.rst for details."
+Documentation/virt/kvm/s390/s390-pv.rst for details."
On IPL (boot) a small plaintext bootloader is started, which provides
information about the encrypted components and necessary metadata to
diff --git a/Documentation/virt/kvm/s390/s390-pv-dump.rst b/Documentation/virt/kvm/s390/s390-pv-dump.rst
new file mode 100644
index 000000000000..e542f06048f3
--- /dev/null
+++ b/Documentation/virt/kvm/s390/s390-pv-dump.rst
@@ -0,0 +1,64 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+s390 (IBM Z) Protected Virtualization dumps
+===========================================
+
+Summary
+-------
+
+Dumping a VM is an essential tool for debugging problems inside
+it. This is especially true when a protected VM runs into trouble as
+there's no way to access its memory and registers from the outside
+while it's running.
+
+However when dumping a protected VM we need to maintain its
+confidentiality until the dump is in the hands of the VM owner who
+should be the only one capable of analysing it.
+
+The confidentiality of the VM dump is ensured by the Ultravisor, which
+provides an interface to KVM over which encrypted CPU and memory data
+can be requested. The encryption is based on the Customer
+Communication Key which is the key that's used to encrypt VM data in a
+way that the customer is able to decrypt.
+
+
+Dump process
+------------
+
+A dump is done in 3 steps:
+
+**Initiation**
+
+This step initializes the dump process, generates cryptographic seeds
+and extracts dump keys with which the VM dump data will be encrypted.
+
+**Data gathering**
+
+Currently there are two types of data that can be gathered from a VM:
+the memory and the vcpu state.
+
+The vcpu state contains all the important registers of a vcpu: general,
+floating point, vector, control and TOD/timers. The vcpu dump can
+contain incomplete data if a vcpu is dumped while an instruction is
+emulated with help of the hypervisor. This is indicated by a flag bit
+in the dump data. For the same reason it is very important to not only
+write out the encrypted vcpu state, but also the unencrypted state
+from the hypervisor.
+
+The memory state is further divided into the encrypted memory and its
+metadata comprised of the encryption tweaks and status flags. The
+encrypted memory can simply be read once it has been exported. The
+time of the export does not matter as no re-encryption is
+needed. Memory that has been swapped out and hence was exported can be
+read from the swap and written to the dump target without need for any
+special actions.
+
+The tweaks / status flags for the exported pages need to be requested
+from the Ultravisor.
+
+**Finalization**
+
+The finalization step will provide the data needed to be able to
+decrypt the vcpu and memory data and end the dump process. When this
+step completes successfully a new dump initiation can be started.
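+
+A hedged outline of the corresponding userspace flow, using the
+`KVM_S390_PV_COMMAND` and `KVM_S390_PV_CPU_COMMAND` ioctls described
+in api.rst (buffer sizing comes from KVM_PV_INFO; error handling is
+omitted and the variable names are illustrative)::
+
+  struct kvm_s390_pv_dmp dmp = {};
+  struct kvm_pv_cmd cmd = {
+      .cmd  = KVM_PV_DUMP,
+      .data = (__u64)&dmp,
+  };
+
+  /* 1. Initiation */
+  dmp.subcmd = KVM_PV_DUMP_INIT;
+  ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);
+
+  /* 2. Data gathering: tweaks/status flags for one 1 MB block ... */
+  dmp.subcmd    = KVM_PV_DUMP_CONFIG_STOR_STATE;
+  dmp.gaddr     = gaddr;             /* absolute guest address of the block */
+  dmp.buff_addr = (__u64)tweak_buf;
+  dmp.buff_len  = tweak_buf_len;
+  ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);
+
+  /* ... and the encrypted vcpu state, per vcpu, via KVM_PV_DUMP_CPU
+   * on the KVM_S390_PV_CPU_COMMAND vcpu ioctl. */
+
+  /* 3. Finalization: completion data is written to buff_addr */
+  dmp.subcmd    = KVM_PV_DUMP_COMPLETE;
+  dmp.buff_addr = (__u64)fin_buf;
+  dmp.buff_len  = fin_buf_len;
+  ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);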
diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index e56fa8b9cfca..10db7924720f 100644
--- a/Documentation/virt/kvm/x86/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
@@ -22,7 +22,7 @@ S390:
number in R1.
For further information on the S390 diagnose call as supported by KVM,
- refer to Documentation/virt/kvm/s390-diag.rst.
+ refer to Documentation/virt/kvm/s390/s390-diag.rst.
PowerPC:
It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers.
diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
index 863f67b72c05..af2a97429692 100644
--- a/Documentation/virt/uml/user_mode_linux_howto_v2.rst
+++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
@@ -322,7 +322,7 @@ Shared Options
* ``v6=[0,1]`` to specify if a v6 connection is desired for all
transports which operate over IP. Additionally, for transports that
have some differences in the way they operate over v4 and v6 (for example
- EoL2TPv3), sets the correct mode of operation. In the absense of this
+ EoL2TPv3), sets the correct mode of operation. In the absence of this
option, the socket type is determined based on what do the src and dst
arguments resolve/parse to.