89 files changed, 15617 insertions, 0 deletions
diff --git a/Documentation/arch/arc/arc.rst b/Documentation/arch/arc/arc.rst
new file mode 100644
index 000000000000..6c4d978f3f4e
--- /dev/null
+++ b/Documentation/arch/arc/arc.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Linux kernel for ARC processors
+*******************************
+
+Other sources of information
+############################
+
+Below are some resources where more information can be found on
+ARC processors and relevant open source projects.
+
+- `<https://embarc.org>`_ - Community portal for open source on ARC.
+  Good place to start to find relevant FOSS projects, toolchain releases,
+  news items and more.
+
+- `<https://github.com/foss-for-synopsys-dwc-arc-processors>`_ -
+  Home for all development activities regarding open source projects for
+  ARC processors. Some of the projects are forks of various upstream projects,
+  where "work in progress" is hosted prior to submission to upstream projects.
+  Other projects are developed by Synopsys and made available to community
+  as open source for use on ARC Processors.
+
+- `Official Synopsys ARC Processors website
+  <https://www.synopsys.com/designware-ip/processor-solutions.html>`_ -
+  location, with access to some IP documentation (`Programmer's Reference
+  Manual, AKA PRM for ARC HS processors
+  <https://www.synopsys.com/dw/doc.php/ds/cc/programmers-reference-manual-ARC-HS.pdf>`_)
+  and free versions of some commercial tools (`Free nSIM
+  <https://www.synopsys.com/cgi-bin/dwarcnsim/req1.cgi>`_ and
+  `MetaWare Light Edition <https://www.synopsys.com/cgi-bin/arcmwtk_lite/reg1.cgi>`_).
+  Please note though, registration is required to access both the documentation and
+  the tools.
+
+Important note on ARC processors configurability
+################################################
+
+ARC processors are highly configurable and several configurable options
+are supported in Linux. Some options are transparent to software
+(i.e cache geometries, some can be detected at runtime and configured
+and used accordingly, while some need to be explicitly selected or configured
+in the kernel's configuration utility (AKA "make menuconfig").
+
+However not all configurable options are supported when an ARC processor
+is to run Linux. SoC design teams should refer to "Appendix E:
+Configuration for ARC Linux" in the ARC HS Databook for configurability
+guidelines.
+
+Following these guidelines and selecting valid configuration options
+up front is critical to help prevent any unwanted issues during
+SoC bringup and software development in general.
+
+Building the Linux kernel for ARC processors
+############################################
+
+The process of kernel building for ARC processors is the same as for any other
+architecture and could be done in 2 ways:
+
+- Cross-compilation: process of compiling for ARC targets on a development
+  host with a different processor architecture (generally x86_64/amd64).
+- Native compilation: process of compiling for ARC on a ARC platform
+  (hardware board or a simulator like QEMU) with complete development environment
+  (GNU toolchain, dtc, make etc) installed on the platform.
+
+In both cases, up-to-date GNU toolchain for ARC for the host is needed.
+Synopsys offers prebuilt toolchain releases which can be used for this purpose,
+available from:
+
+- Synopsys GNU toolchain releases:
+  `<https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/releases>`_
+
+- Linux kernel compilers collection:
+  `<https://mirrors.edge.kernel.org/pub/tools/crosstool>`_
+
+- Bootlin's toolchain collection: `<https://toolchains.bootlin.com>`_
+
+Once the toolchain is installed in the system, make sure its "bin" folder
+is added in your ``PATH`` environment variable. Then set ``ARCH=arc`` &
+``CROSS_COMPILE=arc-linux`` (or whatever matches installed ARC toolchain prefix)
+and then as usual ``make defconfig && make``.
+
+This will produce "vmlinux" file in the root of the kernel source tree
+usable for loading on the target system via JTAG.
+If you need to get an image usable with U-Boot bootloader,
+type ``make uImage`` and ``uImage`` will be produced in ``arch/arc/boot``
+folder.
diff --git a/Documentation/arch/arc/features.rst b/Documentation/arch/arc/features.rst
new file mode 100644
index 000000000000..b793583d688a
--- /dev/null
+++ b/Documentation/arch/arc/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features arc
diff --git a/Documentation/arch/arc/index.rst b/Documentation/arch/arc/index.rst
new file mode 100644
index 000000000000..7b098d4a5e3e
--- /dev/null
+++ b/Documentation/arch/arc/index.rst
@@ -0,0 +1,17 @@
+===================
+ARC architecture
+===================
+
+.. toctree::
+    :maxdepth: 1
+
+    arc
+
+    features
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/arch/ia64/aliasing.rst b/Documentation/arch/ia64/aliasing.rst
new file mode 100644
index 000000000000..36a1e1d4842b
--- /dev/null
+++ b/Documentation/arch/ia64/aliasing.rst
@@ -0,0 +1,246 @@
+==================================
+Memory Attribute Aliasing on IA-64
+==================================
+
+Bjorn Helgaas <bjorn.helgaas@hp.com>
+
+May 4, 2006
+
+
+Memory Attributes
+=================
+
+    Itanium supports several attributes for virtual memory references.
+    The attribute is part of the virtual translation, i.e., it is
+    contained in the TLB entry.  The ones of most interest to the Linux
+    kernel are:
+
+	==		======================
+        WB		Write-back (cacheable)
+	UC		Uncacheable
+	WC		Write-coalescing
+	==		======================
+
+    System memory typically uses the WB attribute.  The UC attribute is
+    used for memory-mapped I/O devices.  The WC attribute is uncacheable
+    like UC is, but writes may be delayed and combined to increase
+    performance for things like frame buffers.
+
+    The Itanium architecture requires that we avoid accessing the same
+    page with both a cacheable mapping and an uncacheable mapping[1].
+
+    The design of the chipset determines which attributes are supported
+    on which regions of the address space.  For example, some chipsets
+    support either WB or UC access to main memory, while others support
+    only WB access.
+
+Memory Map
+==========
+
+    Platform firmware describes the physical memory map and the
+    supported attributes for each region.  At boot-time, the kernel uses
+    the EFI GetMemoryMap() interface.  ACPI can also describe memory
+    devices and the attributes they support, but Linux/ia64 currently
+    doesn't use this information.
+
+    The kernel uses the efi_memmap table returned from GetMemoryMap() to
+    learn the attributes supported by each region of physical address
+    space.  Unfortunately, this table does not completely describe the
+    address space because some machines omit some or all of the MMIO
+    regions from the map.
+
+    The kernel maintains another table, kern_memmap, which describes the
+    memory Linux is actually using and the attribute for each region.
+    This contains only system memory; it does not contain MMIO space.
+
+    The kern_memmap table typically contains only a subset of the system
+    memory described by the efi_memmap.  Linux/ia64 can't use all memory
+    in the system because of constraints imposed by the identity mapping
+    scheme.
+
+    The efi_memmap table is preserved unmodified because the original
+    boot-time information is required for kexec.
+
+Kernel Identity Mappings
+========================
+
+    Linux/ia64 identity mappings are done with large pages, currently
+    either 16MB or 64MB, referred to as "granules."  Cacheable mappings
+    are speculative[2], so the processor can read any location in the
+    page at any time, independent of the programmer's intentions.  This
+    means that to avoid attribute aliasing, Linux can create a cacheable
+    identity mapping only when the entire granule supports cacheable
+    access.
+
+    Therefore, kern_memmap contains only full granule-sized regions that
+    can referenced safely by an identity mapping.
+
+    Uncacheable mappings are not speculative, so the processor will
+    generate UC accesses only to locations explicitly referenced by
+    software.  This allows UC identity mappings to cover granules that
+    are only partially populated, or populated with a combination of UC
+    and WB regions.
+
+User Mappings
+=============
+
+    User mappings are typically done with 16K or 64K pages.  The smaller
+    page size allows more flexibility because only 16K or 64K has to be
+    homogeneous with respect to memory attributes.
+
+Potential Attribute Aliasing Cases
+==================================
+
+    There are several ways the kernel creates new mappings:
+
+mmap of /dev/mem
+----------------
+
+	This uses remap_pfn_range(), which creates user mappings.  These
+	mappings may be either WB or UC.  If the region being mapped
+	happens to be in kern_memmap, meaning that it may also be mapped
+	by a kernel identity mapping, the user mapping must use the same
+	attribute as the kernel mapping.
+
+	If the region is not in kern_memmap, the user mapping should use
+	an attribute reported as being supported in the EFI memory map.
+
+	Since the EFI memory map does not describe MMIO on some
+	machines, this should use an uncacheable mapping as a fallback.
+
+mmap of /sys/class/pci_bus/.../legacy_mem
+-----------------------------------------
+
+	This is very similar to mmap of /dev/mem, except that legacy_mem
+	only allows mmap of the one megabyte "legacy MMIO" area for a
+	specific PCI bus.  Typically this is the first megabyte of
+	physical address space, but it may be different on machines with
+	several VGA devices.
+
+	"X" uses this to access VGA frame buffers.  Using legacy_mem
+	rather than /dev/mem allows multiple instances of X to talk to
+	different VGA cards.
+
+	The /dev/mem mmap constraints apply.
+
+mmap of /proc/bus/pci/.../??.?
+------------------------------
+
+	This is an MMIO mmap of PCI functions, which additionally may or
+	may not be requested as using the WC attribute.
+
+	If WC is requested, and the region in kern_memmap is either WC
+	or UC, and the EFI memory map designates the region as WC, then
+	the WC mapping is allowed.
+
+	Otherwise, the user mapping must use the same attribute as the
+	kernel mapping.
+
+read/write of /dev/mem
+----------------------
+
+	This uses copy_from_user(), which implicitly uses a kernel
+	identity mapping.  This is obviously safe for things in
+	kern_memmap.
+
+	There may be corner cases of things that are not in kern_memmap,
+	but could be accessed this way.  For example, registers in MMIO
+	space are not in kern_memmap, but could be accessed with a UC
+	mapping.  This would not cause attribute aliasing.  But
+	registers typically can be accessed only with four-byte or
+	eight-byte accesses, and the copy_from_user() path doesn't allow
+	any control over the access size, so this would be dangerous.
+
+ioremap()
+---------
+
+	This returns a mapping for use inside the kernel.
+
+	If the region is in kern_memmap, we should use the attribute
+	specified there.
+
+	If the EFI memory map reports that the entire granule supports
+	WB, we should use that (granules that are partially reserved
+	or occupied by firmware do not appear in kern_memmap).
+
+	If the granule contains non-WB memory, but we can cover the
+	region safely with kernel page table mappings, we can use
+	ioremap_page_range() as most other architectures do.
+
+	Failing all of the above, we have to fall back to a UC mapping.
+
+Past Problem Cases
+==================
+
+mmap of various MMIO regions from /dev/mem by "X" on Intel platforms
+--------------------------------------------------------------------
+
+      The EFI memory map may not report these MMIO regions.
+
+      These must be allowed so that X will work.  This means that
+      when the EFI memory map is incomplete, every /dev/mem mmap must
+      succeed.  It may create either WB or UC user mappings, depending
+      on whether the region is in kern_memmap or the EFI memory map.
+
+mmap of 0x0-0x9FFFF /dev/mem by "hwinfo" on HP sx1000 with VGA enabled
+----------------------------------------------------------------------
+
+      The EFI memory map reports the following attributes:
+
+        =============== ======= ==================
+        0x00000-0x9FFFF WB only
+        0xA0000-0xBFFFF UC only (VGA frame buffer)
+        0xC0000-0xFFFFF WB only
+        =============== ======= ==================
+
+      This mmap is done with user pages, not kernel identity mappings,
+      so it is safe to use WB mappings.
+
+      The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000,
+      which uses a granule-sized UC mapping.  This granule will cover some
+      WB-only memory, but since UC is non-speculative, the processor will
+      never generate an uncacheable reference to the WB-only areas unless
+      the driver explicitly touches them.
+
+mmap of 0x0-0xFFFFF legacy_mem by "X"
+-------------------------------------
+
+      If the EFI memory map reports that the entire range supports the
+      same attributes, we can allow the mmap (and we will prefer WB if
+      supported, as is the case with HP sx[12]000 machines with VGA
+      disabled).
+
+      If EFI reports the range as partly WB and partly UC (as on sx[12]000
+      machines with VGA enabled), we must fail the mmap because there's no
+      safe attribute to use.
+
+      If EFI reports some of the range but not all (as on Intel firmware
+      that doesn't report the VGA frame buffer at all), we should fail the
+      mmap and force the user to map just the specific region of interest.
+
+mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled
+------------------------------------------------------------------------
+
+      The EFI memory map reports the following attributes::
+
+        0x00000-0xFFFFF WB only (no VGA MMIO hole)
+
+      This is a special case of the previous case, and the mmap should
+      fail for the same reason as above.
+
+read of /sys/devices/.../rom
+----------------------------
+
+      For VGA devices, this may cause an ioremap() of 0xC0000.  This
+      used to be done with a UC mapping, because the VGA frame buffer
+      at 0xA0000 prevents use of a WB granule.  The UC mapping causes
+      an MCA on HP sx[12]000 chipsets.
+
+      We should use WB page table mappings to avoid covering the VGA
+      frame buffer.
+
+Notes
+=====
+
+    [1] SDM rev 2.2, vol 2, sec 4.4.1.
+    [2] SDM rev 2.2, vol 2, sec 4.4.6.
diff --git a/Documentation/arch/ia64/efirtc.rst b/Documentation/arch/ia64/efirtc.rst
new file mode 100644
index 000000000000..fd8328408301
--- /dev/null
+++ b/Documentation/arch/ia64/efirtc.rst
@@ -0,0 +1,144 @@
+==========================
+EFI Real Time Clock driver
+==========================
+
+S. Eranian <eranian@hpl.hp.com>
+
+March 2000
+
+1. Introduction
+===============
+
+This document describes the efirtc.c driver has provided for
+the IA-64 platform.
+
+The purpose of this driver is to supply an API for kernel and user applications
+to get access to the Time Service offered by EFI version 0.92.
+
+EFI provides 4 calls one can make once the OS is booted: GetTime(),
+SetTime(), GetWakeupTime(), SetWakeupTime() which are all supported by this
+driver. We describe those calls as well the design of the driver in the
+following sections.
+
+2. Design Decisions
+===================
+
+The original ideas was to provide a very simple driver to get access to,
+at first, the time of day service. This is required in order to access, in a
+portable way, the CMOS clock. A program like /sbin/hwclock uses such a clock
+to initialize the system view of the time during boot.
+
+Because we wanted to minimize the impact on existing user-level apps using
+the CMOS clock, we decided to expose an API that was very similar to the one
+used today with the legacy RTC driver (driver/char/rtc.c). However, because
+EFI provides a simpler services, not all ioctl() are available. Also
+new ioctl()s have been introduced for things that EFI provides but not the
+legacy.
+
+EFI uses a slightly different way of representing the time, noticeably
+the reference date is different. Year is the using the full 4-digit format.
+The Epoch is January 1st 1998. For backward compatibility reasons we don't
+expose this new way of representing time. Instead we use something very
+similar to the struct tm, i.e. struct rtc_time, as used by hwclock.
+One of the reasons for doing it this way is to allow for EFI to still evolve
+without necessarily impacting any of the user applications. The decoupling
+enables flexibility and permits writing wrapper code is ncase things change.
+
+The driver exposes two interfaces, one via the device file and a set of
+ioctl()s. The other is read-only via the /proc filesystem.
+
+As of today we don't offer a /proc/sys interface.
+
+To allow for a uniform interface between the legacy RTC and EFI time service,
+we have created the include/linux/rtc.h header file to contain only the
+"public" API of the two drivers.  The specifics of the legacy RTC are still
+in include/linux/mc146818rtc.h.
+
+
+3. Time of day service
+======================
+
+The part of the driver gives access to the time of day service of EFI.
+Two ioctl()s, compatible with the legacy RTC calls:
+
+	Read the CMOS clock::
+
+		ioctl(d, RTC_RD_TIME, &rtc);
+
+	Write the CMOS clock::
+
+		ioctl(d, RTC_SET_TIME, &rtc);
+
+The rtc is a pointer to a data structure defined in rtc.h which is close
+to a struct tm::
+
+  struct rtc_time {
+          int tm_sec;
+          int tm_min;
+          int tm_hour;
+          int tm_mday;
+          int tm_mon;
+          int tm_year;
+          int tm_wday;
+          int tm_yday;
+          int tm_isdst;
+  };
+
+The driver takes care of converting back an forth between the EFI time and
+this format.
+
+Those two ioctl()s can be exercised with the hwclock command:
+
+For reading::
+
+	# /sbin/hwclock --show
+	Mon Mar  6 15:32:32 2000  -0.910248 seconds
+
+For setting::
+
+	# /sbin/hwclock --systohc
+
+Root privileges are required to be able to set the time of day.
+
+4. Wakeup Alarm service
+=======================
+
+EFI provides an API by which one can program when a machine should wakeup,
+i.e. reboot. This is very different from the alarm provided by the legacy
+RTC which is some kind of interval timer alarm. For this reason we don't use
+the same ioctl()s to get access to the service. Instead we have
+introduced 2 news ioctl()s to the interface of an RTC.
+
+We have added 2 new ioctl()s that are specific to the EFI driver:
+
+	Read the current state of the alarm::
+
+		ioctl(d, RTC_WKALM_RD, &wkt)
+
+	Set the alarm or change its status::
+
+		ioctl(d, RTC_WKALM_SET, &wkt)
+
+The wkt structure encapsulates a struct rtc_time + 2 extra fields to get
+status information::
+
+  struct rtc_wkalrm {
+
+          unsigned char enabled; /* =1 if alarm is enabled */
+          unsigned char pending; /* =1 if alarm is pending  */
+
+          struct rtc_time time;
+  }
+
+As of today, none of the existing user-level apps supports this feature.
+However writing such a program should be hard by simply using those two
+ioctl().
+
+Root privileges are required to be able to set the alarm.
+
+5. References
+=============
+
+Checkout the following Web site for more information on EFI:
+
+http://developer.intel.com/technology/efi/
diff --git a/Documentation/arch/ia64/err_inject.rst b/Documentation/arch/ia64/err_inject.rst
new file mode 100644
index 000000000000..900f71e93a29
--- /dev/null
+++ b/Documentation/arch/ia64/err_inject.rst
@@ -0,0 +1,1067 @@
+========================================
+IPF Machine Check (MC) error inject tool
+========================================
+
+IPF Machine Check (MC) error inject tool is used to inject MC
+errors from Linux. The tool is a test bed for IPF MC work flow including
+hardware correctable error handling, OS recoverable error handling, MC
+event logging, etc.
+
+The tool includes two parts: a kernel driver and a user application
+sample. The driver provides interface to PAL to inject error
+and query error injection capabilities. The driver code is in
+arch/ia64/kernel/err_inject.c. The application sample (shown below)
+provides a combination of various errors and calls the driver's interface
+(sysfs interface) to inject errors or query error injection capabilities.
+
+The tool can be used to test Intel IPF machine MC handling capabilities.
+It's especially useful for people who can not access hardware MC injection
+tool to inject error. It's also very useful to integrate with other
+software test suits to do stressful testing on IPF.
+
+Below is a sample application as part of the whole tool. The sample
+can be used as a working test tool. Or it can be expanded to include
+more features. It also can be a integrated into a library or other user
+application to have more thorough test.
+
+The sample application takes err.conf as error configuration input. GCC
+compiles the code. After you install err_inject driver, you can run
+this sample application to inject errors.
+
+Errata: Itanium 2 Processors Specification Update lists some errata against
+the pal_mc_error_inject PAL procedure. The following err.conf has been tested
+on latest Montecito PAL.
+
+err.conf::
+
+  #This is configuration file for err_inject_tool.
+  #The format of the each line is:
+  #cpu, loop, interval, err_type_info, err_struct_info, err_data_buffer
+  #where
+  #	cpu: logical cpu number the error will be inject in.
+  #	loop: times the error will be injected.
+  #	interval: In second. every so often one error is injected.
+  #	err_type_info, err_struct_info: PAL parameters.
+  #
+  #Note: All values are hex w/o or w/ 0x prefix.
+
+
+  #On cpu2, inject only total 0x10 errors, interval 5 seconds
+  #corrected, data cache, hier-2, physical addr(assigned by tool code).
+  #working on Montecito latest PAL.
+  2, 10, 5, 4101, 95
+
+  #On cpu4, inject and consume total 0x10 errors, interval 5 seconds
+  #corrected, data cache, hier-2, physical addr(assigned by tool code).
+  #working on Montecito latest PAL.
+  4, 10, 5, 4109, 95
+
+  #On cpu15, inject and consume total 0x10 errors, interval 5 seconds
+  #recoverable, DTR0, hier-2.
+  #working on Montecito latest PAL.
+  0xf, 0x10, 5, 4249, 15
+
+The sample application source code:
+
+err_injection_tool.c::
+
+  /*
+   * This program is free software; you can redistribute it and/or modify
+   * it under the terms of the GNU General Public License as published by
+   * the Free Software Foundation; either version 2 of the License, or
+   * (at your option) any later version.
+   *
+   * This program is distributed in the hope that it will be useful, but
+   * WITHOUT ANY WARRANTY; without even the implied warranty of
+   * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+   * NON INFRINGEMENT.  See the GNU General Public License for more
+   * details.
+   *
+   * You should have received a copy of the GNU General Public License
+   * along with this program; if not, write to the Free Software
+   * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+   *
+   * Copyright (C) 2006 Intel Co
+   *	Fenghua Yu <fenghua.yu@intel.com>
+   *
+   */
+  #include <sys/types.h>
+  #include <sys/stat.h>
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <sched.h>
+  #include <unistd.h>
+  #include <stdlib.h>
+  #include <stdarg.h>
+  #include <string.h>
+  #include <errno.h>
+  #include <time.h>
+  #include <sys/ipc.h>
+  #include <sys/sem.h>
+  #include <sys/wait.h>
+  #include <sys/mman.h>
+  #include <sys/shm.h>
+
+  #define MAX_FN_SIZE 		256
+  #define MAX_BUF_SIZE 		256
+  #define DATA_BUF_SIZE 		256
+  #define NR_CPUS 		512
+  #define MAX_TASK_NUM		2048
+  #define MIN_INTERVAL		5	// seconds
+  #define	ERR_DATA_BUFFER_SIZE 	3	// Three 8-byte.
+  #define PARA_FIELD_NUM		5
+  #define MASK_SIZE		(NR_CPUS/64)
+  #define PATH_FORMAT "/sys/devices/system/cpu/cpu%d/err_inject/"
+
+  int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);
+
+  int verbose;
+  #define vbprintf if (verbose) printf
+
+  int log_info(int cpu, const char *fmt, ...)
+  {
+	FILE *log;
+	char fn[MAX_FN_SIZE];
+	char buf[MAX_BUF_SIZE];
+	va_list args;
+
+	sprintf(fn, "%d.log", cpu);
+	log=fopen(fn, "a+");
+	if (log==NULL) {
+		perror("Error open:");
+		return -1;
+	}
+
+	va_start(args, fmt);
+	vprintf(fmt, args);
+	memset(buf, 0, MAX_BUF_SIZE);
+	vsprintf(buf, fmt, args);
+	va_end(args);
+
+	fwrite(buf, sizeof(buf), 1, log);
+	fclose(log);
+
+	return 0;
+  }
+
+  typedef unsigned long u64;
+  typedef unsigned int  u32;
+
+  typedef union err_type_info_u {
+	struct {
+		u64	mode		: 3,	/* 0-2 */
+			err_inj		: 3,	/* 3-5 */
+			err_sev		: 2,	/* 6-7 */
+			err_struct	: 5,	/* 8-12 */
+			struct_hier	: 3,	/* 13-15 */
+			reserved	: 48;	/* 16-63 */
+	} err_type_info_u;
+	u64	err_type_info;
+  } err_type_info_t;
+
+  typedef union err_struct_info_u {
+	struct {
+		u64	siv		: 1,	/* 0	 */
+			c_t		: 2,	/* 1-2	 */
+			cl_p		: 3,	/* 3-5	 */
+			cl_id		: 3,	/* 6-8	 */
+			cl_dp		: 1,	/* 9	 */
+			reserved1	: 22,	/* 10-31 */
+			tiv		: 1,	/* 32	 */
+			trigger		: 4,	/* 33-36 */
+			trigger_pl 	: 3,	/* 37-39 */
+			reserved2 	: 24;	/* 40-63 */
+	} err_struct_info_cache;
+	struct {
+		u64	siv		: 1,	/* 0	 */
+			tt		: 2,	/* 1-2	 */
+			tc_tr		: 2,	/* 3-4	 */
+			tr_slot		: 8,	/* 5-12	 */
+			reserved1	: 19,	/* 13-31 */
+			tiv		: 1,	/* 32	 */
+			trigger		: 4,	/* 33-36 */
+			trigger_pl 	: 3,	/* 37-39 */
+			reserved2 	: 24;	/* 40-63 */
+	} err_struct_info_tlb;
+	struct {
+		u64	siv		: 1,	/* 0	 */
+			regfile_id	: 4,	/* 1-4	 */
+			reg_num		: 7,	/* 5-11	 */
+			reserved1	: 20,	/* 12-31 */
+			tiv		: 1,	/* 32	 */
+			trigger		: 4,	/* 33-36 */
+			trigger_pl 	: 3,	/* 37-39 */
+			reserved2 	: 24;	/* 40-63 */
+	} err_struct_info_register;
+	struct {
+		u64	reserved;
+	} err_struct_info_bus_processor_interconnect;
+	u64	err_struct_info;
+  } err_struct_info_t;
+
+  typedef union err_data_buffer_u {
+	struct {
+		u64	trigger_addr;		/* 0-63		*/
+		u64	inj_addr;		/* 64-127 	*/
+		u64	way		: 5,	/* 128-132	*/
+			index		: 20,	/* 133-152	*/
+					: 39;	/* 153-191	*/
+	} err_data_buffer_cache;
+	struct {
+		u64	trigger_addr;		/* 0-63		*/
+		u64	inj_addr;		/* 64-127 	*/
+		u64	way		: 5,	/* 128-132	*/
+			index		: 20,	/* 133-152	*/
+			reserved	: 39;	/* 153-191	*/
+	} err_data_buffer_tlb;
+	struct {
+		u64	trigger_addr;		/* 0-63		*/
+	} err_data_buffer_register;
+	struct {
+		u64	reserved;		/* 0-63		*/
+	} err_data_buffer_bus_processor_interconnect;
+	u64 err_data_buffer[ERR_DATA_BUFFER_SIZE];
+  } err_data_buffer_t;
+
+  typedef union capabilities_u {
+	struct {
+		u64	i		: 1,
+			d		: 1,
+			rv		: 1,
+			tag		: 1,
+			data		: 1,
+			mesi		: 1,
+			dp		: 1,
+			reserved1	: 3,
+			pa		: 1,
+			va		: 1,
+			wi		: 1,
+			reserved2	: 20,
+			trigger		: 1,
+			trigger_pl	: 1,
+			reserved3	: 30;
+	} capabilities_cache;
+	struct {
+		u64	d		: 1,
+			i		: 1,
+			rv		: 1,
+			tc		: 1,
+			tr		: 1,
+			reserved1	: 27,
+			trigger		: 1,
+			trigger_pl	: 1,
+			reserved2	: 30;
+	} capabilities_tlb;
+	struct {
+		u64	gr_b0		: 1,
+			gr_b1		: 1,
+			fr		: 1,
+			br		: 1,
+			pr		: 1,
+			ar		: 1,
+			cr		: 1,
+			rr		: 1,
+			pkr		: 1,
+			dbr		: 1,
+			ibr		: 1,
+			pmc		: 1,
+			pmd		: 1,
+			reserved1	: 3,
+			regnum		: 1,
+			reserved2	: 15,
+			trigger		: 1,
+			trigger_pl	: 1,
+			reserved3	: 30;
+	} capabilities_register;
+	struct {
+		u64	reserved;
+	} capabilities_bus_processor_interconnect;
+  } capabilities_t;
+
+  typedef struct resources_s {
+	u64	ibr0		: 1,
+		ibr2		: 1,
+		ibr4		: 1,
+		ibr6		: 1,
+		dbr0		: 1,
+		dbr2		: 1,
+		dbr4		: 1,
+		dbr6		: 1,
+		reserved	: 48;
+  } resources_t;
+
+
+  long get_page_size(void)
+  {
+	long page_size=sysconf(_SC_PAGESIZE);
+	return page_size;
+  }
+
+  #define PAGE_SIZE (get_page_size()==-1?0x4000:get_page_size())
+  #define SHM_SIZE (2*PAGE_SIZE*NR_CPUS)
+  #define SHM_VA 0x2000000100000000
+
+  int shmid;
+  void *shmaddr;
+
+  int create_shm(void)
+  {
+	key_t key;
+	char fn[MAX_FN_SIZE];
+
+	/* cpu0 is always existing */
+	sprintf(fn, PATH_FORMAT, 0);
+	if ((key = ftok(fn, 's')) == -1) {
+		perror("ftok");
+		return -1;
+	}
+
+	shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT);
+	if (shmid == -1) {
+		if (errno==EEXIST) {
+			shmid = shmget(key, SHM_SIZE, 0);
+			if (shmid == -1) {
+				perror("shmget");
+				return -1;
+			}
+		}
+		else {
+			perror("shmget");
+			return -1;
+		}
+	}
+	vbprintf("shmid=%d", shmid);
+
+	/* connect to the segment: */
+	shmaddr = shmat(shmid, (void *)SHM_VA, 0);
+	if (shmaddr == (void*)-1) {
+		perror("shmat");
+		return -1;
+	}
+
+	memset(shmaddr, 0, SHM_SIZE);
+	mlock(shmaddr, SHM_SIZE);
+
+	return 0;
+  }
+
+  int free_shm()
+  {
+	munlock(shmaddr, SHM_SIZE);
+          shmdt(shmaddr);
+	semctl(shmid, 0, IPC_RMID);
+
+	return 0;
+  }
+
+  #ifdef _SEM_SEMUN_UNDEFINED
+  union semun
+  {
+	int val;
+	struct semid_ds *buf;
+	unsigned short int *array;
+	struct seminfo *__buf;
+  };
+  #endif
+
+  u32 mode=1; /* 1: physical mode; 2: virtual mode. */
+  int one_lock=1;
+  key_t key[NR_CPUS];
+  int semid[NR_CPUS];
+
+  int create_sem(int cpu)
+  {
+	union semun arg;
+	char fn[MAX_FN_SIZE];
+	int sid;
+
+	sprintf(fn, PATH_FORMAT, cpu);
+	sprintf(fn, "%s/%s", fn, "err_type_info");
+	if ((key[cpu] = ftok(fn, 'e')) == -1) {
+		perror("ftok");
+		return -1;
+	}
+
+	if (semid[cpu]!=0)
+		return 0;
+
+	/* clear old semaphore */
+	if ((sid = semget(key[cpu], 1, 0)) != -1)
+		semctl(sid, 0, IPC_RMID);
+
+	/* get one semaphore */
+	if ((semid[cpu] = semget(key[cpu], 1, IPC_CREAT | IPC_EXCL)) == -1) {
+		perror("semget");
+		printf("Please remove semaphore with key=0x%lx, then run the tool.\n",
+			(u64)key[cpu]);
+		return -1;
+	}
+
+	vbprintf("semid[%d]=0x%lx, key[%d]=%lx\n",cpu,(u64)semid[cpu],cpu,
+		(u64)key[cpu]);
+	/* initialize the semaphore to 1: */
+	arg.val = 1;
+	if (semctl(semid[cpu], 0, SETVAL, arg) == -1) {
+		perror("semctl");
+		return -1;
+	}
+
+	return 0;
+  }
+
+  static int lock(int cpu)
+  {
+	struct sembuf lock;
+
+	lock.sem_num = cpu;
+	lock.sem_op = 1;
+	semop(semid[cpu], &lock, 1);
+
+          return 0;
+  }
+
+  static int unlock(int cpu)
+  {
+	struct sembuf unlock;
+
+	unlock.sem_num = cpu;
+	unlock.sem_op = -1;
+	semop(semid[cpu], &unlock, 1);
+
+          return 0;
+  }
+
+  void free_sem(int cpu)
+  {
+	semctl(semid[cpu], 0, IPC_RMID);
+  }
+
+  int wr_multi(char *fn, unsigned long *data, int size)
+  {
+	int fd;
+	char buf[MAX_BUF_SIZE];
+	int ret;
+
+	if (size==1)
+		sprintf(buf, "%lx", *data);
+	else if (size==3)
+		sprintf(buf, "%lx,%lx,%lx", data[0], data[1], data[2]);
+	else {
+		fprintf(stderr,"write to file with wrong size!\n");
+		return -1;
+	}
+
+	fd=open(fn, O_RDWR);
+	if (!fd) {
+		perror("Error:");
+		return -1;
+	}
+	ret=write(fd, buf, sizeof(buf));
+	close(fd);
+	return ret;
+  }
+
+  int wr(char *fn, unsigned long data)
+  {
+	return wr_multi(fn, &data, 1);
+  }
+
+  int rd(char *fn, unsigned long *data)
+  {
+	int fd;
+	char buf[MAX_BUF_SIZE];
+
+	fd=open(fn, O_RDONLY);
+	if (fd<0) {
+		perror("Error:");
+		return -1;
+	}
+	read(fd, buf, MAX_BUF_SIZE);
+	*data=strtoul(buf, NULL, 16);
+	close(fd);
+	return 0;
+  }
+
+  int rd_status(char *path, int *status)
+  {
+	char fn[MAX_FN_SIZE];
+	sprintf(fn, "%s/status", path);
+	if (rd(fn, (u64*)status)<0) {
+		perror("status reading error.\n");
+		return -1;
+	}
+
+	return 0;
+  }
+
+  int rd_capabilities(char *path, u64 *capabilities)
+  {
+	char fn[MAX_FN_SIZE];
+	sprintf(fn, "%s/capabilities", path);
+	if (rd(fn, capabilities)<0) {
+		perror("capabilities reading error.\n");
+		return -1;
+	}
+
+	return 0;
+  }
+
+  int rd_all(char *path)
+  {
+	unsigned long err_type_info, err_struct_info, err_data_buffer;
+	int status;
+	unsigned long capabilities, resources;
+	char fn[MAX_FN_SIZE];
+
+	sprintf(fn, "%s/err_type_info", path);
+	if (rd(fn, &err_type_info)<0) {
+		perror("err_type_info reading error.\n");
+		return -1;
+	}
+	printf("err_type_info=%lx\n", err_type_info);
+
+	sprintf(fn, "%s/err_struct_info", path);
+	if (rd(fn, &err_struct_info)<0) {
+		perror("err_struct_info reading error.\n");
+		return -1;
+	}
+	printf("err_struct_info=%lx\n", err_struct_info);
+
+	sprintf(fn, "%s/err_data_buffer", path);
+	if (rd(fn, &err_data_buffer)<0) {
+		perror("err_data_buffer reading error.\n");
+		return -1;
+	}
+	printf("err_data_buffer=%lx\n", err_data_buffer);
+
+	sprintf(fn, "%s/status", path);
+	if (rd("status", (u64*)&status)<0) {
+		perror("status reading error.\n");
+		return -1;
+	}
+	printf("status=%d\n", status);
+
+	sprintf(fn, "%s/capabilities", path);
+	if (rd(fn,&capabilities)<0) {
+		perror("capabilities reading error.\n");
+		return -1;
+	}
+	printf("capabilities=%lx\n", capabilities);
+
+	sprintf(fn, "%s/resources", path);
+	if (rd(fn, &resources)<0) {
+		perror("resources reading error.\n");
+		return -1;
+	}
+	printf("resources=%lx\n", resources);
+
+	return 0;
+  }
+
+  int query_capabilities(char *path, err_type_info_t err_type_info,
+			u64 *capabilities)
+  {
+	char fn[MAX_FN_SIZE];
+	err_struct_info_t err_struct_info;
+	err_data_buffer_t err_data_buffer;
+
+	err_struct_info.err_struct_info=0;
+	memset(err_data_buffer.err_data_buffer, -1, ERR_DATA_BUFFER_SIZE*8);
+
+	sprintf(fn, "%s/err_type_info", path);
+	wr(fn, err_type_info.err_type_info);
+	sprintf(fn, "%s/err_struct_info", path);
+	wr(fn, 0x0);
+	sprintf(fn, "%s/err_data_buffer", path);
+	wr_multi(fn, err_data_buffer.err_data_buffer, ERR_DATA_BUFFER_SIZE);
+
+	// Fire pal_mc_error_inject procedure.
+	sprintf(fn, "%s/call_start", path);
+	wr(fn, mode);
+
+	if (rd_capabilities(path, capabilities)<0)
+		return -1;
+
+	return 0;
+  }
+
+  int query_all_capabilities()
+  {
+	int status;
+	err_type_info_t err_type_info;
+	int err_sev, err_struct, struct_hier;
+	int cap=0;
+	u64 capabilities;
+	char path[MAX_FN_SIZE];
+
+	err_type_info.err_type_info=0;			// Initial
+	err_type_info.err_type_info_u.mode=0;		// Query mode;
+	err_type_info.err_type_info_u.err_inj=0;
+
+	printf("All capabilities implemented in pal_mc_error_inject:\n");
+	sprintf(path, PATH_FORMAT ,0);
+	for (err_sev=0;err_sev<3;err_sev++)
+		for (err_struct=0;err_struct<5;err_struct++)
+			for (struct_hier=0;struct_hier<5;struct_hier++)
+	{
+		status=-1;
+		capabilities=0;
+		err_type_info.err_type_info_u.err_sev=err_sev;
+		err_type_info.err_type_info_u.err_struct=err_struct;
+		err_type_info.err_type_info_u.struct_hier=struct_hier;
+
+		if (query_capabilities(path, err_type_info, &capabilities)<0)
+			continue;
+
+		if (rd_status(path, &status)<0)
+			continue;
+
+		if (status==0) {
+			cap=1;
+			printf("For err_sev=%d, err_struct=%d, struct_hier=%d: ",
+				err_sev, err_struct, struct_hier);
+			printf("capabilities 0x%lx\n", capabilities);
+		}
+	}
+	if (!cap) {
+		printf("No capabilities supported.\n");
+		return 0;
+	}
+
+	return 0;
+  }
+
+  int err_inject(int cpu, char *path, err_type_info_t err_type_info,
+		err_struct_info_t err_struct_info,
+		err_data_buffer_t err_data_buffer)
+  {
+	int status;
+	char fn[MAX_FN_SIZE];
+
+	log_info(cpu, "err_type_info=%lx, err_struct_info=%lx, ",
+		err_type_info.err_type_info,
+		err_struct_info.err_struct_info);
+	log_info(cpu,"err_data_buffer=[%lx,%lx,%lx]\n",
+		err_data_buffer.err_data_buffer[0],
+		err_data_buffer.err_data_buffer[1],
+		err_data_buffer.err_data_buffer[2]);
+	sprintf(fn, "%s/err_type_info", path);
+	wr(fn, err_type_info.err_type_info);
+	sprintf(fn, "%s/err_struct_info", path);
+	wr(fn, err_struct_info.err_struct_info);
+	sprintf(fn, "%s/err_data_buffer", path);
+	wr_multi(fn, err_data_buffer.err_data_buffer, ERR_DATA_BUFFER_SIZE);
+
+	// Fire pal_mc_error_inject procedure.
+	sprintf(fn, "%s/call_start", path);
+	wr(fn,mode);
+
+	if (rd_status(path, &status)<0) {
+		vbprintf("fail: read status\n");
+		return -100;
+	}
+
+	if (status!=0) {
+		log_info(cpu, "fail: status=%d\n", status);
+		return status;
+	}
+
+	return status;
+  }
+
+  static int construct_data_buf(char *path, err_type_info_t err_type_info,
+		err_struct_info_t err_struct_info,
+		err_data_buffer_t *err_data_buffer,
+		void *va1)
+  {
+	char fn[MAX_FN_SIZE];
+	u64 virt_addr=0, phys_addr=0;
+
+	vbprintf("va1=%lx\n", (u64)va1);
+	memset(&err_data_buffer->err_data_buffer_cache, 0, ERR_DATA_BUFFER_SIZE*8);
+
+	switch (err_type_info.err_type_info_u.err_struct) {
+		case 1: // Cache
+			switch (err_struct_info.err_struct_info_cache.cl_id) {
+				case 1: //Virtual addr
+					err_data_buffer->err_data_buffer_cache.inj_addr=(u64)va1;
+					break;
+				case 2: //Phys addr
+					sprintf(fn, "%s/virtual_to_phys", path);
+					virt_addr=(u64)va1;
+					if (wr(fn,virt_addr)<0)
+						return -1;
+					rd(fn, &phys_addr);
+					err_data_buffer->err_data_buffer_cache.inj_addr=phys_addr;
+					break;
+				default:
+					printf("Not supported cl_id\n");
+					break;
+			}
+			break;
+		case 2: //  TLB
+			break;
+		case 3: //  Register file
+			break;
+		case 4: //  Bus/system interconnect
+		default:
+			printf("Not supported err_struct\n");
+			break;
+	}
+
+	return 0;
+  }
+
+  typedef struct {
+	u64 cpu;
+	u64 loop;
+	u64 interval;
+	u64 err_type_info;
+	u64 err_struct_info;
+	u64 err_data_buffer[ERR_DATA_BUFFER_SIZE];
+  } parameters_t;
+
+  parameters_t line_para;
+  int para;
+
+  static int empty_data_buffer(u64 *err_data_buffer)
+  {
+	int empty=1;
+	int i;
+
+	for (i=0;i<ERR_DATA_BUFFER_SIZE; i++)
+	   if (err_data_buffer[i]!=-1)
+		empty=0;
+
+	return empty;
+  }
+
+  int err_inj()
+  {
+	err_type_info_t err_type_info;
+	err_struct_info_t err_struct_info;
+	err_data_buffer_t err_data_buffer;
+	int count;
+	FILE *fp;
+	unsigned long cpu, loop, interval, err_type_info_conf, err_struct_info_conf;
+	u64 err_data_buffer_conf[ERR_DATA_BUFFER_SIZE];
+	int num;
+	int i;
+	char path[MAX_FN_SIZE];
+	parameters_t parameters[MAX_TASK_NUM]={};
+	pid_t child_pid[MAX_TASK_NUM];
+	time_t current_time;
+	int status;
+
+	if (!para) {
+	    fp=fopen("err.conf", "r");
+	    if (fp==NULL) {
+		perror("Error open err.conf");
+		return -1;
+	    }
+
+	    num=0;
+	    while (!feof(fp)) {
+		char buf[256];
+		memset(buf,0,256);
+		fgets(buf, 256, fp);
+		count=sscanf(buf, "%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx\n",
+				&cpu, &loop, &interval,&err_type_info_conf,
+				&err_struct_info_conf,
+				&err_data_buffer_conf[0],
+				&err_data_buffer_conf[1],
+				&err_data_buffer_conf[2]);
+		if (count!=PARA_FIELD_NUM+3) {
+			err_data_buffer_conf[0]=-1;
+			err_data_buffer_conf[1]=-1;
+			err_data_buffer_conf[2]=-1;
+			count=sscanf(buf, "%lx, %lx, %lx, %lx, %lx\n",
+				&cpu, &loop, &interval,&err_type_info_conf,
+				&err_struct_info_conf);
+			if (count!=PARA_FIELD_NUM)
+				continue;
+		}
+
+		parameters[num].cpu=cpu;
+		parameters[num].loop=loop;
+		parameters[num].interval= interval>MIN_INTERVAL
+					  ?interval:MIN_INTERVAL;
+		parameters[num].err_type_info=err_type_info_conf;
+		parameters[num].err_struct_info=err_struct_info_conf;
+		memcpy(parameters[num++].err_data_buffer,
+			err_data_buffer_conf,ERR_DATA_BUFFER_SIZE*8) ;
+
+		if (num>=MAX_TASK_NUM)
+			break;
+	    }
+	}
+	else {
+		parameters[0].cpu=line_para.cpu;
+		parameters[0].loop=line_para.loop;
+		parameters[0].interval= line_para.interval>MIN_INTERVAL
+					  ?line_para.interval:MIN_INTERVAL;
+		parameters[0].err_type_info=line_para.err_type_info;
+		parameters[0].err_struct_info=line_para.err_struct_info;
+		memcpy(parameters[0].err_data_buffer,
+			line_para.err_data_buffer,ERR_DATA_BUFFER_SIZE*8) ;
+
+		num=1;
+	}
+
+	/* Create semaphore: If one_lock, one semaphore for all processors.
+	   Otherwise, one semaphore for each processor. */
+	if (one_lock) {
+		if (create_sem(0)) {
+			printf("Can not create semaphore...exit\n");
+			free_sem(0);
+			return -1;
+		}
+	}
+	else {
+		for (i=0;i<num;i++) {
+		   if (create_sem(parameters[i].cpu)) {
+			printf("Can not create semaphore for cpu%d...exit\n",i);
+			free_sem(parameters[num].cpu);
+			return -1;
+		   }
+		}
+	}
+
+	/* Create a shm segment which will be used to inject/consume errors on.*/
+	if (create_shm()==-1) {
+		printf("Error to create shm...exit\n");
+		return -1;
+	}
+
+	for (i=0;i<num;i++) {
+		pid_t pid;
+
+		current_time=time(NULL);
+		log_info(parameters[i].cpu, "\nBegine at %s", ctime(&current_time));
+		log_info(parameters[i].cpu, "Configurations:\n");
+		log_info(parameters[i].cpu,"On cpu%ld: loop=%lx, interval=%lx(s)",
+			parameters[i].cpu,
+			parameters[i].loop,
+			parameters[i].interval);
+		log_info(parameters[i].cpu," err_type_info=%lx,err_struct_info=%lx\n",
+			parameters[i].err_type_info,
+			parameters[i].err_struct_info);
+
+		sprintf(path, PATH_FORMAT, (int)parameters[i].cpu);
+		err_type_info.err_type_info=parameters[i].err_type_info;
+		err_struct_info.err_struct_info=parameters[i].err_struct_info;
+		memcpy(err_data_buffer.err_data_buffer,
+			parameters[i].err_data_buffer,
+			ERR_DATA_BUFFER_SIZE*8);
+
+		pid=fork();
+		if (pid==0) {
+			unsigned long mask[MASK_SIZE];
+			int j, k;
+
+			void *va1, *va2;
+
+			/* Allocate two memory areas va1 and va2 in shm */
+			va1=shmaddr+parameters[i].cpu*PAGE_SIZE;
+			va2=shmaddr+parameters[i].cpu*PAGE_SIZE+PAGE_SIZE;
+
+			vbprintf("va1=%lx, va2=%lx\n", (u64)va1, (u64)va2);
+			memset(va1, 0x1, PAGE_SIZE);
+			memset(va2, 0x2, PAGE_SIZE);
+
+			if (empty_data_buffer(err_data_buffer.err_data_buffer))
+				/* If not specified yet, construct data buffer
+				 * with va1
+				 */
+				construct_data_buf(path, err_type_info,
+					err_struct_info, &err_data_buffer,va1);
+
+			for (j=0;j<MASK_SIZE;j++)
+				mask[j]=0;
+
+			cpu=parameters[i].cpu;
+			k = cpu%64;
+			j = cpu/64;
+			mask[j] = 1UL << k;
+
+			if (sched_setaffinity(0, MASK_SIZE*8, mask)==-1) {
+				perror("Error sched_setaffinity:");
+				return -1;
+			}
+
+			for (j=0; j<parameters[i].loop; j++) {
+				log_info(parameters[i].cpu,"Injection ");
+				log_info(parameters[i].cpu,"on cpu%ld: #%d/%ld ",
+
+					parameters[i].cpu,j+1, parameters[i].loop);
+
+				/* Hold the lock */
+				if (one_lock)
+					lock(0);
+				else
+				/* Hold lock on this cpu */
+					lock(parameters[i].cpu);
+
+				if ((status=err_inject(parameters[i].cpu,
+					   path, err_type_info,
+					   err_struct_info, err_data_buffer))
+					   ==0) {
+					/* consume the error for "inject only"*/
+					memcpy(va2, va1, PAGE_SIZE);
+					memcpy(va1, va2, PAGE_SIZE);
+					log_info(parameters[i].cpu,
+						"successful\n");
+				}
+				else {
+					log_info(parameters[i].cpu,"fail:");
+					log_info(parameters[i].cpu,
+						"status=%d\n", status);
+					unlock(parameters[i].cpu);
+					break;
+				}
+				if (one_lock)
+				/* Release the lock */
+					unlock(0);
+				/* Release lock on this cpu */
+				else
+					unlock(parameters[i].cpu);
+
+				if (j < parameters[i].loop-1)
+					sleep(parameters[i].interval);
+			}
+			current_time=time(NULL);
+			log_info(parameters[i].cpu, "Done at %s", ctime(&current_time));
+			return 0;
+		}
+		else if (pid<0) {
+			perror("Error fork:");
+			continue;
+		}
+		child_pid[i]=pid;
+	}
+	for (i=0;i<num;i++)
+		waitpid(child_pid[i], NULL, 0);
+
+	if (one_lock)
+		free_sem(0);
+	else
+		for (i=0;i<num;i++)
+			free_sem(parameters[i].cpu);
+
+	printf("All done.\n");
+
+	return 0;
+  }
+
+  void help()
+  {
+	printf("err_inject_tool:\n");
+	printf("\t-q: query all capabilities. default: off\n");
+	printf("\t-m: procedure mode. 1: physical 2: virtual. default: 1\n");
+	printf("\t-i: inject errors. default: off\n");
+	printf("\t-l: one lock per cpu. default: one lock for all\n");
+	printf("\t-e: error parameters:\n");
+	printf("\t\tcpu,loop,interval,err_type_info,err_struct_info[,err_data_buffer[0],err_data_buffer[1],err_data_buffer[2]]\n");
+	printf("\t\t   cpu: logical cpu number the error will be inject in.\n");
+	printf("\t\t   loop: times the error will be injected.\n");
+	printf("\t\t   interval: In second. every so often one error is injected.\n");
+	printf("\t\t   err_type_info, err_struct_info: PAL parameters.\n");
+	printf("\t\t   err_data_buffer: PAL parameter. Optional. If not present,\n");
+	printf("\t\t                    it's constructed by tool automatically. Be\n");
+	printf("\t\t                    careful to provide err_data_buffer and make\n");
+	printf("\t\t                    sure it's working with the environment.\n");
+	printf("\t    Note:no space between error parameters.\n");
+	printf("\t    default: Take error parameters from err.conf instead of command line.\n");
+	printf("\t-v: verbose. default: off\n");
+	printf("\t-h: help\n\n");
+	printf("The tool will take err.conf file as ");
+	printf("input to inject single or multiple errors ");
+	printf("on one or multiple cpus in parallel.\n");
+  }
+
+  int main(int argc, char **argv)
+  {
+	char c;
+	int do_err_inj=0;
+	int do_query_all=0;
+	int count;
+	u32 m;
+
+	/* Default one lock for all cpu's */
+	one_lock=1;
+	while ((c = getopt(argc, argv, "m:iqvhle:")) != EOF)
+		switch (c) {
+			case 'm':	/* Procedure mode. 1: phys 2: virt */
+				count=sscanf(optarg, "%x", &m);
+				if (count!=1 || (m!=1 && m!=2)) {
+					printf("Wrong mode number.\n");
+					help();
+					return -1;
+				}
+				mode=m;
+				break;
+			case 'i':	/* Inject errors */
+				do_err_inj=1;
+				break;
+			case 'q':	/* Query */
+				do_query_all=1;
+				break;
+			case 'v':	/* Verbose */
+				verbose=1;
+				break;
+			case 'l':	/* One lock per cpu */
+				one_lock=0;
+				break;
+			case 'e':	/* error arguments */
+				/* Take parameters:
+				 * #cpu, loop, interval, err_type_info, err_struct_info[, err_data_buffer]
+				 * err_data_buffer is optional. Recommend not to specify
+				 * err_data_buffer. Better to use tool to generate it.
+				 */
+				count=sscanf(optarg,
+					"%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx\n",
+					&line_para.cpu,
+					&line_para.loop,
+					&line_para.interval,
+					&line_para.err_type_info,
+					&line_para.err_struct_info,
+					&line_para.err_data_buffer[0],
+					&line_para.err_data_buffer[1],
+					&line_para.err_data_buffer[2]);
+				if (count!=PARA_FIELD_NUM+3) {
+				    line_para.err_data_buffer[0]=-1,
+				    line_para.err_data_buffer[1]=-1,
+				    line_para.err_data_buffer[2]=-1;
+				    count=sscanf(optarg, "%lx, %lx, %lx, %lx, %lx\n",
+						&line_para.cpu,
+						&line_para.loop,
+						&line_para.interval,
+						&line_para.err_type_info,
+						&line_para.err_struct_info);
+				    if (count!=PARA_FIELD_NUM) {
+					printf("Wrong error arguments.\n");
+					help();
+					return -1;
+				    }
+				}
+				para=1;
+				break;
+			continue;
+				break;
+			case 'h':
+				help();
+				return 0;
+			default:
+				break;
+		}
+
+	if (do_query_all)
+		query_all_capabilities();
+	if (do_err_inj)
+		err_inj();
+
+	if (!do_query_all &&  !do_err_inj)
+		help();
+
+	return 0;
+  }
diff --git a/Documentation/arch/ia64/features.rst b/Documentation/arch/ia64/features.rst
new file mode 100644
index 000000000000..d7226fdcf5f8
--- /dev/null
+++ b/Documentation/arch/ia64/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features ia64
diff --git a/Documentation/arch/ia64/fsys.rst b/Documentation/arch/ia64/fsys.rst
new file mode 100644
index 000000000000..a702d2cc94b6
--- /dev/null
+++ b/Documentation/arch/ia64/fsys.rst
@@ -0,0 +1,303 @@
+===================================
+Light-weight System Calls for IA-64
+===================================
+
+		        Started: 13-Jan-2003
+
+		    Last update: 27-Sep-2003
+
+	              David Mosberger-Tang
+		      <davidm@hpl.hp.com>
+
+Using the "epc" instruction effectively introduces a new mode of
+execution to the ia64 linux kernel.  We call this mode the
+"fsys-mode".  To recap, the normal states of execution are:
+
+  - kernel mode:
+	Both the register stack and the memory stack have been
+	switched over to kernel memory.  The user-level state is saved
+	in a pt-regs structure at the top of the kernel memory stack.
+
+  - user mode:
+	Both the register stack and the kernel stack are in
+	user memory.  The user-level state is contained in the
+	CPU registers.
+
+  - bank 0 interruption-handling mode:
+	This is the non-interruptible state which all
+	interruption-handlers start execution in.  The user-level
+	state remains in the CPU registers and some kernel state may
+	be stored in bank 0 of registers r16-r31.
+
+In contrast, fsys-mode has the following special properties:
+
+  - execution is at privilege level 0 (most-privileged)
+
+  - CPU registers may contain a mixture of user-level and kernel-level
+    state (it is the responsibility of the kernel to ensure that no
+    security-sensitive kernel-level state is leaked back to
+    user-level)
+
+  - execution is interruptible and preemptible (an fsys-mode handler
+    can disable interrupts and avoid all other interruption-sources
+    to avoid preemption)
+
+  - neither the memory-stack nor the register-stack can be trusted while
+    in fsys-mode (they point to the user-level stacks, which may
+    be invalid, or completely bogus addresses)
+
+In summary, fsys-mode is much more similar to running in user-mode
+than it is to running in kernel-mode.  Of course, given that the
+privilege level is at level 0, this means that fsys-mode requires some
+care (see below).
+
+
+How to tell fsys-mode
+=====================
+
+Linux operates in fsys-mode when (a) the privilege level is 0 (most
+privileged) and (b) the stacks have NOT been switched to kernel memory
+yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
+three macros::
+
+	user_mode(regs)
+	user_stack(task,regs)
+	fsys_mode(task,regs)
+
+The "regs" argument is a pointer to a pt_regs structure.  The "task"
+argument is a pointer to the task structure to which the "regs"
+pointer belongs to.  user_mode() returns TRUE if the CPU state pointed
+to by "regs" was executing in user mode (privilege level 3).
+user_stack() returns TRUE if the state pointed to by "regs" was
+executing on the user-level stack(s).  Finally, fsys_mode() returns
+TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
+The fsys_mode() macro is equivalent to the expression::
+
+	!user_mode(regs) && user_stack(task,regs)
+
+How to write an fsyscall handler
+================================
+
+The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
+(fsyscall_table).  This table contains one entry for each system call.
+By default, a system call is handled by fsys_fallback_syscall().  This
+routine takes care of entering (full) kernel mode and calling the
+normal Linux system call handler.  For performance-critical system
+calls, it is possible to write a hand-tuned fsyscall_handler.  For
+example, fsys.S contains fsys_getpid(), which is a hand-tuned version
+of the getpid() system call.
+
+The entry and exit-state of an fsyscall handler is as follows:
+
+Machine state on entry to fsyscall handler
+------------------------------------------
+
+  ========= ===============================================================
+  r10	    0
+  r11	    saved ar.pfs (a user-level value)
+  r15	    system call number
+  r16	    "current" task pointer (in normal kernel-mode, this is in r13)
+  r32-r39   system call arguments
+  b6	    return address (a user-level value)
+  ar.pfs    previous frame-state (a user-level value)
+  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
+  -         all other registers may contain values passed in from user-mode
+  ========= ===============================================================
+
+Required machine state on exit to fsyscall handler
+--------------------------------------------------
+
+  ========= ===========================================================
+  r11	    saved ar.pfs (as passed into the fsyscall handler)
+  r15	    system call number (as passed into the fsyscall handler)
+  r32-r39   system call arguments (as passed into the fsyscall handler)
+  b6	    return address (as passed into the fsyscall handler)
+  ar.pfs    previous frame-state (as passed into the fsyscall handler)
+  ========= ===========================================================
+
+Fsyscall handlers can execute with very little overhead, but with that
+speed comes a set of restrictions:
+
+ * Fsyscall-handlers MUST check for any pending work in the flags
+   member of the thread-info structure and if any of the
+   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
+   doing a full system call (by calling fsys_fallback_syscall).
+
+ * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
+   r15, b6, and ar.pfs) because they will be needed in case of a
+   system call restart.  Of course, all "preserved" registers also
+   must be preserved, in accordance to the normal calling conventions.
+
+ * Fsyscall-handlers MUST check argument registers for containing a
+   NaT value before using them in any way that could trigger a
+   NaT-consumption fault.  If a system call argument is found to
+   contain a NaT value, an fsyscall-handler may return immediately
+   with r8=EINVAL, r10=-1.
+
+ * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
+   any other operation that would trigger mandatory RSE
+   (register-stack engine) traffic.
+
+ * Fsyscall-handlers MUST NOT write to any stacked registers because
+   it is not safe to assume that user-level called a handler with the
+   proper number of arguments.
+
+ * Fsyscall-handlers need to be careful when accessing per-CPU variables:
+   unless proper safe-guards are taken (e.g., interruptions are avoided),
+   execution may be pre-empted and resumed on another CPU at any given
+   time.
+
+ * Fsyscall-handlers must be careful not to leak sensitive kernel'
+   information back to user-level.  In particular, before returning to
+   user-level, care needs to be taken to clear any scratch registers
+   that could contain sensitive information (note that the current
+   task pointer is not considered sensitive: it's already exposed
+   through ar.k6).
+
+ * Fsyscall-handlers MUST NOT access user-memory without first
+   validating access-permission (this can be done typically via
+   probe.r.fault and/or probe.w.fault) and without guarding against
+   memory access exceptions (this can be done with the EX() macros
+   defined by asmmacro.h).
+
+The above restrictions may seem draconian, but remember that it's
+possible to trade off some of the restrictions by paying a slightly
+higher overhead.  For example, if an fsyscall-handler could benefit
+from the shadow register bank, it could temporarily disable PSR.i and
+PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
+needed.  In other words, following the above rules yields extremely
+fast system call execution (while fully preserving system call
+semantics), but there is also a lot of flexibility in handling more
+complicated cases.
+
+Signal handling
+===============
+
+The delivery of (asynchronous) signals must be delayed until fsys-mode
+is exited.  This is accomplished with the help of the lower-privilege
+transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
+checks whether the interrupted task was in fsys-mode and, if so, sets
+PSR.lp and returns immediately.  When fsys-mode is exited via the
+"br.ret" instruction that lowers the privilege level, a trap will
+occur.  The trap handler clears PSR.lp again and returns immediately.
+The kernel exit path then checks for and delivers any pending signals.
+
+PSR Handling
+============
+
+The "epc" instruction doesn't change the contents of PSR at all.  This
+is in contrast to a regular interruption, which clears almost all
+bits.  Because of that, some care needs to be taken to ensure things
+work as expected.  The following discussion describes how each PSR bit
+is handled.
+
+======= =======================================================================
+PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used
+	to ensure the CPU is in little-endian mode before the first
+	load/store instruction is executed.  PSR.be is normally NOT
+	restored upon return from an fsys-mode handler.  In other
+	words, user-level code must not rely on PSR.be being preserved
+	across a system call.
+PSR.up	Unchanged.
+PSR.ac	Unchanged.
+PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
+PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
+PSR.pk	Unchanged.
+PSR.dt	Unchanged.
+PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write-registers!
+PSR.sp	Unchanged.
+PSR.pp	Unchanged.
+PSR.di	Unchanged.
+PSR.si	Unchanged.
+PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware
+	breakpoint that triggers at any privilege level other than
+	3 (user-mode).
+PSR.lp	Unchanged.
+PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in
+	fsys-mode, the trap-handler modifies the saved machine state
+	such that execution resumes in the gate page at
+	syscall_via_break(), with privilege level 3.  Note: the
+	taken branch would occur on the branch invoking the
+	fsyscall-handler, at which point, by definition, a syscall
+	restart is still safe.  If the system call number is invalid,
+	the fsys-mode handler will return directly to user-level.  This
+	return will trigger a taken-branch trap, but since the trap is
+	taken _after_ restoring the privilege level, the CPU has already
+	left fsys-mode, so no special treatment is needed.
+PSR.rt	Unchanged.
+PSR.cpl	Cleared to 0.
+PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page).
+PSR.mc	Unchanged.
+PSR.it	Unchanged (guaranteed to be 1).
+PSR.id	Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.da	Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.dd	Unchanged.  Note: the ia64 linux kernel never sets this bit.
+PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to
+	be taken.  The trap handler then modifies the saved machine
+	state such that execution resumes in the gate page at
+	syscall_via_break(), with privilege level 3.
+PSR.ri	Unchanged.
+PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode
+	handler performed a speculative load that gets NaTted.  If so, this
+	would be the normal & expected behavior, so no special treatment is
+	needed.
+PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
+	Doing so requires clearing PSR.i and PSR.ic as well.
+PSR.ia	Unchanged.  Note: the ia64 linux kernel never sets this bit.
+======= =======================================================================
+
+Using fast system calls
+=======================
+
+To use fast system calls, userspace applications need simply call
+__kernel_syscall_via_epc().  For example
+
+-- example fgettimeofday() call --
+
+-- fgettimeofday.S --
+
+::
+
+  #include <asm/asmmacro.h>
+
+  GLOBAL_ENTRY(fgettimeofday)
+  .prologue
+  .save ar.pfs, r11
+  mov r11 = ar.pfs
+  .body
+
+  mov r2 = 0xa000000000020660;;  // gate address
+			       // found by inspection of System.map for the
+			       // __kernel_syscall_via_epc() function.  See
+			       // below for how to do this for real.
+
+  mov b7 = r2
+  mov r15 = 1087		       // gettimeofday syscall
+  ;;
+  br.call.sptk.many b6 = b7
+  ;;
+
+  .restore sp
+
+  mov ar.pfs = r11
+  br.ret.sptk.many rp;;	      // return to caller
+  END(fgettimeofday)
+
+-- end fgettimeofday.S --
+
+In reality, getting the gate address is accomplished by two extra
+values passed via the ELF auxiliary vector (include/asm-ia64/elf.h)
+
+ * AT_SYSINFO : is the address of __kernel_syscall_via_epc()
+ * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO
+
+The ELF DSO is a pre-linked library that is mapped in by the kernel at
+the gate page.  It is a proper ELF shared object so, with a dynamic
+loader that recognises the library, you should be able to make calls to
+the exported functions within it as with any other shared library.
+AT_SYSINFO points into the kernel DSO at the
+__kernel_syscall_via_epc() function for historical reasons (it was
+used before the kernel DSO) and as a convenience.
diff --git a/Documentation/arch/ia64/ia64.rst b/Documentation/arch/ia64/ia64.rst
new file mode 100644
index 000000000000..b725019a9492
--- /dev/null
+++ b/Documentation/arch/ia64/ia64.rst
@@ -0,0 +1,49 @@
+===========================================
+Linux kernel release for the IA-64 Platform
+===========================================
+
+   These are the release notes for Linux since version 2.4 for IA-64
+   platform.  This document provides information specific to IA-64
+   ONLY, to get additional information about the Linux kernel also
+   read the original Linux README provided with the kernel.
+
+Installing the Kernel
+=====================
+
+ - IA-64 kernel installation is the same as the other platforms, see
+   original README for details.
+
+
+Software Requirements
+=====================
+
+   Compiling and running this kernel requires an IA-64 compliant GCC
+   compiler.  And various software packages also compiled with an
+   IA-64 compliant GCC compiler.
+
+
+Configuring the Kernel
+======================
+
+   Configuration is the same, see original README for details.
+
+
+Compiling the Kernel:
+
+ - Compiling this kernel doesn't differ from other platform so read
+   the original README for details BUT make sure you have an IA-64
+   compliant GCC compiler.
+
+IA-64 Specifics
+===============
+
+ - General issues:
+
+    * Hardly any performance tuning has been done. Obvious targets
+      include the library routines (IP checksum, etc.). Less
+      obvious targets include making sure we don't flush the TLB
+      needlessly, etc.
+
+    * SMP locks cleanup/optimization
+
+    * IA32 support.  Currently experimental.  It mostly works.
diff --git a/Documentation/arch/ia64/index.rst b/Documentation/arch/ia64/index.rst
new file mode 100644
index 000000000000..761f2154dfa2
--- /dev/null
+++ b/Documentation/arch/ia64/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+IA-64 Architecture
+==================
+
+.. toctree::
+   :maxdepth: 1
+
+   ia64
+   aliasing
+   efirtc
+   err_inject
+   fsys
+   irq-redir
+   mca
+   serial
+
+   features
diff --git a/Documentation/arch/ia64/irq-redir.rst b/Documentation/arch/ia64/irq-redir.rst
new file mode 100644
index 000000000000..6bbbbe4f73ef
--- /dev/null
+++ b/Documentation/arch/ia64/irq-redir.rst
@@ -0,0 +1,80 @@
+==============================
+IRQ affinity on IA64 platforms
+==============================
+
+07.01.2002, Erich Focht <efocht@ess.nec.de>
+
+
+By writing to /proc/irq/IRQ#/smp_affinity the interrupt routing can be
+controlled. The behavior on IA64 platforms is slightly different from
+that described in Documentation/core-api/irq/irq-affinity.rst for i386 systems.
+
+Because of the usage of SAPIC mode and physical destination mode the
+IRQ target is one particular CPU and cannot be a mask of several
+CPUs. Only the first non-zero bit is taken into account.
+
+
+Usage examples
+==============
+
+The target CPU has to be specified as a hexadecimal CPU mask. The
+first non-zero bit is the selected CPU. This format has been kept for
+compatibility reasons with i386.
+
+Set the delivery mode of interrupt 41 to fixed and route the
+interrupts to CPU #3 (logical CPU number) (2^3=0x08)::
+
+     echo "8" >/proc/irq/41/smp_affinity
+
+Set the default route for IRQ number 41 to CPU 6 in lowest priority
+delivery mode (redirectable)::
+
+     echo "r 40" >/proc/irq/41/smp_affinity
+
+The output of the command::
+
+     cat /proc/irq/IRQ#/smp_affinity
+
+gives the target CPU mask for the specified interrupt vector. If the CPU
+mask is preceded by the character "r", the interrupt is redirectable
+(i.e. lowest priority mode routing is used), otherwise its route is
+fixed.
+
+
+
+Initialization and default behavior
+===================================
+
+If the platform features IRQ redirection (info provided by SAL) all
+IO-SAPIC interrupts are initialized with CPU#0 as their default target
+and the routing is the so called "lowest priority mode" (actually
+fixed SAPIC mode with hint). The XTP chipset registers are used as hints
+for the IRQ routing. Currently in Linux XTP registers can have three
+values:
+
+	- minimal for an idle task,
+	- normal if any other task runs,
+	- maximal if the CPU is going to be switched off.
+
+The IRQ is routed to the CPU with lowest XTP register value, the
+search begins at the default CPU. Therefore most of the interrupts
+will be handled by CPU #0.
+
+If the platform doesn't feature interrupt redirection IOSAPIC fixed
+routing is used. The target CPUs are distributed in a round robin
+manner. IRQs will be routed only to the selected target CPUs. Check
+with::
+
+        cat /proc/interrupts
+
+
+
+Comments
+========
+
+On large (multi-node) systems it is recommended to route the IRQs to
+the node to which the corresponding device is connected.
+For systems like the NEC AzusA we get IRQ node-affinity for free. This
+is because usually the chipsets on each node redirect the interrupts
+only to their own CPUs (as they cannot see the XTP registers on the
+other nodes).
diff --git a/Documentation/arch/ia64/mca.rst b/Documentation/arch/ia64/mca.rst
new file mode 100644
index 000000000000..08270bba44a4
--- /dev/null
+++ b/Documentation/arch/ia64/mca.rst
@@ -0,0 +1,198 @@
+=============================================================
+An ad-hoc collection of notes on IA64 MCA and INIT processing
+=============================================================
+
+Feel free to update it with notes about any area that is not clear.
+
+---
+
+MCA/INIT are completely asynchronous.  They can occur at any time, when
+the OS is in any state.  Including when one of the cpus is already
+holding a spinlock.  Trying to get any lock from MCA/INIT state is
+asking for deadlock.  Also the state of structures that are protected
+by locks is indeterminate, including linked lists.
+
+---
+
+The complicated ia64 MCA process.  All of this is mandated by Intel's
+specification for ia64 SAL, error recovery and unwind, it is not as
+if we have a choice here.
+
+* MCA occurs on one cpu, usually due to a double bit memory error.
+  This is the monarch cpu.
+
+* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
+  to all the other cpus, the slaves.
+
+* Slave cpus that receive the MCA interrupt call down into SAL, they
+  end up spinning disabled while the MCA is being serviced.
+
+* If any slave cpu was already spinning disabled when the MCA occurred
+  then it cannot service the MCA interrupt.  SAL waits ~20 seconds then
+  sends an unmaskable INIT event to the slave cpus that have not
+  already rendezvoused.
+
+* Because MCA/INIT can be delivered at any time, including when the cpu
+  is down in PAL in physical mode, the registers at the time of the
+  event are _completely_ undefined.  In particular the MCA/INIT
+  handlers cannot rely on the thread pointer, PAL physical mode can
+  (and does) modify TP.  It is allowed to do that as long as it resets
+  TP on return.  However MCA/INIT events expose us to these PAL
+  internal TP changes.  Hence curr_task().
+
+* If an MCA/INIT event occurs while the kernel was running (not user
+  space) and the kernel has called PAL then the MCA/INIT handler cannot
+  assume that the kernel stack is in a fit state to be used.  Mainly
+  because PAL may or may not maintain the stack pointer internally.
+  Because the MCA/INIT handlers cannot trust the kernel stack, they
+  have to use their own, per-cpu stacks.  The MCA/INIT stacks are
+  preformatted with just enough task state to let the relevant handlers
+  do their job.
+
+* Unlike most other architectures, the ia64 struct task is embedded in
+  the kernel stack[1].  So switching to a new kernel stack means that
+  we switch to a new task as well.  Because various bits of the kernel
+  assume that current points into the struct task, switching to a new
+  stack also means a new value for current.
+
+* Once all slaves have rendezvoused and are spinning disabled, the
+  monarch is entered.  The monarch now tries to diagnose the problem
+  and decide if it can recover or not.
+
+* Part of the monarch's job is to look at the state of all the other
+  tasks.  The only way to do that on ia64 is to call the unwinder,
+  as mandated by Intel.
+
+* The starting point for the unwind depends on whether a task is
+  running or not.  That is, whether it is on a cpu or is blocked.  The
+  monarch has to determine whether or not a task is on a cpu before it
+  knows how to start unwinding it.  The tasks that received an MCA or
+  INIT event are no longer running, they have been converted to blocked
+  tasks.  But (and its a big but), the cpus that received the MCA
+  rendezvous interrupt are still running on their normal kernel stacks!
+
+* To distinguish between these two cases, the monarch must know which
+  tasks are on a cpu and which are not.  Hence each slave cpu that
+  switches to an MCA/INIT stack, registers its new stack using
+  set_curr_task(), so the monarch can tell that the _original_ task is
+  no longer running on that cpu.  That gives us a decent chance of
+  getting a valid backtrace of the _original_ task.
+
+* MCA/INIT can be nested, to a depth of 2 on any cpu.  In the case of a
+  nested error, we want diagnostics on the MCA/INIT handler that
+  failed, not on the task that was originally running.  Again this
+  requires set_curr_task() so the MCA/INIT handlers can register their
+  own stack as running on that cpu.  Then a recursive error gets a
+  trace of the failing handler's "task".
+
+[1]
+    My (Keith Owens) original design called for ia64 to separate its
+    struct task and the kernel stacks.  Then the MCA/INIT data would be
+    chained stacks like i386 interrupt stacks.  But that required
+    radical surgery on the rest of ia64, plus extra hard wired TLB
+    entries with its associated performance degradation.  David
+    Mosberger vetoed that approach.  Which meant that separate kernel
+    stacks meant separate "tasks" for the MCA/INIT handlers.
+
+---
+
+INIT is less complicated than MCA.  Pressing the nmi button or using
+the equivalent command on the management console sends INIT to all
+cpus.  SAL picks one of the cpus as the monarch and the rest are
+slaves.  All the OS INIT handlers are entered at approximately the same
+time.  The OS monarch prints the state of all tasks and returns, after
+which the slaves return and the system resumes.
+
+At least that is what is supposed to happen.  Alas there are broken
+versions of SAL out there.  Some drive all the cpus as monarchs.  Some
+drive them all as slaves.  Some drive one cpu as monarch, wait for that
+cpu to return from the OS then drive the rest as slaves.  Some versions
+of SAL cannot even cope with returning from the OS, they spin inside
+SAL on resume.  The OS INIT code has workarounds for some of these
+broken SAL symptoms, but some simply cannot be fixed from the OS side.
+
+---
+
+The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
+violations.  Unfortunately MCA/INIT start off as massive layer
+violations (can occur at _any_ time) and they build from there.
+
+At least ia64 makes an attempt at recovering from hardware errors, but
+it is a difficult problem because of the asynchronous nature of these
+errors.  When processing an unmaskable interrupt we sometimes need
+special code to cope with our inability to take any locks.
+
+---
+
+How is ia64 MCA/INIT different from x86 NMI?
+
+* x86 NMI typically gets delivered to one cpu.  MCA/INIT gets sent to
+  all cpus.
+
+* x86 NMI cannot be nested.  MCA/INIT can be nested, to a depth of 2
+  per cpu.
+
+* x86 has a separate struct task which points to one of multiple kernel
+  stacks.  ia64 has the struct task embedded in the single kernel
+  stack, so switching stack means switching task.
+
+* x86 does not call the BIOS so the NMI handler does not have to worry
+  about any registers having changed.  MCA/INIT can occur while the cpu
+  is in PAL in physical mode, with undefined registers and an undefined
+  kernel stack.
+
+* i386 backtrace is not very sensitive to whether a process is running
+  or not.  ia64 unwind is very, very sensitive to whether a process is
+  running or not.
+
+---
+
+What happens when MCA/INIT is delivered what a cpu is running user
+space code?
+
+The user mode registers are stored in the RSE area of the MCA/INIT on
+entry to the OS and are restored from there on return to SAL, so user
+mode registers are preserved across a recoverable MCA/INIT.  Since the
+OS has no idea what unwind data is available for the user space stack,
+MCA/INIT never tries to backtrace user space.  Which means that the OS
+does not bother making the user space process look like a blocked task,
+i.e. the OS does not copy pt_regs and switch_stack to the user space
+stack.  Also the OS has no idea how big the user space RSE and memory
+stacks are, which makes it too risky to copy the saved state to a user
+mode stack.
+
+---
+
+How do we get a backtrace on the tasks that were running when MCA/INIT
+was delivered?
+
+mca.c:::ia64_mca_modify_original_stack().  That identifies and
+verifies the original kernel stack, copies the dirty registers from
+the MCA/INIT stack's RSE to the original stack's RSE, copies the
+skeleton struct pt_regs and switch_stack to the original stack, fills
+in the skeleton structures from the PAL minstate area and updates the
+original stack's thread.ksp.  That makes the original stack look
+exactly like any other blocked task, i.e. it now appears to be
+sleeping.  To get a backtrace, just start with thread.ksp for the
+original task and unwind like any other sleeping task.
+
+---
+
+How do we identify the tasks that were running when MCA/INIT was
+delivered?
+
+If the previous task has been verified and converted to a blocked
+state, then sos->prev_task on the MCA/INIT stack is updated to point to
+the previous task.  You can look at that field in dumps or debuggers.
+To help distinguish between the handler and the original tasks,
+handlers have _TIF_MCA_INIT set in thread_info.flags.
+
+The sos data is always in the MCA/INIT handler stack, at offset
+MCA_SOS_OFFSET.  You can get that value from mca_asm.h or calculate it
+as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
+ia64_sal_os_state), with 16 byte alignment for all structures.
+
+Also the comm field of the MCA/INIT task is modified to include the pid
+of the original task, for humans to use.  For example, a comm field of
+'MCA 12159' means that pid 12159 was running when the MCA was
+delivered.
diff --git a/Documentation/arch/ia64/serial.rst b/Documentation/arch/ia64/serial.rst
new file mode 100644
index 000000000000..1de70c305a79
--- /dev/null
+++ b/Documentation/arch/ia64/serial.rst
@@ -0,0 +1,165 @@
+==============
+Serial Devices
+==============
+
+Serial Device Naming
+====================
+
+    As of 2.6.10, serial devices on ia64 are named based on the
+    order of ACPI and PCI enumeration.  The first device in the
+    ACPI namespace (if any) becomes /dev/ttyS0, the second becomes
+    /dev/ttyS1, etc., and PCI devices are named sequentially
+    starting after the ACPI devices.
+
+    Prior to 2.6.10, there were confusing exceptions to this:
+
+	- Firmware on some machines (mostly from HP) provides an HCDP
+	  table[1] that tells the kernel about devices that can be used
+	  as a serial console.  If the user specified "console=ttyS0"
+	  or the EFI ConOut path contained only UART devices, the
+	  kernel registered the device described by the HCDP as
+	  /dev/ttyS0.
+
+	- If there was no HCDP, we assumed there were UARTs at the
+	  legacy COM port addresses (I/O ports 0x3f8 and 0x2f8), so
+	  the kernel registered those as /dev/ttyS0 and /dev/ttyS1.
+
+    Any additional ACPI or PCI devices were registered sequentially
+    after /dev/ttyS0 as they were discovered.
+
+    With an HCDP, device names changed depending on EFI configuration
+    and "console=" arguments.  Without an HCDP, device names didn't
+    change, but we registered devices that might not really exist.
+
+    For example, an HP rx1600 with a single built-in serial port
+    (described in the ACPI namespace) plus an MP[2] (a PCI device) has
+    these ports:
+
+      ==========  ==========     ============    ============   =======
+      Type        MMIO           pre-2.6.10      pre-2.6.10     2.6.10+
+		  address
+				 (EFI console    (EFI console
+                                 on builtin)     on MP port)
+      ==========  ==========     ============    ============   =======
+      builtin     0xff5e0000        ttyS0           ttyS1         ttyS0
+      MP UPS      0xf8031000        ttyS1           ttyS2         ttyS1
+      MP Console  0xf8030000        ttyS2           ttyS0         ttyS2
+      MP 2        0xf8030010        ttyS3           ttyS3         ttyS3
+      MP 3        0xf8030038        ttyS4           ttyS4         ttyS4
+      ==========  ==========     ============    ============   =======
+
+Console Selection
+=================
+
+    EFI knows what your console devices are, but it doesn't tell the
+    kernel quite enough to actually locate them.  The DIG64 HCDP
+    table[1] does tell the kernel where potential serial console
+    devices are, but not all firmware supplies it.  Also, EFI supports
+    multiple simultaneous consoles and doesn't tell the kernel which
+    should be the "primary" one.
+
+    So how do you tell Linux which console device to use?
+
+	- If your firmware supplies the HCDP, it is simplest to
+	  configure EFI with a single device (either a UART or a VGA
+	  card) as the console.  Then you don't need to tell Linux
+	  anything; the kernel will automatically use the EFI console.
+
+	  (This works only in 2.6.6 or later; prior to that you had
+	  to specify "console=ttyS0" to get a serial console.)
+
+	- Without an HCDP, Linux defaults to a VGA console unless you
+	  specify a "console=" argument.
+
+    NOTE: Don't assume that a serial console device will be /dev/ttyS0.
+    It might be ttyS1, ttyS2, etc.  Make sure you have the appropriate
+    entries in /etc/inittab (for getty) and /etc/securetty (to allow
+    root login).
+
+Early Serial Console
+====================
+
+    The kernel can't start using a serial console until it knows where
+    the device lives.  Normally this happens when the driver enumerates
+    all the serial devices, which can happen a minute or more after the
+    kernel starts booting.
+
+    2.6.10 and later kernels have an "early uart" driver that works
+    very early in the boot process.  The kernel will automatically use
+    this if the user supplies an argument like "console=uart,io,0x3f8",
+    or if the EFI console path contains only a UART device and the
+    firmware supplies an HCDP.
+
+Troubleshooting Serial Console Problems
+=======================================
+
+    No kernel output after elilo prints "Uncompressing Linux... done":
+
+	- You specified "console=ttyS0" but Linux changed the device
+	  to which ttyS0 refers.  Configure exactly one EFI console
+	  device[3] and remove the "console=" option.
+
+	- The EFI console path contains both a VGA device and a UART.
+	  EFI and elilo use both, but Linux defaults to VGA.  Remove
+	  the VGA device from the EFI console path[3].
+
+	- Multiple UARTs selected as EFI console devices.  EFI and
+	  elilo use all selected devices, but Linux uses only one.
+	  Make sure only one UART is selected in the EFI console
+	  path[3].
+
+	- You're connected to an HP MP port[2] but have a non-MP UART
+	  selected as EFI console device.  EFI uses the MP as a
+	  console device even when it isn't explicitly selected.
+	  Either move the console cable to the non-MP UART, or change
+	  the EFI console path[3] to the MP UART.
+
+    Long pause (60+ seconds) between "Uncompressing Linux... done" and
+    start of kernel output:
+
+	- No early console because you used "console=ttyS<n>".  Remove
+	  the "console=" option if your firmware supplies an HCDP.
+
+	- If you don't have an HCDP, the kernel doesn't know where
+	  your console lives until the driver discovers serial
+	  devices.  Use "console=uart,io,0x3f8" (or appropriate
+	  address for your machine).
+
+    Kernel and init script output works fine, but no "login:" prompt:
+
+	- Add getty entry to /etc/inittab for console tty.  Look for
+	  the "Adding console on ttyS<n>" message that tells you which
+	  device is the console.
+
+    "login:" prompt, but can't login as root:
+
+	- Add entry to /etc/securetty for console tty.
+
+    No ACPI serial devices found in 2.6.17 or later:
+
+	- Turn on CONFIG_PNP and CONFIG_PNPACPI.  Prior to 2.6.17, ACPI
+	  serial devices were discovered by 8250_acpi.  In 2.6.17,
+	  8250_acpi was replaced by the combination of 8250_pnp and
+	  CONFIG_PNPACPI.
+
+
+
+[1]
+    http://www.dig64.org/specifications/agreement
+    The table was originally defined as the "HCDP" for "Headless
+    Console/Debug Port."  The current version is the "PCDP" for
+    "Primary Console and Debug Port Devices."
+
+[2]
+    The HP MP (management processor) is a PCI device that provides
+    several UARTs.  One of the UARTs is often used as a console; the
+    EFI Boot Manager identifies it as "Acpi(HWP0002,700)/Pci(...)/Uart".
+    The external connection is usually a 25-pin connector, and a
+    special dongle converts that to three 9-pin connectors, one of
+    which is labelled "Console."
+
+[3]
+    EFI console devices are configured using the EFI Boot Manager
+    "Boot option maintenance" menu.  You may have to interrupt the
+    boot sequence to use this menu, and you will have to reset the
+    box after changing console configuration.
diff --git a/Documentation/arch/index.rst b/Documentation/arch/index.rst
new file mode 100644
index 000000000000..80ee31016584
--- /dev/null
+++ b/Documentation/arch/index.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+CPU Architectures
+=================
+
+These books provide programming details about architecture-specific
+implementation.
+
+.. toctree::
+   :maxdepth: 2
+
+   arc/index
+   ../arm/index
+   ../arm64/index
+   ia64/index
+   ../loongarch/index
+   m68k/index
+   ../mips/index
+   nios2/index
+   openrisc/index
+   parisc/index
+   ../powerpc/index
+   ../riscv/index
+   ../s390/index
+   sh/index
+   sparc/index
+   x86/index
+   xtensa/index
diff --git a/Documentation/arch/m68k/buddha-driver.rst b/Documentation/arch/m68k/buddha-driver.rst
new file mode 100644
index 000000000000..20e401413991
--- /dev/null
+++ b/Documentation/arch/m68k/buddha-driver.rst
@@ -0,0 +1,209 @@
+=====================================
+Amiga Buddha and Catweasel IDE Driver
+=====================================
+
+The Amiga Buddha and Catweasel IDE Driver (part of ide.c) was written by
+Geert Uytterhoeven based on the following specifications:
+
+------------------------------------------------------------------------
+
+Register map of the Buddha IDE controller and the
+Buddha-part of the Catweasel Zorro-II version
+
+The Autoconfiguration has been implemented just as Commodore
+described  in  their  manuals, no tricks have been used (for
+example leaving some address lines out of the equations...).
+If you want to configure the board yourself (for example let
+a  Linux  kernel  configure the card), look at the Commodore
+Docs.  Reading the nibbles should give this information::
+
+  Vendor number: 4626 ($1212)
+  product number: 0 (42 for Catweasel Z-II)
+  Serial number: 0
+  Rom-vector: $1000
+
+The  card  should be a Z-II board, size 64K, not for freemem
+list, Rom-Vektor is valid, no second Autoconfig-board on the
+same card, no space preference, supports "Shutup_forever".
+
+Setting  the  base address should be done in two steps, just
+as  the Amiga Kickstart does:  The lower nibble of the 8-Bit
+address is written to $4a, then the whole Byte is written to
+$48, while it doesn't matter how often you're writing to $4a
+as  long as $48 is not touched.  After $48 has been written,
+the  whole card disappears from $e8 and is mapped to the new
+address just written.  Make sure $4a is written before $48,
+otherwise your chance is only 1:16 to find the board :-).
+
+The local memory-map is even active when mapped to $e8:
+
+==============  ===========================================
+$0-$7e		Autokonfig-space, see Z-II docs.
+
+$80-$7fd	reserved
+
+$7fe		Speed-select Register: Read & Write
+		(description see further down)
+
+$800-$8ff	IDE-Select 0 (Port 0, Register set 0)
+
+$900-$9ff	IDE-Select 1 (Port 0, Register set 1)
+
+$a00-$aff	IDE-Select 2 (Port 1, Register set 0)
+
+$b00-$bff	IDE-Select 3 (Port 1, Register set 1)
+
+$c00-$cff	IDE-Select 4 (Port 2, Register set 0,
+                Catweasel only!)
+
+$d00-$dff	IDE-Select 5 (Port 3, Register set 1,
+		Catweasel only!)
+
+$e00-$eff	local expansion port, on Catweasel Z-II the
+		Catweasel registers are also mapped here.
+		Never touch, use multidisk.device!
+
+$f00		read only, Byte-access: Bit 7 shows the
+		level of the IRQ-line of IDE port 0.
+
+$f01-$f3f	mirror of $f00
+
+$f40		read only, Byte-access: Bit 7 shows the
+		level of the IRQ-line of IDE port 1.
+
+$f41-$f7f	mirror of $f40
+
+$f80		read only, Byte-access: Bit 7 shows the
+		level of the IRQ-line of IDE port 2.
+		(Catweasel only!)
+
+$f81-$fbf	mirror of $f80
+
+$fc0		write-only: Writing any value to this
+		register enables IRQs to be passed from the
+		IDE ports to the Zorro bus. This mechanism
+		has been implemented to be compatible with
+		harddisks that are either defective or have
+		a buggy firmware and pull the IRQ line up
+		while starting up. If interrupts would
+		always be passed to the bus, the computer
+		might not start up. Once enabled, this flag
+		can not be disabled again. The level of the
+		flag can not be determined by software
+		(what for? Write to me if it's necessary!).
+
+$fc1-$fff	mirror of $fc0
+
+$1000-$ffff	Buddha-Rom with offset $1000 in the rom
+		chip. The addresses $0 to $fff of the rom
+		chip cannot be read. Rom is Byte-wide and
+		mapped to even addresses.
+==============  ===========================================
+
+The  IDE ports issue an INT2.  You can read the level of the
+IRQ-lines  of  the  IDE-ports by reading from the three (two
+for  Buddha-only)  registers  $f00, $f40 and $f80.  This way
+more  than one I/O request can be handled and you can easily
+determine  what  driver  has  to serve the INT2.  Buddha and
+Catweasel  expansion  boards  can issue an INT6.  A separate
+memory  map  is available for the I/O module and the sysop's
+I/O module.
+
+The IDE ports are fed by the address lines A2 to A4, just as
+the  Amiga  1200  and  Amiga  4000  IDE ports are.  This way
+existing  drivers  can be easily ported to Buddha.  A move.l
+polls  two  words  out of the same address of IDE port since
+every  word  is  mirrored  once.  movem is not possible, but
+it's  not  necessary  either,  because  you can only speedup
+68000  systems  with  this  technique.   A 68020 system with
+fastmem is faster with move.l.
+
+If you're using the mirrored registers of the IDE-ports with
+A6=1,  the Buddha doesn't care about the speed that you have
+selected  in  the  speed  register (see further down).  With
+A6=1  (for example $840 for port 0, register set 0), a 780ns
+access  is being made.  These registers should be used for a
+command   access   to  the  harddisk/CD-Rom,  since  command
+accesses  are Byte-wide and have to be made slower according
+to the ATA-X3T9 manual.
+
+Now  for the speed-register:  The register is byte-wide, and
+only  the  upper  three  bits are used (Bits 7 to 5).  Bit 4
+must  always  be set to 1 to be compatible with later Buddha
+versions  (if  I'll  ever  update this one).  I presume that
+I'll  never use the lower four bits, but they have to be set
+to 1 by definition.
+
+The  values in this table have to be shifted 5 bits to the
+left and or'd with $1f (this sets the lower 5 bits).
+
+All  the timings have in common:  Select and IOR/IOW rise at
+the  same  time.   IOR  and  IOW have a propagation delay of
+about  30ns  to  the clocks on the Zorro bus, that's why the
+values  are no multiple of 71.  One clock-cycle is 71ns long
+(exactly 70,5 at 14,18 Mhz on PAL systems).
+
+value 0 (Default after reset)
+  497ns Select (7 clock cycles) , IOR/IOW after 172ns (2 clock cycles)
+  (same timing as the Amiga 1200 does on it's IDE port without
+  accelerator card)
+
+value 1
+  639ns Select (9 clock cycles), IOR/IOW after 243ns (3 clock cycles)
+
+value 2
+  781ns Select (11 clock cycles), IOR/IOW after 314ns (4 clock cycles)
+
+value 3
+  355ns Select (5 clock cycles), IOR/IOW after 101ns (1 clock cycle)
+
+value 4
+  355ns Select (5 clock cycles), IOR/IOW after 172ns (2 clock cycles)
+
+value 5
+  355ns Select (5 clock cycles), IOR/IOW after 243ns (3 clock cycles)
+
+value 6
+  1065ns Select (15 clock cycles), IOR/IOW after 314ns (4 clock cycles)
+
+value 7
+  355ns Select, (5 clock cycles), IOR/IOW after 101ns (1 clock cycle)
+
+When accessing IDE registers with A6=1 (for example $84x),
+the timing will always be mode 0 8-bit compatible, no matter
+what you have selected in the speed register:
+
+781ns select, IOR/IOW after 4 clock cycles (=314ns) aktive.
+
+All  the  timings with a very short select-signal (the 355ns
+fast  accesses)  depend  on the accelerator card used in the
+system:  Sometimes two more clock cycles are inserted by the
+bus  interface,  making  the  whole access 497ns long.  This
+doesn't  affect  the  reliability  of the controller nor the
+performance  of  the  card,  since  this doesn't happen very
+often.
+
+All  the  timings  are  calculated  and  only  confirmed  by
+measurements  that allowed me to count the clock cycles.  If
+the  system  is clocked by an oscillator other than 28,37516
+Mhz  (for  example  the  NTSC-frequency  28,63636 Mhz), each
+clock  cycle is shortened to a bit less than 70ns (not worth
+mentioning).   You  could think of a small performance boost
+by  overclocking  the  system,  but  you would either need a
+multisync  monitor,  or  a  graphics card, and your internal
+diskdrive would go crazy, that's why you shouldn't tune your
+Amiga this way.
+
+Giving  you  the  possibility  to  write  software  that  is
+compatible  with both the Buddha and the Catweasel Z-II, The
+Buddha  acts  just  like  a  Catweasel  Z-II  with no device
+connected  to  the  third  IDE-port.   The IRQ-register $f80
+always  shows a "no IRQ here" on the Buddha, and accesses to
+the  third  IDE  port  are  going into data's Nirwana on the
+Buddha.
+
+Jens Schönfeld february 19th, 1997
+
+updated may 27th, 1997
+
+eMail: sysop@nostlgic.tng.oche.de
diff --git a/Documentation/arch/m68k/features.rst b/Documentation/arch/m68k/features.rst
new file mode 100644
index 000000000000..5107a2119472
--- /dev/null
+++ b/Documentation/arch/m68k/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features m68k
diff --git a/Documentation/arch/m68k/index.rst b/Documentation/arch/m68k/index.rst
new file mode 100644
index 000000000000..0f890dbb5fe2
--- /dev/null
+++ b/Documentation/arch/m68k/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+m68k Architecture
+=================
+
+.. toctree::
+   :maxdepth: 2
+
+   kernel-options
+   buddha-driver
+
+   features
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/arch/m68k/kernel-options.rst b/Documentation/arch/m68k/kernel-options.rst
new file mode 100644
index 000000000000..2008a20b4329
--- /dev/null
+++ b/Documentation/arch/m68k/kernel-options.rst
@@ -0,0 +1,911 @@
+===================================
+Command Line Options for Linux/m68k
+===================================
+
+Last Update: 2 May 1999
+
+Linux/m68k version: 2.2.6
+
+Author: Roman.Hodek@informatik.uni-erlangen.de (Roman Hodek)
+
+Update: jds@kom.auc.dk (Jes Sorensen) and faq@linux-m68k.org (Chris Lawrence)
+
+0) Introduction
+===============
+
+Often I've been asked which command line options the Linux/m68k
+kernel understands, or how the exact syntax for the ... option is, or
+... about the option ... . I hope, this document supplies all the
+answers...
+
+Note that some options might be outdated, their descriptions being
+incomplete or missing. Please update the information and send in the
+patches.
+
+
+1) Overview of the Kernel's Option Processing
+=============================================
+
+The kernel knows three kinds of options on its command line:
+
+  1) kernel options
+  2) environment settings
+  3) arguments for init
+
+To which of these classes an argument belongs is determined as
+follows: If the option is known to the kernel itself, i.e. if the name
+(the part before the '=') or, in some cases, the whole argument string
+is known to the kernel, it belongs to class 1. Otherwise, if the
+argument contains an '=', it is of class 2, and the definition is put
+into init's environment. All other arguments are passed to init as
+command line options.
+
+This document describes the valid kernel options for Linux/m68k in
+the version mentioned at the start of this file. Later revisions may
+add new such options, and some may be missing in older versions.
+
+In general, the value (the part after the '=') of an option is a
+list of values separated by commas. The interpretation of these values
+is up to the driver that "owns" the option. This association of
+options with drivers is also the reason that some are further
+subdivided.
+
+
+2) General Kernel Options
+=========================
+
+2.1) root=
+----------
+
+:Syntax: root=/dev/<device>
+:or:     root=<hex_number>
+
+This tells the kernel which device it should mount as the root
+filesystem. The device must be a block device with a valid filesystem
+on it.
+
+The first syntax gives the device by name. These names are converted
+into a major/minor number internally in the kernel in an unusual way.
+Normally, this "conversion" is done by the device files in /dev, but
+this isn't possible here, because the root filesystem (with /dev)
+isn't mounted yet... So the kernel parses the name itself, with some
+hardcoded name to number mappings. The name must always be a
+combination of two or three letters, followed by a decimal number.
+Valid names are::
+
+  /dev/ram: -> 0x0100 (initial ramdisk)
+  /dev/hda: -> 0x0300 (first IDE disk)
+  /dev/hdb: -> 0x0340 (second IDE disk)
+  /dev/sda: -> 0x0800 (first SCSI disk)
+  /dev/sdb: -> 0x0810 (second SCSI disk)
+  /dev/sdc: -> 0x0820 (third SCSI disk)
+  /dev/sdd: -> 0x0830 (forth SCSI disk)
+  /dev/sde: -> 0x0840 (fifth SCSI disk)
+  /dev/fd : -> 0x0200 (floppy disk)
+
+The name must be followed by a decimal number, that stands for the
+partition number. Internally, the value of the number is just
+added to the device number mentioned in the table above. The
+exceptions are /dev/ram and /dev/fd, where /dev/ram refers to an
+initial ramdisk loaded by your bootstrap program (please consult the
+instructions for your bootstrap program to find out how to load an
+initial ramdisk). As of kernel version 2.0.18 you must specify
+/dev/ram as the root device if you want to boot from an initial
+ramdisk. For the floppy devices, /dev/fd, the number stands for the
+floppy drive number (there are no partitions on floppy disks). I.e.,
+/dev/fd0 stands for the first drive, /dev/fd1 for the second, and so
+on. Since the number is just added, you can also force the disk format
+by adding a number greater than 3. If you look into your /dev
+directory, use can see the /dev/fd0D720 has major 2 and minor 16. You
+can specify this device for the root FS by writing "root=/dev/fd16" on
+the kernel command line.
+
+[Strange and maybe uninteresting stuff ON]
+
+This unusual translation of device names has some strange
+consequences: If, for example, you have a symbolic link from /dev/fd
+to /dev/fd0D720 as an abbreviation for floppy driver #0 in DD format,
+you cannot use this name for specifying the root device, because the
+kernel cannot see this symlink before mounting the root FS and it
+isn't in the table above. If you use it, the root device will not be
+set at all, without an error message. Another example: You cannot use a
+partition on e.g. the sixth SCSI disk as the root filesystem, if you
+want to specify it by name. This is, because only the devices up to
+/dev/sde are in the table above, but not /dev/sdf. Although, you can
+use the sixth SCSI disk for the root FS, but you have to specify the
+device by number... (see below). Or, even more strange, you can use the
+fact that there is no range checking of the partition number, and your
+knowledge that each disk uses 16 minors, and write "root=/dev/sde17"
+(for /dev/sdf1).
+
+[Strange and maybe uninteresting stuff OFF]
+
+If the device containing your root partition isn't in the table
+above, you can also specify it by major and minor numbers. These are
+written in hex, with no prefix and no separator between. E.g., if you
+have a CD with contents appropriate as a root filesystem in the first
+SCSI CD-ROM drive, you boot from it by "root=0b00". Here, hex "0b" =
+decimal 11 is the major of SCSI CD-ROMs, and the minor 0 stands for
+the first of these. You can find out all valid major numbers by
+looking into include/linux/major.h.
+
+In addition to major and minor numbers, if the device containing your
+root partition uses a partition table format with unique partition
+identifiers, then you may use them.  For instance,
+"root=PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF".  It is also
+possible to reference another partition on the same device using a
+known partition UUID as the starting point.  For example,
+if partition 5 of the device has the UUID of
+00112233-4455-6677-8899-AABBCCDDEEFF then partition 3 may be found as
+follows:
+
+  PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF/PARTNROFF=-2
+
+Authoritative information can be found in
+"Documentation/admin-guide/kernel-parameters.rst".
+
+
+2.2) ro, rw
+-----------
+
+:Syntax: ro
+:or:     rw
+
+These two options tell the kernel whether it should mount the root
+filesystem read-only or read-write. The default is read-only, except
+for ramdisks, which default to read-write.
+
+
+2.3) debug
+----------
+
+:Syntax: debug
+
+This raises the kernel log level to 10 (the default is 7). This is the
+same level as set by the "dmesg" command, just that the maximum level
+selectable by dmesg is 8.
+
+
+2.4) debug=
+-----------
+
+:Syntax: debug=<device>
+
+This option causes certain kernel messages be printed to the selected
+debugging device. This can aid debugging the kernel, since the
+messages can be captured and analyzed on some other machine. Which
+devices are possible depends on the machine type. There are no checks
+for the validity of the device name. If the device isn't implemented,
+nothing happens.
+
+Messages logged this way are in general stack dumps after kernel
+memory faults or bad kernel traps, and kernel panics. To be exact: all
+messages of level 0 (panic messages) and all messages printed while
+the log level is 8 or more (their level doesn't matter). Before stack
+dumps, the kernel sets the log level to 10 automatically. A level of
+at least 8 can also be set by the "debug" command line option (see
+2.3) and at run time with "dmesg -n 8".
+
+Devices possible for Amiga:
+
+ - "ser":
+	  built-in serial port; parameters: 9600bps, 8N1
+ - "mem":
+	  Save the messages to a reserved area in chip mem. After
+          rebooting, they can be read under AmigaOS with the tool
+          'dmesg'.
+
+Devices possible for Atari:
+
+ - "ser1":
+	   ST-MFP serial port ("Modem1"); parameters: 9600bps, 8N1
+ - "ser2":
+	   SCC channel B serial port ("Modem2"); parameters: 9600bps, 8N1
+ - "ser" :
+	   default serial port
+           This is "ser2" for a Falcon, and "ser1" for any other machine
+ - "midi":
+	   The MIDI port; parameters: 31250bps, 8N1
+ - "par" :
+	   parallel port
+
+           The printing routine for this implements a timeout for the
+           case there's no printer connected (else the kernel would
+           lock up). The timeout is not exact, but usually a few
+           seconds.
+
+
+2.6) ramdisk_size=
+------------------
+
+:Syntax: ramdisk_size=<size>
+
+This option instructs the kernel to set up a ramdisk of the given
+size in KBytes. Do not use this option if the ramdisk contents are
+passed by bootstrap! In this case, the size is selected automatically
+and should not be overwritten.
+
+The only application is for root filesystems on floppy disks, that
+should be loaded into memory. To do that, select the corresponding
+size of the disk as ramdisk size, and set the root device to the disk
+drive (with "root=").
+
+
+2.7) swap=
+
+  I can't find any sign of this option in 2.2.6.
+
+2.8) buff=
+-----------
+
+  I can't find any sign of this option in 2.2.6.
+
+
+3) General Device Options (Amiga and Atari)
+===========================================
+
+3.1) ether=
+-----------
+
+:Syntax: ether=[<irq>[,<base_addr>[,<mem_start>[,<mem_end>]]]],<dev-name>
+
+<dev-name> is the name of a net driver, as specified in
+drivers/net/Space.c in the Linux source. Most prominent are eth0, ...
+eth3, sl0, ... sl3, ppp0, ..., ppp3, dummy, and lo.
+
+The non-ethernet drivers (sl, ppp, dummy, lo) obviously ignore the
+settings by this options. Also, the existing ethernet drivers for
+Linux/m68k (ariadne, a2065, hydra) don't use them because Zorro boards
+are really Plug-'n-Play, so the "ether=" option is useless altogether
+for Linux/m68k.
+
+
+3.2) hd=
+--------
+
+:Syntax: hd=<cylinders>,<heads>,<sectors>
+
+This option sets the disk geometry of an IDE disk. The first hd=
+option is for the first IDE disk, the second for the second one.
+(I.e., you can give this option twice.) In most cases, you won't have
+to use this option, since the kernel can obtain the geometry data
+itself. It exists just for the case that this fails for one of your
+disks.
+
+
+3.3) max_scsi_luns=
+-------------------
+
+:Syntax: max_scsi_luns=<n>
+
+Sets the maximum number of LUNs (logical units) of SCSI devices to
+be scanned. Valid values for <n> are between 1 and 8. Default is 8 if
+"Probe all LUNs on each SCSI device" was selected during the kernel
+configuration, else 1.
+
+
+3.4) st=
+--------
+
+:Syntax: st=<buffer_size>,[<write_thres>,[<max_buffers>]]
+
+Sets several parameters of the SCSI tape driver. <buffer_size> is
+the number of 512-byte buffers reserved for tape operations for each
+device. <write_thres> sets the number of blocks which must be filled
+to start an actual write operation to the tape. Maximum value is the
+total number of buffers. <max_buffer> limits the total number of
+buffers allocated for all tape devices.
+
+
+3.5) dmasound=
+--------------
+
+:Syntax: dmasound=[<buffers>,<buffer-size>[,<catch-radius>]]
+
+This option controls some configurations of the Linux/m68k DMA sound
+driver (Amiga and Atari): <buffers> is the number of buffers you want
+to use (minimum 4, default 4), <buffer-size> is the size of each
+buffer in kilobytes (minimum 4, default 32) and <catch-radius> says
+how much percent of error will be tolerated when setting a frequency
+(maximum 10, default 0). For example with 3% you can play 8000Hz
+AU-Files on the Falcon with its hardware frequency of 8195Hz and thus
+don't need to expand the sound.
+
+
+
+4) Options for Atari Only
+=========================
+
+4.1) video=
+-----------
+
+:Syntax: video=<fbname>:<sub-options...>
+
+The <fbname> parameter specifies the name of the frame buffer,
+eg. most atari users will want to specify `atafb` here. The
+<sub-options> is a comma-separated list of the sub-options listed
+below.
+
+NB:
+    Please notice that this option was renamed from `atavideo` to
+    `video` during the development of the 1.3.x kernels, thus you
+    might need to update your boot-scripts if upgrading to 2.x from
+    an 1.2.x kernel.
+
+NBB:
+    The behavior of video= was changed in 2.1.57 so the recommended
+    option is to specify the name of the frame buffer.
+
+4.1.1) Video Mode
+-----------------
+
+This sub-option may be any of the predefined video modes, as listed
+in atari/atafb.c in the Linux/m68k source tree. The kernel will
+activate the given video mode at boot time and make it the default
+mode, if the hardware allows. Currently defined names are:
+
+ - stlow           : 320x200x4
+ - stmid, default5 : 640x200x2
+ - sthigh, default4: 640x400x1
+ - ttlow           : 320x480x8, TT only
+ - ttmid, default1 : 640x480x4, TT only
+ - tthigh, default2: 1280x960x1, TT only
+ - vga2            : 640x480x1, Falcon only
+ - vga4            : 640x480x2, Falcon only
+ - vga16, default3 : 640x480x4, Falcon only
+ - vga256          : 640x480x8, Falcon only
+ - falh2           : 896x608x1, Falcon only
+ - falh16          : 896x608x4, Falcon only
+
+If no video mode is given on the command line, the kernel tries the
+modes names "default<n>" in turn, until one is possible with the
+hardware in use.
+
+A video mode setting doesn't make sense, if the external driver is
+activated by a "external:" sub-option.
+
+4.1.2) inverse
+--------------
+
+Invert the display. This affects only text consoles.
+Usually, the background is chosen to be black. With this
+option, you can make the background white.
+
+4.1.3) font
+-----------
+
+:Syntax: font:<fontname>
+
+Specify the font to use in text modes. Currently you can choose only
+between `VGA8x8`, `VGA8x16` and `PEARL8x8`. `VGA8x8` is default, if the
+vertical size of the display is less than 400 pixel rows. Otherwise, the
+`VGA8x16` font is the default.
+
+4.1.4) `hwscroll_`
+------------------
+
+:Syntax: `hwscroll_<n>`
+
+The number of additional lines of video memory to reserve for
+speeding up the scrolling ("hardware scrolling"). Hardware scrolling
+is possible only if the kernel can set the video base address in steps
+fine enough. This is true for STE, MegaSTE, TT, and Falcon. It is not
+possible with plain STs and graphics cards (The former because the
+base address must be on a 256 byte boundary there, the latter because
+the kernel doesn't know how to set the base address at all.)
+
+By default, <n> is set to the number of visible text lines on the
+display. Thus, the amount of video memory is doubled, compared to no
+hardware scrolling. You can turn off the hardware scrolling altogether
+by setting <n> to 0.
+
+4.1.5) internal:
+----------------
+
+:Syntax: internal:<xres>;<yres>[;<xres_max>;<yres_max>;<offset>]
+
+This option specifies the capabilities of some extended internal video
+hardware, like e.g. OverScan. <xres> and <yres> give the (extended)
+dimensions of the screen.
+
+If your OverScan needs a black border, you have to write the last
+three arguments of the "internal:". <xres_max> is the maximum line
+length the hardware allows, <yres_max> the maximum number of lines.
+<offset> is the offset of the visible part of the screen memory to its
+physical start, in bytes.
+
+Often, extended interval video hardware has to be activated somehow.
+For this, see the "sw_*" options below.
+
+4.1.6) external:
+----------------
+
+:Syntax:
+  external:<xres>;<yres>;<depth>;<org>;<scrmem>[;<scrlen>[;<vgabase>
+  [;<colw>[;<coltype>[;<xres_virtual>]]]]]
+
+.. I had to break this line...
+
+This is probably the most complicated parameter... It specifies that
+you have some external video hardware (a graphics board), and how to
+use it under Linux/m68k. The kernel cannot know more about the hardware
+than you tell it here! The kernel also is unable to set or change any
+video modes, since it doesn't know about any board internal. So, you
+have to switch to that video mode before you start Linux, and cannot
+switch to another mode once Linux has started.
+
+The first 3 parameters of this sub-option should be obvious: <xres>,
+<yres> and <depth> give the dimensions of the screen and the number of
+planes (depth). The depth is the logarithm to base 2 of the number
+of colors possible. (Or, the other way round: The number of colors is
+2^depth).
+
+You have to tell the kernel furthermore how the video memory is
+organized. This is done by a letter as <org> parameter:
+
+ 'n':
+      "normal planes", i.e. one whole plane after another
+ 'i':
+      "interleaved planes", i.e. 16 bit of the first plane, than 16 bit
+      of the next, and so on... This mode is used only with the
+      built-in Atari video modes, I think there is no card that
+      supports this mode.
+ 'p':
+      "packed pixels", i.e. <depth> consecutive bits stand for all
+      planes of one pixel; this is the most common mode for 8 planes
+      (256 colors) on graphic cards
+ 't':
+      "true color" (more or less packed pixels, but without a color
+      lookup table); usually depth is 24
+
+For monochrome modes (i.e., <depth> is 1), the <org> letter has a
+different meaning:
+
+ 'n':
+      normal colors, i.e. 0=white, 1=black
+ 'i':
+      inverted colors, i.e. 0=black, 1=white
+
+The next important information about the video hardware is the base
+address of the video memory. That is given in the <scrmem> parameter,
+as a hexadecimal number with a "0x" prefix. You have to find out this
+address in the documentation of your hardware.
+
+The next parameter, <scrlen>, tells the kernel about the size of the
+video memory. If it's missing, the size is calculated from <xres>,
+<yres>, and <depth>. For now, it is not useful to write a value here.
+It would be used only for hardware scrolling (which isn't possible
+with the external driver, because the kernel cannot set the video base
+address), or for virtual resolutions under X (which the X server
+doesn't support yet). So, it's currently best to leave this field
+empty, either by ending the "external:" after the video address or by
+writing two consecutive semicolons, if you want to give a <vgabase>
+(it is allowed to leave this parameter empty).
+
+The <vgabase> parameter is optional. If it is not given, the kernel
+cannot read or write any color registers of the video hardware, and
+thus you have to set appropriate colors before you start Linux. But if
+your card is somehow VGA compatible, you can tell the kernel the base
+address of the VGA register set, so it can change the color lookup
+table. You have to look up this address in your board's documentation.
+To avoid misunderstandings: <vgabase> is the _base_ address, i.e. a 4k
+aligned address. For read/writing the color registers, the kernel
+uses the addresses vgabase+0x3c7...vgabase+0x3c9. The <vgabase>
+parameter is written in hexadecimal with a "0x" prefix, just as
+<scrmem>.
+
+<colw> is meaningful only if <vgabase> is specified. It tells the
+kernel how wide each of the color register is, i.e. the number of bits
+per single color (red/green/blue). Default is 6, another quite usual
+value is 8.
+
+Also <coltype> is used together with <vgabase>. It tells the kernel
+about the color register model of your gfx board. Currently, the types
+"vga" (which is also the default) and "mv300" (SANG MV300) are
+implemented.
+
+Parameter <xres_virtual> is required for ProMST or ET4000 cards where
+the physical linelength differs from the visible length. With ProMST,
+xres_virtual must be set to 2048. For ET4000, xres_virtual depends on the
+initialisation of the video-card.
+If you're missing a corresponding yres_virtual: the external part is legacy,
+therefore we don't support hardware-dependent functions like hardware-scroll,
+panning or blanking.
+
+4.1.7) eclock:
+--------------
+
+The external pixel clock attached to the Falcon VIDEL shifter. This
+currently works only with the ScreenWonder!
+
+4.1.8) monitorcap:
+-------------------
+
+:Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax>
+
+This describes the capabilities of a multisync monitor. Don't use it
+with a fixed-frequency monitor! For now, only the Falcon frame buffer
+uses the settings of "monitorcap:".
+
+<vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies
+your monitor can work with, in Hz. <hmin> and <hmax> are the same for
+the horizontal frequency, in kHz.
+
+  The defaults are 58;62;31;32 (VGA compatible).
+
+  The defaults for TV/SC1224/SC1435 cover both PAL and NTSC standards.
+
+4.1.9) keep
+------------
+
+If this option is given, the framebuffer device doesn't do any video
+mode calculations and settings on its own. The only Atari fb device
+that does this currently is the Falcon.
+
+What you reach with this: Settings for unknown video extensions
+aren't overridden by the driver, so you can still use the mode found
+when booting, when the driver doesn't know to set this mode itself.
+But this also means, that you can't switch video modes anymore...
+
+An example where you may want to use "keep" is the ScreenBlaster for
+the Falcon.
+
+
+4.2) atamouse=
+--------------
+
+:Syntax: atamouse=<x-threshold>,[<y-threshold>]
+
+With this option, you can set the mouse movement reporting threshold.
+This is the number of pixels of mouse movement that have to accumulate
+before the IKBD sends a new mouse packet to the kernel. Higher values
+reduce the mouse interrupt load and thus reduce the chance of keyboard
+overruns. Lower values give a slightly faster mouse responses and
+slightly better mouse tracking.
+
+You can set the threshold in x and y separately, but usually this is
+of little practical use. If there's just one number in the option, it
+is used for both dimensions. The default value is 2 for both
+thresholds.
+
+
+4.3) ataflop=
+-------------
+
+:Syntax: ataflop=<drive type>[,<trackbuffering>[,<steprateA>[,<steprateB>]]]
+
+   The drive type may be 0, 1, or 2, for DD, HD, and ED, resp. This
+   setting affects how many buffers are reserved and which formats are
+   probed (see also below). The default is 1 (HD). Only one drive type
+   can be selected. If you have two disk drives, select the "better"
+   type.
+
+   The second parameter <trackbuffer> tells the kernel whether to use
+   track buffering (1) or not (0). The default is machine-dependent:
+   no for the Medusa and yes for all others.
+
+   With the two following parameters, you can change the default
+   steprate used for drive A and B, resp.
+
+
+4.4) atascsi=
+-------------
+
+:Syntax: atascsi=<can_queue>[,<cmd_per_lun>[,<scat-gat>[,<host-id>[,<tagged>]]]]
+
+This option sets some parameters for the Atari native SCSI driver.
+Generally, any number of arguments can be omitted from the end. And
+for each of the numbers, a negative value means "use default". The
+defaults depend on whether TT-style or Falcon-style SCSI is used.
+Below, defaults are noted as n/m, where the first value refers to
+TT-SCSI and the latter to Falcon-SCSI. If an illegal value is given
+for one parameter, an error message is printed and that one setting is
+ignored (others aren't affected).
+
+  <can_queue>:
+    This is the maximum number of SCSI commands queued internally to the
+    Atari SCSI driver. A value of 1 effectively turns off the driver
+    internal multitasking (if it causes problems). Legal values are >=
+    1. <can_queue> can be as high as you like, but values greater than
+    <cmd_per_lun> times the number of SCSI targets (LUNs) you have
+    don't make sense. Default: 16/8.
+
+  <cmd_per_lun>:
+    Maximum number of SCSI commands issued to the driver for one
+    logical unit (LUN, usually one SCSI target). Legal values start
+    from 1. If tagged queuing (see below) is not used, values greater
+    than 2 don't make sense, but waste memory. Otherwise, the maximum
+    is the number of command tags available to the driver (currently
+    32). Default: 8/1. (Note: Values > 1 seem to cause problems on a
+    Falcon, cause not yet known.)
+
+    The <cmd_per_lun> value at a great part determines the amount of
+    memory SCSI reserves for itself. The formula is rather
+    complicated, but I can give you some hints:
+
+      no scatter-gather:
+	cmd_per_lun * 232 bytes
+      full scatter-gather:
+	cmd_per_lun * approx. 17 Kbytes
+
+  <scat-gat>:
+    Size of the scatter-gather table, i.e. the number of requests
+    consecutive on the disk that can be merged into one SCSI command.
+    Legal values are between 0 and 255. Default: 255/0. Note: This
+    value is forced to 0 on a Falcon, since scatter-gather isn't
+    possible with the ST-DMA. Not using scatter-gather hurts
+    performance significantly.
+
+  <host-id>:
+    The SCSI ID to be used by the initiator (your Atari). This is
+    usually 7, the highest possible ID. Every ID on the SCSI bus must
+    be unique. Default: determined at run time: If the NV-RAM checksum
+    is valid, and bit 7 in byte 30 of the NV-RAM is set, the lower 3
+    bits of this byte are used as the host ID. (This method is defined
+    by Atari and also used by some TOS HD drivers.) If the above
+    isn't given, the default ID is 7. (both, TT and Falcon).
+
+  <tagged>:
+    0 means turn off tagged queuing support, all other values > 0 mean
+    use tagged queuing for targets that support it. Default: currently
+    off, but this may change when tagged queuing handling has been
+    proved to be reliable.
+
+    Tagged queuing means that more than one command can be issued to
+    one LUN, and the SCSI device itself orders the requests so they
+    can be performed in optimal order. Not all SCSI devices support
+    tagged queuing (:-().
+
+4.5 switches=
+-------------
+
+:Syntax: switches=<list of switches>
+
+With this option you can switch some hardware lines that are often
+used to enable/disable certain hardware extensions. Examples are
+OverScan, overclocking, ...
+
+The <list of switches> is a comma-separated list of the following
+items:
+
+  ikbd:
+	set RTS of the keyboard ACIA high
+  midi:
+	set RTS of the MIDI ACIA high
+  snd6:
+	set bit 6 of the PSG port A
+  snd7:
+	set bit 6 of the PSG port A
+
+It doesn't make sense to mention a switch more than once (no
+difference to only once), but you can give as many switches as you
+want to enable different features. The switch lines are set as early
+as possible during kernel initialization (even before determining the
+present hardware.)
+
+All of the items can also be prefixed with `ov_`, i.e. `ov_ikbd`,
+`ov_midi`, ... These options are meant for switching on an OverScan
+video extension. The difference to the bare option is that the
+switch-on is done after video initialization, and somehow synchronized
+to the HBLANK. A speciality is that ov_ikbd and ov_midi are switched
+off before rebooting, so that OverScan is disabled and TOS boots
+correctly.
+
+If you give an option both, with and without the `ov_` prefix, the
+earlier initialization (`ov_`-less) takes precedence. But the
+switching-off on reset still happens in this case.
+
+5) Options for Amiga Only:
+==========================
+
+5.1) video=
+-----------
+
+:Syntax: video=<fbname>:<sub-options...>
+
+The <fbname> parameter specifies the name of the frame buffer, valid
+options are `amifb`, `cyber`, 'virge', `retz3` and `clgen`, provided
+that the respective frame buffer devices have been compiled into the
+kernel (or compiled as loadable modules). The behavior of the <fbname>
+option was changed in 2.1.57 so it is now recommended to specify this
+option.
+
+The <sub-options> is a comma-separated list of the sub-options listed
+below. This option is organized similar to the Atari version of the
+"video"-option (4.1), but knows fewer sub-options.
+
+5.1.1) video mode
+-----------------
+
+Again, similar to the video mode for the Atari (see 4.1.1). Predefined
+modes depend on the used frame buffer device.
+
+OCS, ECS and AGA machines all use the color frame buffer. The following
+predefined video modes are available:
+
+NTSC modes:
+ - ntsc            : 640x200, 15 kHz, 60 Hz
+ - ntsc-lace       : 640x400, 15 kHz, 60 Hz interlaced
+
+PAL modes:
+ - pal             : 640x256, 15 kHz, 50 Hz
+ - pal-lace        : 640x512, 15 kHz, 50 Hz interlaced
+
+ECS modes:
+ - multiscan       : 640x480, 29 kHz, 57 Hz
+ - multiscan-lace  : 640x960, 29 kHz, 57 Hz interlaced
+ - euro36          : 640x200, 15 kHz, 72 Hz
+ - euro36-lace     : 640x400, 15 kHz, 72 Hz interlaced
+ - euro72          : 640x400, 29 kHz, 68 Hz
+ - euro72-lace     : 640x800, 29 kHz, 68 Hz interlaced
+ - super72         : 800x300, 23 kHz, 70 Hz
+ - super72-lace    : 800x600, 23 kHz, 70 Hz interlaced
+ - dblntsc-ff      : 640x400, 27 kHz, 57 Hz
+ - dblntsc-lace    : 640x800, 27 kHz, 57 Hz interlaced
+ - dblpal-ff       : 640x512, 27 kHz, 47 Hz
+ - dblpal-lace     : 640x1024, 27 kHz, 47 Hz interlaced
+ - dblntsc         : 640x200, 27 kHz, 57 Hz doublescan
+ - dblpal          : 640x256, 27 kHz, 47 Hz doublescan
+
+VGA modes:
+ - vga             : 640x480, 31 kHz, 60 Hz
+ - vga70           : 640x400, 31 kHz, 70 Hz
+
+Please notice that the ECS and VGA modes require either an ECS or AGA
+chipset, and that these modes are limited to 2-bit color for the ECS
+chipset and 8-bit color for the AGA chipset.
+
+5.1.2) depth
+------------
+
+:Syntax: depth:<nr. of bit-planes>
+
+Specify the number of bit-planes for the selected video-mode.
+
+5.1.3) inverse
+--------------
+
+Use inverted display (black on white). Functionally the same as the
+"inverse" sub-option for the Atari.
+
+5.1.4) font
+-----------
+
+:Syntax: font:<fontname>
+
+Specify the font to use in text modes. Functionally the same as the
+"font" sub-option for the Atari, except that `PEARL8x8` is used instead
+of `VGA8x8` if the vertical size of the display is less than 400 pixel
+rows.
+
+5.1.5) monitorcap:
+-------------------
+
+:Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax>
+
+This describes the capabilities of a multisync monitor. For now, only
+the color frame buffer uses the settings of "monitorcap:".
+
+<vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies
+your monitor can work with, in Hz. <hmin> and <hmax> are the same for
+the horizontal frequency, in kHz.
+
+The defaults are 50;90;15;38 (Generic Amiga multisync monitor).
+
+
+5.2) fd_def_df0=
+----------------
+
+:Syntax: fd_def_df0=<value>
+
+Sets the df0 value for "silent" floppy drives. The value should be in
+hexadecimal with "0x" prefix.
+
+
+5.3) wd33c93=
+-------------
+
+:Syntax: wd33c93=<sub-options...>
+
+These options affect the A590/A2091, A3000 and GVP Series II SCSI
+controllers.
+
+The <sub-options> is a comma-separated list of the sub-options listed
+below.
+
+5.3.1) nosync
+-------------
+
+:Syntax: nosync:bitmask
+
+bitmask is a byte where the 1st 7 bits correspond with the 7
+possible SCSI devices. Set a bit to prevent sync negotiation on that
+device. To maintain backwards compatibility, a command-line such as
+"wd33c93=255" will be automatically translated to
+"wd33c93=nosync:0xff". The default is to disable sync negotiation for
+all devices, eg. nosync:0xff.
+
+5.3.2) period
+-------------
+
+:Syntax: period:ns
+
+`ns` is the minimum # of nanoseconds in a SCSI data transfer
+period. Default is 500; acceptable values are 250 - 1000.
+
+5.3.3) disconnect
+-----------------
+
+:Syntax: disconnect:x
+
+Specify x = 0 to never allow disconnects, 2 to always allow them.
+x = 1 does 'adaptive' disconnects, which is the default and generally
+the best choice.
+
+5.3.4) debug
+------------
+
+:Syntax: debug:x
+
+If `DEBUGGING_ON` is defined, x is a bit mask that causes various
+types of debug output to printed - see the DB_xxx defines in
+wd33c93.h.
+
+5.3.5) clock
+------------
+
+:Syntax: clock:x
+
+x = clock input in MHz for WD33c93 chip. Normal values would be from
+8 through 20. The default value depends on your hostadapter(s),
+default for the A3000 internal controller is 14, for the A2091 it's 8
+and for the GVP hostadapters it's either 8 or 14, depending on the
+hostadapter and the SCSI-clock jumper present on some GVP
+hostadapters.
+
+5.3.6) next
+-----------
+
+No argument. Used to separate blocks of keywords when there's more
+than one wd33c93-based host adapter in the system.
+
+5.3.7) nodma
+------------
+
+:Syntax: nodma:x
+
+If x is 1 (or if the option is just written as "nodma"), the WD33c93
+controller will not use DMA (= direct memory access) to access the
+Amiga's memory.  This is useful for some systems (like A3000's and
+A4000's with the A3640 accelerator, revision 3.0) that have problems
+using DMA to chip memory.  The default is 0, i.e. to use DMA if
+possible.
+
+
+5.4) gvp11=
+-----------
+
+:Syntax: gvp11=<addr-mask>
+
+The earlier versions of the GVP driver did not handle DMA
+address-mask settings correctly which made it necessary for some
+people to use this option, in order to get their GVP controller
+running under Linux. These problems have hopefully been solved and the
+use of this option is now highly unrecommended!
+
+Incorrect use can lead to unpredictable behavior, so please only use
+this option if you *know* what you are doing and have a reason to do
+so. In any case if you experience problems and need to use this
+option, please inform us about it by mailing to the Linux/68k kernel
+mailing list.
+
+The address mask set by this option specifies which addresses are
+valid for DMA with the GVP Series II SCSI controller. An address is
+valid, if no bits are set except the bits that are set in the mask,
+too.
+
+Some versions of the GVP can only DMA into a 24 bit address range,
+some can address a 25 bit address range while others can use the whole
+32 bit address range for DMA. The correct setting depends on your
+controller and should be autodetected by the driver. An example is the
+24 bit region which is specified by a mask of 0x00fffffe.
diff --git a/Documentation/arch/nios2/features.rst b/Documentation/arch/nios2/features.rst
new file mode 100644
index 000000000000..8449e63f69b2
--- /dev/null
+++ b/Documentation/arch/nios2/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features nios2
diff --git a/Documentation/arch/nios2/index.rst b/Documentation/arch/nios2/index.rst
new file mode 100644
index 000000000000..4468fe1a1037
--- /dev/null
+++ b/Documentation/arch/nios2/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Nios II Specific Documentation
+==============================
+
+.. toctree::
+   :maxdepth: 2
+   :numbered:
+
+   nios2
+   features
diff --git a/Documentation/arch/nios2/nios2.rst b/Documentation/arch/nios2/nios2.rst
new file mode 100644
index 000000000000..43da3f7cee76
--- /dev/null
+++ b/Documentation/arch/nios2/nios2.rst
@@ -0,0 +1,24 @@
+=================================
+Linux on the Nios II architecture
+=================================
+
+This is a port of Linux to Nios II (nios2) processor.
+
+In order to compile for Nios II, you need a version of GCC with support for the generic
+system call ABI. Please see this link for more information on how compiling and booting
+software for the Nios II platform:
+http://www.rocketboards.org/foswiki/Documentation/NiosIILinuxUserManual
+
+For reference, please see the following link:
+http://www.altera.com/literature/lit-nio2.jsp
+
+What is Nios II?
+================
+Nios II is a 32-bit embedded-processor architecture designed specifically for the
+Altera family of FPGAs. In order to support Linux, Nios II needs to be configured
+with MMU and hardware multiplier enabled.
+
+Nios II ABI
+===========
+Please refer to chapter "Application Binary Interface" in Nios II Processor Reference
+Handbook.
diff --git a/Documentation/arch/openrisc/features.rst b/Documentation/arch/openrisc/features.rst
new file mode 100644
index 000000000000..3f7c40d219f2
--- /dev/null
+++ b/Documentation/arch/openrisc/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features openrisc
diff --git a/Documentation/arch/openrisc/index.rst b/Documentation/arch/openrisc/index.rst
new file mode 100644
index 000000000000..6879f998b87a
--- /dev/null
+++ b/Documentation/arch/openrisc/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+OpenRISC Architecture
+=====================
+
+.. toctree::
+   :maxdepth: 2
+
+   openrisc_port
+   todo
+
+   features
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/arch/openrisc/openrisc_port.rst b/Documentation/arch/openrisc/openrisc_port.rst
new file mode 100644
index 000000000000..657ac4af7be6
--- /dev/null
+++ b/Documentation/arch/openrisc/openrisc_port.rst
@@ -0,0 +1,121 @@
+==============
+OpenRISC Linux
+==============
+
+This is a port of Linux to the OpenRISC class of microprocessors; the initial
+target architecture, specifically, is the 32-bit OpenRISC 1000 family (or1k).
+
+For information about OpenRISC processors and ongoing development:
+
+	=======		=============================
+	website		https://openrisc.io
+	email		openrisc@lists.librecores.org
+	=======		=============================
+
+---------------------------------------------------------------------
+
+Build instructions for OpenRISC toolchain and Linux
+===================================================
+
+In order to build and run Linux for OpenRISC, you'll need at least a basic
+toolchain and, perhaps, the architectural simulator.  Steps to get these bits
+in place are outlined here.
+
+1) Toolchain
+
+Toolchain binaries can be obtained from openrisc.io or our github releases page.
+Instructions for building the different toolchains can be found on openrisc.io
+or Stafford's toolchain build and release scripts.
+
+	==========	=================================================
+	binaries	https://github.com/openrisc/or1k-gcc/releases
+	toolchains	https://openrisc.io/software
+	building	https://github.com/stffrdhrn/or1k-toolchain-build
+	==========	=================================================
+
+2) Building
+
+Build the Linux kernel as usual::
+
+	make ARCH=openrisc CROSS_COMPILE="or1k-linux-" defconfig
+	make ARCH=openrisc CROSS_COMPILE="or1k-linux-"
+
+3) Running on FPGA (optional)
+
+The OpenRISC community typically uses FuseSoC to manage building and programming
+an SoC into an FPGA.  The below is an example of programming a De0 Nano
+development board with the OpenRISC SoC.  During the build FPGA RTL is code
+downloaded from the FuseSoC IP cores repository and built using the FPGA vendor
+tools.  Binaries are loaded onto the board with openocd.
+
+::
+
+	git clone https://github.com/olofk/fusesoc
+	cd fusesoc
+	sudo pip install -e .
+
+	fusesoc init
+	fusesoc build de0_nano
+	fusesoc pgm de0_nano
+
+	openocd -f interface/altera-usb-blaster.cfg \
+		-f board/or1k_generic.cfg
+
+	telnet localhost 4444
+	> init
+	> halt; load_image vmlinux ; reset
+
+4) Running on a Simulator (optional)
+
+QEMU is a processor emulator which we recommend for simulating the OpenRISC
+platform.  Please follow the OpenRISC instructions on the QEMU website to get
+Linux running on QEMU.  You can build QEMU yourself, but your Linux distribution
+likely provides binary packages to support OpenRISC.
+
+	=============	======================================================
+	qemu openrisc	https://wiki.qemu.org/Documentation/Platforms/OpenRISC
+	=============	======================================================
+
+---------------------------------------------------------------------
+
+Terminology
+===========
+
+In the code, the following particles are used on symbols to limit the scope
+to more or less specific processor implementations:
+
+========= =======================================
+openrisc: the OpenRISC class of processors
+or1k:     the OpenRISC 1000 family of processors
+or1200:   the OpenRISC 1200 processor
+========= =======================================
+
+---------------------------------------------------------------------
+
+History
+========
+
+18-11-2003	Matjaz Breskvar (phoenix@bsemi.com)
+	initial port of linux to OpenRISC/or32 architecture.
+        all the core stuff is implemented and seams usable.
+
+08-12-2003	Matjaz Breskvar (phoenix@bsemi.com)
+	complete change of TLB miss handling.
+	rewrite of exceptions handling.
+	fully functional sash-3.6 in default initrd.
+	a much improved version with changes all around.
+
+10-04-2004	Matjaz Breskvar (phoenix@bsemi.com)
+	alot of bugfixes all over.
+	ethernet support, functional http and telnet servers.
+	running many standard linux apps.
+
+26-06-2004	Matjaz Breskvar (phoenix@bsemi.com)
+	port to 2.6.x
+
+30-11-2004	Matjaz Breskvar (phoenix@bsemi.com)
+	lots of bugfixes and enhancments.
+	added opencores framebuffer driver.
+
+09-10-2010    Jonas Bonn (jonas@southpole.se)
+	major rewrite to bring up to par with upstream Linux 2.6.36
diff --git a/Documentation/arch/openrisc/todo.rst b/Documentation/arch/openrisc/todo.rst
new file mode 100644
index 000000000000..420b18b87eda
--- /dev/null
+++ b/Documentation/arch/openrisc/todo.rst
@@ -0,0 +1,15 @@
+====
+TODO
+====
+
+The OpenRISC Linux port is fully functional and has been tracking upstream
+since 2.6.35.  There are, however, remaining items to be completed within
+the coming months.  Here's a list of known-to-be-less-than-stellar items
+that are due for investigation shortly, i.e. our TODO list:
+
+-  Implement the rest of the DMA API... dma_map_sg, etc.
+
+-  Finish the renaming cleanup... there are references to or32 in the code
+   which was an older name for the architecture.  The name we've settled on is
+   or1k and this change is slowly trickling through the stack.  For the time
+   being, or32 is equivalent to or1k.
diff --git a/Documentation/arch/parisc/debugging.rst b/Documentation/arch/parisc/debugging.rst
new file mode 100644
index 000000000000..de1b60402c5b
--- /dev/null
+++ b/Documentation/arch/parisc/debugging.rst
@@ -0,0 +1,46 @@
+=================
+PA-RISC Debugging
+=================
+
+okay, here are some hints for debugging the lower-level parts of
+linux/parisc.
+
+
+1. Absolute addresses
+=====================
+
+A lot of the assembly code currently runs in real mode, which means
+absolute addresses are used instead of virtual addresses as in the
+rest of the kernel.  To translate an absolute address to a virtual
+address you can lookup in System.map, add __PAGE_OFFSET (0x10000000
+currently).
+
+
+2. HPMCs
+========
+
+When real-mode code tries to access non-existent memory, you'll get
+an HPMC instead of a kernel oops.  To debug an HPMC, try to find
+the System Responder/Requestor addresses.  The System Requestor
+address should match (one of the) processor HPAs (high addresses in
+the I/O range); the System Responder address is the address real-mode
+code tried to access.
+
+Typical values for the System Responder address are addresses larger
+than __PAGE_OFFSET (0x10000000) which mean a virtual address didn't
+get translated to a physical address before real-mode code tried to
+access it.
+
+
+3. Q bit fun
+============
+
+Certain, very critical code has to clear the Q bit in the PSW.  What
+happens when the Q bit is cleared is the CPU does not update the
+registers interruption handlers read to find out where the machine
+was interrupted - so if you get an interruption between the instruction
+that clears the Q bit and the RFI that sets it again you don't know
+where exactly it happened.  If you're lucky the IAOQ will point to the
+instruction that cleared the Q bit, if you're not it points anywhere
+at all.  Usually Q bit problems will show themselves in unexplainable
+system hangs or running off the end of physical memory.
diff --git a/Documentation/arch/parisc/features.rst b/Documentation/arch/parisc/features.rst
new file mode 100644
index 000000000000..501d7c450037
--- /dev/null
+++ b/Documentation/arch/parisc/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features parisc
diff --git a/Documentation/arch/parisc/index.rst b/Documentation/arch/parisc/index.rst
new file mode 100644
index 000000000000..240685751825
--- /dev/null
+++ b/Documentation/arch/parisc/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+PA-RISC Architecture
+====================
+
+.. toctree::
+   :maxdepth: 2
+
+   debugging
+   registers
+
+   features
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/arch/parisc/registers.rst b/Documentation/arch/parisc/registers.rst
new file mode 100644
index 000000000000..59c8ecf3e856
--- /dev/null
+++ b/Documentation/arch/parisc/registers.rst
@@ -0,0 +1,154 @@
+================================
+Register Usage for Linux/PA-RISC
+================================
+
+[ an asterisk is used for planned usage which is currently unimplemented ]
+
+General Registers as specified by ABI
+=====================================
+
+Control Registers
+-----------------
+
+===============================	===============================================
+CR 0 (Recovery Counter)		used for ptrace
+CR 1-CR 7(undefined)		unused
+CR 8 (Protection ID)		per-process value*
+CR 9, 12, 13 (PIDS)		unused
+CR10 (CCR)			lazy FPU saving*
+CR11				as specified by ABI (SAR)
+CR14 (interruption vector)	initialized to fault_vector
+CR15 (EIEM)			initialized to all ones*
+CR16 (Interval Timer)		read for cycle count/write starts Interval Tmr
+CR17-CR22			interruption parameters
+CR19				Interrupt Instruction Register
+CR20				Interrupt Space Register
+CR21				Interrupt Offset Register
+CR22				Interrupt PSW
+CR23 (EIRR)			read for pending interrupts/write clears bits
+CR24 (TR 0)			Kernel Space Page Directory Pointer
+CR25 (TR 1)			User   Space Page Directory Pointer
+CR26 (TR 2)			not used
+CR27 (TR 3)			Thread descriptor pointer
+CR28 (TR 4)			not used
+CR29 (TR 5)			not used
+CR30 (TR 6)			current / 0
+CR31 (TR 7)			Temporary register, used in various places
+===============================	===============================================
+
+Space Registers (kernel mode)
+-----------------------------
+
+===============================	===============================================
+SR0				temporary space register
+SR4-SR7 			set to 0
+SR1				temporary space register
+SR2				kernel should not clobber this
+SR3				used for userspace accesses (current process)
+===============================	===============================================
+
+Space Registers (user mode)
+---------------------------
+
+===============================	===============================================
+SR0				temporary space register
+SR1                             temporary space register
+SR2                             holds space of linux gateway page
+SR3                             holds user address space value while in kernel
+SR4-SR7                         Defines short address space for user/kernel
+===============================	===============================================
+
+
+Processor Status Word
+---------------------
+
+===============================	===============================================
+W (64-bit addresses)		0
+E (Little-endian)		0
+S (Secure Interval Timer)	0
+T (Taken Branch Trap)		0
+H (Higher-privilege trap)	0
+L (Lower-privilege trap)	0
+N (Nullify next instruction)	used by C code
+X (Data memory break disable)	0
+B (Taken Branch)		used by C code
+C (code address translation)	1, 0 while executing real-mode code
+V (divide step correction)	used by C code
+M (HPMC mask)			0, 1 while executing HPMC handler*
+C/B (carry/borrow bits)		used by C code
+O (ordered references)		1*
+F (performance monitor)		0
+R (Recovery Counter trap)	0
+Q (collect interruption state)	1 (0 in code directly preceding an rfi)
+P (Protection Identifiers)	1*
+D (Data address translation)	1, 0 while executing real-mode code
+I (external interrupt mask)	used by cli()/sti() macros
+===============================	===============================================
+
+"Invisible" Registers
+---------------------
+
+===============================	===============================================
+PSW default W value		0
+PSW default E value		0
+Shadow Registers		used by interruption handler code
+TOC enable bit			1
+===============================	===============================================
+
+-------------------------------------------------------------------------
+
+The PA-RISC architecture defines 7 registers as "shadow registers".
+Those are used in RETURN FROM INTERRUPTION AND RESTORE instruction to reduce
+the state save and restore time by eliminating the need for general register
+(GR) saves and restores in interruption handlers.
+Shadow registers are the GRs 1, 8, 9, 16, 17, 24, and 25.
+
+-------------------------------------------------------------------------
+
+Register usage notes, originally from John Marvin, with some additional
+notes from Randolph Chung.
+
+For the general registers:
+
+r1,r2,r19-r26,r28,r29 & r31 can be used without saving them first. And of
+course, you need to save them if you care about them, before calling
+another procedure. Some of the above registers do have special meanings
+that you should be aware of:
+
+    r1:
+	The addil instruction is hardwired to place its result in r1,
+	so if you use that instruction be aware of that.
+
+    r2:
+	This is the return pointer. In general you don't want to
+	use this, since you need the pointer to get back to your
+	caller. However, it is grouped with this set of registers
+	since the caller can't rely on the value being the same
+	when you return, i.e. you can copy r2 to another register
+	and return through that register after trashing r2, and
+	that should not cause a problem for the calling routine.
+
+    r19-r22:
+	these are generally regarded as temporary registers.
+	Note that in 64 bit they are arg7-arg4.
+
+    r23-r26:
+	these are arg3-arg0, i.e. you can use them if you
+	don't care about the values that were passed in anymore.
+
+    r28,r29:
+	are ret0 and ret1. They are what you pass return values
+	in. r28 is the primary return. When returning small structures
+	r29 may also be used to pass data back to the caller.
+
+    r30:
+	stack pointer
+
+    r31:
+	the ble instruction puts the return pointer in here.
+
+
+    r3-r18,r27,r30 need to be saved and restored. r3-r18 are just
+    general purpose registers. r27 is the data pointer, and is
+    used to make references to global variables easier. r30 is
+    the stack pointer.
diff --git a/Documentation/arch/sh/booting.rst b/Documentation/arch/sh/booting.rst
new file mode 100644
index 000000000000..d851c49a01bf
--- /dev/null
+++ b/Documentation/arch/sh/booting.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+DeviceTree Booting
+------------------
+
+  Device-tree compatible SH bootloaders are expected to provide the physical
+  address of the device tree blob in r4. Since legacy bootloaders did not
+  guarantee any particular initial register state, kernels built to
+  inter-operate with old bootloaders must either use a builtin DTB or
+  select a legacy board option (something other than CONFIG_SH_DEVICE_TREE)
+  that does not use device tree. Support for the latter is being phased out
+  in favor of device tree.
diff --git a/Documentation/arch/sh/features.rst b/Documentation/arch/sh/features.rst
new file mode 100644
index 000000000000..f722af3b6c99
--- /dev/null
+++ b/Documentation/arch/sh/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features sh
diff --git a/Documentation/arch/sh/index.rst b/Documentation/arch/sh/index.rst
new file mode 100644
index 000000000000..c64776738cf6
--- /dev/null
+++ b/Documentation/arch/sh/index.rst
@@ -0,0 +1,56 @@
+=======================
+SuperH Interfaces Guide
+=======================
+
+:Author: Paul Mundt
+
+.. toctree::
+    :maxdepth: 1
+
+    booting
+    new-machine
+    register-banks
+
+    features
+
+Memory Management
+=================
+
+SH-4
+----
+
+Store Queue API
+~~~~~~~~~~~~~~~
+
+.. kernel-doc:: arch/sh/kernel/cpu/sh4/sq.c
+   :export:
+
+Machine Specific Interfaces
+===========================
+
+mach-dreamcast
+--------------
+
+.. kernel-doc:: arch/sh/boards/mach-dreamcast/rtc.c
+   :internal:
+
+mach-x3proto
+------------
+
+.. kernel-doc:: arch/sh/boards/mach-x3proto/ilsel.c
+   :export:
+
+Busses
+======
+
+SuperHyway
+----------
+
+.. kernel-doc:: drivers/sh/superhyway/superhyway.c
+   :export:
+
+Maple
+-----
+
+.. kernel-doc:: drivers/sh/maple/maple.c
+   :export:
diff --git a/Documentation/arch/sh/new-machine.rst b/Documentation/arch/sh/new-machine.rst
new file mode 100644
index 000000000000..e501c52b3b30
--- /dev/null
+++ b/Documentation/arch/sh/new-machine.rst
@@ -0,0 +1,277 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Adding a new board to LinuxSH
+=============================
+
+               Paul Mundt <lethal@linux-sh.org>
+
+This document attempts to outline what steps are necessary to add support
+for new boards to the LinuxSH port under the new 2.5 and 2.6 kernels. This
+also attempts to outline some of the noticeable changes between the 2.4
+and the 2.5/2.6 SH backend.
+
+1. New Directory Structure
+==========================
+
+The first thing to note is the new directory structure. Under 2.4, most
+of the board-specific code (with the exception of stboards) ended up
+in arch/sh/kernel/ directly, with board-specific headers ending up in
+include/asm-sh/. For the new kernel, things are broken out by board type,
+companion chip type, and CPU type. Looking at a tree view of this directory
+hierarchy looks like the following:
+
+Board-specific code::
+
+    .
+    |-- arch
+    |   `-- sh
+    |       `-- boards
+    |           |-- adx
+    |           |   `-- board-specific files
+    |           |-- bigsur
+    |           |   `-- board-specific files
+    |           |
+    |           ... more boards here ...
+    |
+    `-- include
+	`-- asm-sh
+	    |-- adx
+	    |   `-- board-specific headers
+	    |-- bigsur
+	    |   `-- board-specific headers
+	    |
+	    .. more boards here ...
+
+Next, for companion chips::
+
+    .
+    `-- arch
+	`-- sh
+	    `-- cchips
+		`-- hd6446x
+		    `-- hd64461
+			`-- cchip-specific files
+
+... and so on. Headers for the companion chips are treated the same way as
+board-specific headers. Thus, include/asm-sh/hd64461 is home to all of the
+hd64461-specific headers.
+
+Finally, CPU family support is also abstracted::
+
+    .
+    |-- arch
+    |   `-- sh
+    |       |-- kernel
+    |       |   `-- cpu
+    |       |       |-- sh2
+    |       |       |   `-- SH-2 generic files
+    |       |       |-- sh3
+    |       |       |   `-- SH-3 generic files
+    |       |       `-- sh4
+    |       |           `-- SH-4 generic files
+    |       `-- mm
+    |           `-- This is also broken out per CPU family, so each family can
+    |               have their own set of cache/tlb functions.
+    |
+    `-- include
+	`-- asm-sh
+	    |-- cpu-sh2
+	    |   `-- SH-2 specific headers
+	    |-- cpu-sh3
+	    |   `-- SH-3 specific headers
+	    `-- cpu-sh4
+		`-- SH-4 specific headers
+
+It should be noted that CPU subtypes are _not_ abstracted. Thus, these still
+need to be dealt with by the CPU family specific code.
+
+2. Adding a New Board
+=====================
+
+The first thing to determine is whether the board you are adding will be
+isolated, or whether it will be part of a family of boards that can mostly
+share the same board-specific code with minor differences.
+
+In the first case, this is just a matter of making a directory for your
+board in arch/sh/boards/ and adding rules to hook your board in with the
+build system (more on this in the next section). However, for board families
+it makes more sense to have a common top-level arch/sh/boards/ directory
+and then populate that with sub-directories for each member of the family.
+Both the Solution Engine and the hp6xx boards are an example of this.
+
+After you have setup your new arch/sh/boards/ directory, remember that you
+should also add a directory in include/asm-sh for headers localized to this
+board (if there are going to be more than one). In order to interoperate
+seamlessly with the build system, it's best to have this directory the same
+as the arch/sh/boards/ directory name, though if your board is again part of
+a family, the build system has ways of dealing with this (via incdir-y
+overloading), and you can feel free to name the directory after the family
+member itself.
+
+There are a few things that each board is required to have, both in the
+arch/sh/boards and the include/asm-sh/ hierarchy. In order to better
+explain this, we use some examples for adding an imaginary board. For
+setup code, we're required at the very least to provide definitions for
+get_system_type() and platform_setup(). For our imaginary board, this
+might look something like::
+
+    /*
+    * arch/sh/boards/vapor/setup.c - Setup code for imaginary board
+    */
+    #include <linux/init.h>
+
+    const char *get_system_type(void)
+    {
+	    return "FooTech Vaporboard";
+    }
+
+    int __init platform_setup(void)
+    {
+	    /*
+	    * If our hardware actually existed, we would do real
+	    * setup here. Though it's also sane to leave this empty
+	    * if there's no real init work that has to be done for
+	    * this board.
+	    */
+
+	    /* Start-up imaginary PCI ... */
+
+	    /* And whatever else ... */
+
+	    return 0;
+    }
+
+Our new imaginary board will also have to tie into the machvec in order for it
+to be of any use.
+
+machvec functions fall into a number of categories:
+
+ - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc).
+ - I/O mapping functions (ioport_map, ioport_unmap, etc).
+ - a 'heartbeat' function.
+ - PCI and IRQ initialization routines.
+ - Consistent allocators (for boards that need special allocators,
+   particularly for allocating out of some board-specific SRAM for DMA
+   handles).
+
+There are machvec functions added and removed over time, so always be sure to
+consult include/asm-sh/machvec.h for the current state of the machvec.
+
+The kernel will automatically wrap in generic routines for undefined function
+pointers in the machvec at boot time, as machvec functions are referenced
+unconditionally throughout most of the tree. Some boards have incredibly
+sparse machvecs (such as the dreamcast and sh03), whereas others must define
+virtually everything (rts7751r2d).
+
+Adding a new machine is relatively trivial (using vapor as an example):
+
+If the board-specific definitions are quite minimalistic, as is the case for
+the vast majority of boards, simply having a single board-specific header is
+sufficient.
+
+ - add a new file include/asm-sh/vapor.h which contains prototypes for
+   any machine specific IO functions prefixed with the machine name, for
+   example vapor_inb. These will be needed when filling out the machine
+   vector.
+
+   Note that these prototypes are generated automatically by setting
+   __IO_PREFIX to something sensible. A typical example would be::
+
+	#define __IO_PREFIX vapor
+	#include <asm/io_generic.h>
+
+   somewhere in the board-specific header. Any boards being ported that still
+   have a legacy io.h should remove it entirely and switch to the new model.
+
+ - Add machine vector definitions to the board's setup.c. At a bare minimum,
+   this must be defined as something like::
+
+	struct sh_machine_vector mv_vapor __initmv = {
+		.mv_name = "vapor",
+	};
+	ALIAS_MV(vapor)
+
+ - finally add a file arch/sh/boards/vapor/io.c, which contains definitions of
+   the machine specific io functions (if there are enough to warrant it).
+
+3. Hooking into the Build System
+================================
+
+Now that we have the corresponding directories setup, and all of the
+board-specific code is in place, it's time to look at how to get the
+whole mess to fit into the build system.
+
+Large portions of the build system are now entirely dynamic, and merely
+require the proper entry here and there in order to get things done.
+
+The first thing to do is to add an entry to arch/sh/Kconfig, under the
+"System type" menu::
+
+    config SH_VAPOR
+	    bool "Vapor"
+	    help
+	    select Vapor if configuring for a FooTech Vaporboard.
+
+next, this has to be added into arch/sh/Makefile. All boards require a
+machdir-y entry in order to be built. This entry needs to be the name of
+the board directory as it appears in arch/sh/boards, even if it is in a
+sub-directory (in which case, all parent directories below arch/sh/boards/
+need to be listed). For our new board, this entry can look like::
+
+    machdir-$(CONFIG_SH_VAPOR)	+= vapor
+
+provided that we've placed everything in the arch/sh/boards/vapor/ directory.
+
+Next, the build system assumes that your include/asm-sh directory will also
+be named the same. If this is not the case (as is the case with multiple
+boards belonging to a common family), then the directory name needs to be
+implicitly appended to incdir-y. The existing code manages this for the
+Solution Engine and hp6xx boards, so see these for an example.
+
+Once that is taken care of, it's time to add an entry for the mach type.
+This is done by adding an entry to the end of the arch/sh/tools/mach-types
+list. The method for doing this is self explanatory, and so we won't waste
+space restating it here. After this is done, you will be able to use
+implicit checks for your board if you need this somewhere throughout the
+common code, such as::
+
+	/* Make sure we're on the FooTech Vaporboard */
+	if (!mach_is_vapor())
+		return -ENODEV;
+
+also note that the mach_is_boardname() check will be implicitly forced to
+lowercase, regardless of the fact that the mach-types entries are all
+uppercase. You can read the script if you really care, but it's pretty ugly,
+so you probably don't want to do that.
+
+Now all that's left to do is providing a defconfig for your new board. This
+way, other people who end up with this board can simply use this config
+for reference instead of trying to guess what settings are supposed to be
+used on it.
+
+Also, as soon as you have copied over a sample .config for your new board
+(assume arch/sh/configs/vapor_defconfig), you can also use this directly as a
+build target, and it will be implicitly listed as such in the help text.
+
+Looking at the 'make help' output, you should now see something like:
+
+Architecture specific targets (sh):
+
+  =======================   =============================================
+  zImage                    Compressed kernel image (arch/sh/boot/zImage)
+  adx_defconfig             Build for adx
+  cqreek_defconfig          Build for cqreek
+  dreamcast_defconfig       Build for dreamcast
+  ...
+  vapor_defconfig           Build for vapor
+  =======================   =============================================
+
+which then allows you to do::
+
+    $ make ARCH=sh CROSS_COMPILE=sh4-linux- vapor_defconfig vmlinux
+
+which will in turn copy the defconfig for this board, run it through
+oldconfig (prompting you for any new options since the time of creation),
+and start you on your way to having a functional kernel for your new
+board.
diff --git a/Documentation/arch/sh/register-banks.rst b/Documentation/arch/sh/register-banks.rst
new file mode 100644
index 000000000000..2bef5c8fcbbc
--- /dev/null
+++ b/Documentation/arch/sh/register-banks.rst
@@ -0,0 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+Notes on register bank usage in the kernel
+==========================================
+
+Introduction
+------------
+
+The SH-3 and SH-4 CPU families traditionally include a single partial register
+bank (selected by SR.RB, only r0 ... r7 are banked), whereas other families
+may have more full-featured banking or simply no such capabilities at all.
+
+SR.RB banking
+-------------
+
+In the case of this type of banking, banked registers are mapped directly to
+r0 ... r7 if SR.RB is set to the bank we are interested in, otherwise ldc/stc
+can still be used to reference the banked registers (as r0_bank ... r7_bank)
+when in the context of another bank. The developer must keep the SR.RB value
+in mind when writing code that utilizes these banked registers, for obvious
+reasons. Userspace is also not able to poke at the bank1 values, so these can
+be used rather effectively as scratch registers by the kernel.
+
+Presently the kernel uses several of these registers.
+
+	- r0_bank, r1_bank (referenced as k0 and k1, used for scratch
+	  registers when doing exception handling).
+
+	- r2_bank (used to track the EXPEVT/INTEVT code)
+
+		- Used by do_IRQ() and friends for doing irq mapping based off
+		  of the interrupt exception vector jump table offset
+
+	- r6_bank (global interrupt mask)
+
+		- The SR.IMASK interrupt handler makes use of this to set the
+		  interrupt priority level (used by local_irq_enable())
+
+	- r7_bank (current)
diff --git a/Documentation/arch/sparc/adi.rst b/Documentation/arch/sparc/adi.rst
new file mode 100644
index 000000000000..dbcd8b6e7bc3
--- /dev/null
+++ b/Documentation/arch/sparc/adi.rst
@@ -0,0 +1,286 @@
+================================
+Application Data Integrity (ADI)
+================================
+
+SPARC M7 processor adds the Application Data Integrity (ADI) feature.
+ADI allows a task to set version tags on any subset of its address
+space. Once ADI is enabled and version tags are set for ranges of
+address space of a task, the processor will compare the tag in pointers
+to memory in these ranges to the version set by the application
+previously. Access to memory is granted only if the tag in given pointer
+matches the tag set by the application. In case of mismatch, processor
+raises an exception.
+
+Following steps must be taken by a task to enable ADI fully:
+
+1. Set the user mode PSTATE.mcde bit. This acts as master switch for
+   the task's entire address space to enable/disable ADI for the task.
+
+2. Set TTE.mcd bit on any TLB entries that correspond to the range of
+   addresses ADI is being enabled on. MMU checks the version tag only
+   on the pages that have TTE.mcd bit set.
+
+3. Set the version tag for virtual addresses using stxa instruction
+   and one of the MCD specific ASIs. Each stxa instruction sets the
+   given tag for one ADI block size number of bytes. This step must
+   be repeated for entire page to set tags for entire page.
+
+ADI block size for the platform is provided by the hypervisor to kernel
+in machine description tables. Hypervisor also provides the number of
+top bits in the virtual address that specify the version tag.  Once
+version tag has been set for a memory location, the tag is stored in the
+physical memory and the same tag must be present in the ADI version tag
+bits of the virtual address being presented to the MMU. For example on
+SPARC M7 processor, MMU uses bits 63-60 for version tags and ADI block
+size is same as cacheline size which is 64 bytes. A task that sets ADI
+version to, say 10, on a range of memory, must access that memory using
+virtual addresses that contain 0xa in bits 63-60.
+
+ADI is enabled on a set of pages using mprotect() with PROT_ADI flag.
+When ADI is enabled on a set of pages by a task for the first time,
+kernel sets the PSTATE.mcde bit for the task. Version tags for memory
+addresses are set with an stxa instruction on the addresses using
+ASI_MCD_PRIMARY or ASI_MCD_ST_BLKINIT_PRIMARY. ADI block size is
+provided by the hypervisor to the kernel.  Kernel returns the value of
+ADI block size to userspace using auxiliary vector along with other ADI
+info. Following auxiliary vectors are provided by the kernel:
+
+	============	===========================================
+	AT_ADI_BLKSZ	ADI block size. This is the granularity and
+			alignment, in bytes, of ADI versioning.
+	AT_ADI_NBITS	Number of ADI version bits in the VA
+	============	===========================================
+
+
+IMPORTANT NOTES
+===============
+
+- Version tag values of 0x0 and 0xf are reserved. These values match any
+  tag in virtual address and never generate a mismatch exception.
+
+- Version tags are set on virtual addresses from userspace even though
+  tags are stored in physical memory. Tags are set on a physical page
+  after it has been allocated to a task and a pte has been created for
+  it.
+
+- When a task frees a memory page it had set version tags on, the page
+  goes back to free page pool. When this page is re-allocated to a task,
+  kernel clears the page using block initialization ASI which clears the
+  version tags as well for the page. If a page allocated to a task is
+  freed and allocated back to the same task, old version tags set by the
+  task on that page will no longer be present.
+
+- ADI tag mismatches are not detected for non-faulting loads.
+
+- Kernel does not set any tags for user pages and it is entirely a
+  task's responsibility to set any version tags. Kernel does ensure the
+  version tags are preserved if a page is swapped out to the disk and
+  swapped back in. It also preserves that version tags if a page is
+  migrated.
+
+- ADI works for any size pages. A userspace task need not be aware of
+  page size when using ADI. It can simply select a virtual address
+  range, enable ADI on the range using mprotect() and set version tags
+  for the entire range. mprotect() ensures range is aligned to page size
+  and is a multiple of page size.
+
+- ADI tags can only be set on writable memory. For example, ADI tags can
+  not be set on read-only mappings.
+
+
+
+ADI related traps
+=================
+
+With ADI enabled, following new traps may occur:
+
+Disrupting memory corruption
+----------------------------
+
+	When a store accesses a memory location that has TTE.mcd=1,
+	the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
+	tag in the address used (bits 63:60) does not match the tag set on
+	the corresponding cacheline, a memory corruption trap occurs. By
+	default, it is a disrupting trap and is sent to the hypervisor
+	first. Hypervisor creates a sun4v error report and sends a
+	resumable error (TT=0x7e) trap to the kernel. The kernel sends
+	a SIGSEGV to the task that resulted in this trap with the following
+	info::
+
+		siginfo.si_signo = SIGSEGV;
+		siginfo.errno = 0;
+		siginfo.si_code = SEGV_ADIDERR;
+		siginfo.si_addr = addr; /* PC where first mismatch occurred */
+		siginfo.si_trapno = 0;
+
+
+Precise memory corruption
+-------------------------
+
+	When a store accesses a memory location that has TTE.mcd=1,
+	the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
+	tag in the address used (bits 63:60) does not match the tag set on
+	the corresponding cacheline, a memory corruption trap occurs. If
+	MCD precise exception is enabled (MCDPERR=1), a precise
+	exception is sent to the kernel with TT=0x1a. The kernel sends
+	a SIGSEGV to the task that resulted in this trap with the following
+	info::
+
+		siginfo.si_signo = SIGSEGV;
+		siginfo.errno = 0;
+		siginfo.si_code = SEGV_ADIPERR;
+		siginfo.si_addr = addr;	/* address that caused trap */
+		siginfo.si_trapno = 0;
+
+	NOTE:
+		ADI tag mismatch on a load always results in precise trap.
+
+
+MCD disabled
+------------
+
+	When a task has not enabled ADI and attempts to set ADI version
+	on a memory address, processor sends an MCD disabled trap. This
+	trap is handled by hypervisor first and the hypervisor vectors this
+	trap through to the kernel as Data Access Exception trap with
+	fault type set to 0xa (invalid ASI). When this occurs, the kernel
+	sends the task SIGSEGV signal with following info::
+
+		siginfo.si_signo = SIGSEGV;
+		siginfo.errno = 0;
+		siginfo.si_code = SEGV_ACCADI;
+		siginfo.si_addr = addr;	/* address that caused trap */
+		siginfo.si_trapno = 0;
+
+
+Sample program to use ADI
+-------------------------
+
+Following sample program is meant to illustrate how to use the ADI
+functionality::
+
+  #include <unistd.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <elf.h>
+  #include <sys/ipc.h>
+  #include <sys/shm.h>
+  #include <sys/mman.h>
+  #include <asm/asi.h>
+
+  #ifndef AT_ADI_BLKSZ
+  #define AT_ADI_BLKSZ	48
+  #endif
+  #ifndef AT_ADI_NBITS
+  #define AT_ADI_NBITS	49
+  #endif
+
+  #ifndef PROT_ADI
+  #define PROT_ADI	0x10
+  #endif
+
+  #define BUFFER_SIZE     32*1024*1024UL
+
+  main(int argc, char* argv[], char* envp[])
+  {
+          unsigned long i, mcde, adi_blksz, adi_nbits;
+          char *shmaddr, *tmp_addr, *end, *veraddr, *clraddr;
+          int shmid, version;
+	Elf64_auxv_t *auxv;
+
+	adi_blksz = 0;
+
+	while(*envp++ != NULL);
+	for (auxv = (Elf64_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++) {
+		switch (auxv->a_type) {
+		case AT_ADI_BLKSZ:
+			adi_blksz = auxv->a_un.a_val;
+			break;
+		case AT_ADI_NBITS:
+			adi_nbits = auxv->a_un.a_val;
+			break;
+		}
+	}
+	if (adi_blksz == 0) {
+		fprintf(stderr, "Oops! ADI is not supported\n");
+		exit(1);
+	}
+
+	printf("ADI capabilities:\n");
+	printf("\tBlock size = %ld\n", adi_blksz);
+	printf("\tNumber of bits = %ld\n", adi_nbits);
+
+          if ((shmid = shmget(2, BUFFER_SIZE,
+                                  IPC_CREAT | SHM_R | SHM_W)) < 0) {
+                  perror("shmget failed");
+                  exit(1);
+          }
+
+          shmaddr = shmat(shmid, NULL, 0);
+          if (shmaddr == (char *)-1) {
+                  perror("shm attach failed");
+                  shmctl(shmid, IPC_RMID, NULL);
+                  exit(1);
+          }
+
+	if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE|PROT_ADI)) {
+		perror("mprotect failed");
+		goto err_out;
+	}
+
+          /* Set the ADI version tag on the shm segment
+           */
+          version = 10;
+          tmp_addr = shmaddr;
+          end = shmaddr + BUFFER_SIZE;
+          while (tmp_addr < end) {
+                  asm volatile(
+                          "stxa %1, [%0]0x90\n\t"
+                          :
+                          : "r" (tmp_addr), "r" (version));
+                  tmp_addr += adi_blksz;
+          }
+	asm volatile("membar #Sync\n\t");
+
+          /* Create a versioned address from the normal address by placing
+	 * version tag in the upper adi_nbits bits
+           */
+          tmp_addr = (void *) ((unsigned long)shmaddr << adi_nbits);
+          tmp_addr = (void *) ((unsigned long)tmp_addr >> adi_nbits);
+          veraddr = (void *) (((unsigned long)version << (64-adi_nbits))
+                          | (unsigned long)tmp_addr);
+
+          printf("Starting the writes:\n");
+          for (i = 0; i < BUFFER_SIZE; i++) {
+                  veraddr[i] = (char)(i);
+                  if (!(i % (1024 * 1024)))
+                          printf(".");
+          }
+          printf("\n");
+
+          printf("Verifying data...");
+	fflush(stdout);
+          for (i = 0; i < BUFFER_SIZE; i++)
+                  if (veraddr[i] != (char)i)
+                          printf("\nIndex %lu mismatched\n", i);
+          printf("Done.\n");
+
+          /* Disable ADI and clean up
+           */
+	if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE)) {
+		perror("mprotect failed");
+		goto err_out;
+	}
+
+          if (shmdt((const void *)shmaddr) != 0)
+                  perror("Detach failure");
+          shmctl(shmid, IPC_RMID, NULL);
+
+          exit(0);
+
+  err_out:
+          if (shmdt((const void *)shmaddr) != 0)
+                  perror("Detach failure");
+          shmctl(shmid, IPC_RMID, NULL);
+          exit(1);
+  }
diff --git a/Documentation/arch/sparc/console.rst b/Documentation/arch/sparc/console.rst
new file mode 100644
index 000000000000..73132db83ece
--- /dev/null
+++ b/Documentation/arch/sparc/console.rst
@@ -0,0 +1,9 @@
+Steps for sending 'break' on sunhv console
+==========================================
+
+On Baremetal:
+   1. press   Esc + 'B'
+
+On LDOM:
+   1. press    Ctrl + ']'
+   2. telnet> send  break
diff --git a/Documentation/arch/sparc/features.rst b/Documentation/arch/sparc/features.rst
new file mode 100644
index 000000000000..c0c92468b0fe
--- /dev/null
+++ b/Documentation/arch/sparc/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features sparc
diff --git a/Documentation/arch/sparc/index.rst b/Documentation/arch/sparc/index.rst
new file mode 100644
index 000000000000..ae884224eec2
--- /dev/null
+++ b/Documentation/arch/sparc/index.rst
@@ -0,0 +1,13 @@
+==================
+Sparc Architecture
+==================
+
+.. toctree::
+   :maxdepth: 1
+
+   console
+   adi
+
+   oradax/oracle-dax
+
+   features
diff --git a/Documentation/arch/sparc/oradax/dax-hv-api.txt b/Documentation/arch/sparc/oradax/dax-hv-api.txt
new file mode 100644
index 000000000000..7ecd0bf4957b
--- /dev/null
+++ b/Documentation/arch/sparc/oradax/dax-hv-api.txt
@@ -0,0 +1,1433 @@
+Excerpt from UltraSPARC Virtual Machine Specification
+Compiled from version 3.0.20+15
+Publication date 2017-09-25 08:21
+Copyright © 2008, 2015 Oracle and/or its affiliates. All rights reserved.
+Extracted via "pdftotext -f 547 -l 572 -layout sun4v_20170925.pdf"
+Authors:
+	 Charles Kunzman
+	 Sam Glidden
+	 Mark Cianchetti
+
+
+Chapter 36. Coprocessor services
+        The following APIs provide access via the Hypervisor to hardware assisted data processing functionality.
+        These APIs may only be provided by certain platforms, and may not be available to all virtual machines
+        even on supported platforms. Restrictions on the use of these APIs may be imposed in order to support
+        live-migration and other system management activities.
+
+36.1. Data Analytics Accelerator
+        The Data Analytics Accelerator (DAX) functionality is a collection of hardware coprocessors that provide
+        high speed processoring of database-centric operations. The coprocessors may support one or more of
+        the following data query operations: search, extraction, compression, decompression, and translation. The
+        functionality offered may vary by virtual machine implementation.
+
+        The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual device
+        compatibility property. Functionality is accessed through the submission of Command Control Blocks
+        (CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the status
+        of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has a
+        separate Completion Area and, unless execution order is specifically restricted through the use of serial-
+        conditional flags, the execution order of submitted CCBs is arbitrary. Likewise, the time to completion
+        for a given CCB is never guaranteed.
+
+        Guest software may implement a software timeout on CCB operations, and if the timeout is exceeded, the
+        operation may be cancelled or killed via the ccb_kill API function. It is recommended for guest software
+        to implement a software timeout to account for certain RAS errors which may result in lost CCBs. It is
+        recommended such implementation use the ccb_info API function to check the status of a CCB prior to
+        killing it in order to determine if the CCB is still in queue, or may have been lost due to a RAS error.
+
+        There is no fixed limit on the number of outstanding CCBs guest software may have queued in the virtual
+        machine, however, internal resource limitations within the virtual machine can cause CCB submissions
+        to be temporarily rejected with EWOULDBLOCK. In such cases, guests should continue to attempt
+        submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would
+        not be a guarantee that a future submission would succeed.
+
+        The availablility of DAX coprocessor command service is indicated by the presence of the DAX virtual
+        device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device
+        node”).
+
+36.1.1. DAX Compatibility Property
+        The query functionality may vary based on the compatibility property of the virtual device:
+
+36.1.1.1. "ORCL,sun4v-dax" Device Compatibility
+        Available CCB commands:
+
+        • No-op/Sync
+
+        • Extract
+
+        • Scan Value
+
+        • Inverted Scan Value
+
+        • Scan Range
+
+
+                                                     509
+                                             Coprocessor services
+
+
+        • Inverted Scan Range
+
+        • Translate
+
+        • Inverted Translate
+
+        • Select
+
+        See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
+
+        Only version 0 CCBs are available.
+
+36.1.1.2. "ORCL,sun4v-dax-fc" Device Compatibility
+        "ORCL,sun4v-dax-fc" is compatible with the "ORCL,sun4v-dax" interface, and includes additional CCB
+        bit fields and controls.
+
+36.1.1.3. "ORCL,sun4v-dax2" Device Compatibility
+        Available CCB commands:
+
+        • No-op/Sync
+
+        • Extract
+
+        • Scan Value
+
+        • Inverted Scan Value
+
+        • Scan Range
+
+        • Inverted Scan Range
+
+        • Translate
+
+        • Inverted Translate
+
+        • Select
+
+        See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
+
+        Version 0 and 1 CCBs are available. Only version 0 CCBs may use Huffman encoded data, whereas only
+        version 1 CCBs may use OZIP.
+
+36.1.2. DAX Virtual Device Interrupts
+        The DAX virtual device has multiple interrupts associated with it which may be used by the guest if
+        desired. The number of device interrupts available to the guest is indicated in the virtual device node of the
+        guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). If the device
+        node indicates N interrupts available, the guest may use any value from 0 to N - 1 (inclusive) in a CCB
+        interrupt number field. Using values outside this range will result in the CCB being rejected for an invalid
+        field value.
+
+        The interrupts may be bound and managed using the standard sun4v device interrupts API (Chapter 16,
+        Device interrupt services). Sysino interrupts are not available for DAX devices.
+
+36.2. Coprocessor Control Block (CCB)
+        CCBs are either 64 or 128 bytes long, depending on the operation type. The exact contents of the CCB
+        are command specific, but all CCBs contain at least one memory buffer address. All memory locations
+
+
+                                                      510
+                                    Coprocessor services
+
+
+referenced by a CCB must be pinned in memory until the CCB either completes execution or is killed
+via the ccb_kill API call. Changes in virtual address mappings occurring after CCB submission are not
+guaranteed to be visible, and as such all virtual address updates need to be synchronized with CCB
+execution.
+
+All CCBs begin with a common 32-bit header.
+
+Table 36.1. CCB Header Format
+Bits          Field Description
+[31:28]       CCB version. For API version 2.0: set to 1 if CCB uses OZIP encoding; set to 0 if the CCB
+              uses Huffman encoding; otherwise either 0 or 1. For API version 1.0: always set to 0.
+[27]          When API version 2.0 is negotiated, this is the Pipeline Flag [512]. It is reserved in
+              API version 1.0
+[26]          Long CCB flag [512]
+[25]          Conditional synchronization flag [512]
+[24]          Serial synchronization flag
+[23:16]       CCB operation code:
+               0x00        No Operation (No-op) or Sync
+               0x01        Extract
+               0x02        Scan Value
+               0x12        Inverted Scan Value
+               0x03        Scan Range
+               0x13        Inverted Scan Range
+               0x04        Translate
+               0x14        Inverted Translate
+               0x05        Select
+[15:13]       Reserved
+[12:11]       Table address type
+               0b'00       No address
+               0b'01       Alternate context virtual address
+               0b'10       Real address
+               0b'11       Primary context virtual address
+[10:8]        Output/Destination address type
+               0b'000      No address
+               0b'001      Alternate context virtual address
+               0b'010      Real address
+               0b'011      Primary context virtual address
+               0b'100      Reserved
+               0b'101      Reserved
+               0b'110      Reserved
+               0b'111      Reserved
+[7:5]         Secondary source address type
+
+
+                                            511
+                                    Coprocessor services
+
+
+Bits           Field Description
+                0b'000       No address
+                0b'001       Alternate context virtual address
+                0b'010       Real address
+                0b'011       Primary context virtual address
+                0b'100       Reserved
+                0b'101       Reserved
+                0b'110       Reserved
+                0b'111       Reserved
+[4:2]          Primary source address type
+                0b'000       No address
+                0b'001       Alternate context virtual address
+                0b'010       Real address
+                0b'011       Primary context virtual address
+                0b'100       Reserved
+                0b'101       Reserved
+                0b'110       Reserved
+                0b'111       Reserved
+[1:0]          Completion area address type
+                0b'00        No address
+                0b'01        Alternate context virtual address
+                0b'10        Real address
+                0b'11        Primary context virtual address
+
+The Long CCB flag indicates whether the submitted CCB is 64 or 128 bytes long; value is 0 for 64 bytes
+and 1 for 128 bytes.
+
+The Serial and Conditional flags allow simple relative ordering between CCBs. Any CCB with the Serial
+flag set will execute sequentially relative to any previous CCB that is also marked as Serial in the same
+CCB submission. CCBs without the Serial flag set execute independently, even if they are between CCBs
+with the Serial flag set. CCBs marked solely with the Serial flag will execute upon the completion of the
+previous Serial CCB, regardless of the completion status of that CCB. The Conditional flag allows CCBs
+to conditionally execute based on the successful execution of the closest CCB marked with the Serial flag.
+A CCB may only be conditional on exactly one CCB, however, a CCB may be marked both Conditional
+and Serial to allow execution chaining. The flags do NOT allow fan-out chaining, where multiple CCBs
+execute in parallel based on the completion of another CCB.
+
+The Pipeline flag is an optimization that directs the output of one CCB (the "source" CCB) directly to
+the input of the next CCB (the "target" CCB). The target CCB thus does not need to read the input from
+memory. The Pipeline flag is advisory and may be dropped.
+
+Both the Pipeline and Serial bits must be set in the source CCB. The Conditional bit must be set in the
+target CCB. Exactly one CCB must be made conditional on the source CCB; either 0 or 2 target CCBs
+is invalid. However, Pipelines can be extended beyond two CCBs: the sequence would start with a CCB
+with both the Pipeline and Serial bits set, proceed through CCBs with the Pipeline, Serial, and Conditional
+bits set, and terminate at a CCB that has the Conditional bit set, but not the Pipeline bit.
+
+
+                                             512
+                                               Coprocessor services
+
+
+          The input of the target CCB must start within 64 bytes of the output of the source CCB or the pipeline flag
+          will be ignored. All CCBs in a pipeline must be submitted in the same call to ccb_submit.
+
+          The various address type fields indicate how the various address values used in the CCB should be
+          interpreted by the virtual machine. Not all of the types specified are used by every CCB format. Types
+          which are not applicable to the given CCB command should be indicated as type 0 (No address). Virtual
+          addresses used in the CCB must have translation entries present in either the TLB or a configured TSB
+          for the submitting virtual processor. Virtual addresses which cannot be translated by the virtual machine
+          will result in the CCB submission being rejected, with the causal virtual address indicated. The CCB
+          may be resubmitted after inserting the translation, or the address may be translated by guest software and
+          resubmitted using the real address translation.
+
+36.2.1. Query CCB Command Formats
+36.2.1.1. Supported Data Formats, Elements Sizes and Offsets
+          Data for query commands may be encoded in multiple possible formats. The data query commands use a
+          common set of values to indicate the encoding formats of the data being processed. Some encoding formats
+          require multiple data streams for processing, requiring the specification of both primary data formats (the
+          encoded data) and secondary data streams (meta-data for the encoded data).
+
+36.2.1.1.1. Primary Input Format
+
+          The primary input format code is a 4-bit field when it is used. There are 10 primary input formats available.
+          The packed formats are not endian neutral. Code values not listed below are reserved.
+
+          Code        Format                              Description
+          0x0         Fixed width byte packed             Up to 16 bytes
+          0x1         Fixed width bit packed              Up to 15 bits (CCB version 0) or 23 bits (CCB version
+                                                          1); bits are read most significant bit to least significant bit
+                                                          within a byte
+          0x2         Variable width byte packed          Data stream of lengths must be provided as a secondary
+                                                          input
+          0x4         Fixed width byte packed with run Up to 16 bytes; data stream of run lengths must be
+                      length encoding                  provided as a secondary input
+          0x5         Fixed width bit packed with run Up to 15 bits (CCB version 0) or 23 bits (CCB version
+                      length encoding                 1); bits are read most significant bit to least significant bit
+                                                      within a byte; data stream of run lengths must be provided
+                                                      as a secondary input
+          0x8         Fixed width byte packed with Up to 16 bytes before the encoding; compressed stream
+                      Huffman (CCB version 0) or bits are read most significant bit to least significant bit
+                      OZIP (CCB version 1) encoding within a byte; pointer to the encoding table must be
+                                                    provided
+          0x9         Fixed width bit packed with Up to 15 bits (CCB version 0) or 23 bits (CCB version
+                      Huffman (CCB version 0) or 1); compressed stream bits are read most significant bit to
+                      OZIP (CCB version 1) encoding least significant bit within a byte; pointer to the encoding
+                                                    table must be provided
+          0xA         Variable width byte packed with Up to 16 bytes before the encoding; compressed stream
+                      Huffman (CCB version 0) or bits are read most significant bit to least significant bit
+                      OZIP (CCB version 1) encoding within a byte; data stream of lengths must be provided as
+                                                      a secondary input; pointer to the encoding table must be
+                                                      provided
+
+
+                                                        513
+                                               Coprocessor services
+
+
+          Code        Format                              Description
+          0xC         Fixed width byte packed with        Up to 16 bytes before the encoding; compressed stream
+                      run length encoding, followed by    bits are read most significant bit to least significant bit
+                      Huffman (CCB version 0) or          within a byte; data stream of run lengths must be provided
+                      OZIP (CCB version 1) encoding       as a secondary input; pointer to the encoding table must
+                                                          be provided
+          0xD         Fixed width bit packed with         Up to 15 bits (CCB version 0) or 23 bits(CCB version 1)
+                      run length encoding, followed by    before the encoding; compressed stream bits are read most
+                      Huffman (CCB version 0) or          significant bit to least significant bit within a byte; data
+                      OZIP (CCB version 1) encoding       stream of run lengths must be provided as a secondary
+                                                          input; pointer to the encoding table must be provided
+
+          If OZIP encoding is used, there must be no reserved bytes in the table.
+
+36.2.1.1.2. Primary Input Element Size
+
+          For primary input data streams with fixed size elements, the element size must be indicated in the CCB
+          command. The size is encoded as the number of bits or bytes, minus one. The valid value range for this
+          field depends on the input format selected, as listed in the table above.
+
+36.2.1.1.3. Secondary Input Format
+
+          For primary input data streams which require a secondary input stream, the secondary input stream is
+          always encoded in a fixed width, bit-packed format. The bits are read from most significant bit to least
+          significant bit within a byte. There are two encoding options for the secondary input stream data elements,
+          depending on whether the value of 0 is needed:
+
+          Secondary           Input Description
+          Format Code
+          0                          Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates
+                                     to 2, etc)
+          1                          Element is stored as value
+
+36.2.1.1.4. Secondary Input Element Size
+
+          Secondary input element size is encoded as a two bit field:
+
+          Secondary Input Size Description
+          Code
+          0x0                        1 bit
+          0x1                        2 bits
+          0x2                        4 bits
+          0x3                        8 bits
+
+36.2.1.1.5. Input Element Offsets
+
+          Bit-wise input data streams may have any alignment within the base addressed byte. The offset, specified
+          from most significant bit to least significant bit, is provided as a fixed 3 bit field for each input type. A
+          value of 0 indicates that the first input element begins at the most significant bit in the first byte, and a
+          value of 7 indicates it begins with the least significant bit.
+
+          This field should be zero for any byte-wise primary input data streams.
+
+
+                                                        514
+                                              Coprocessor services
+
+
+36.2.1.1.6. Output Format
+
+          Query commands support multiple sizes and encodings for output data streams. There are four possible
+          output encodings, and up to four supported element sizes per encoding. Not all output encodings are
+          supported for every command. The format is indicated by a 4-bit field in the CCB:
+
+           Output Format Code        Description
+           0x0                       Byte aligned, 1 byte elements
+           0x1                       Byte aligned, 2 byte elements
+           0x2                       Byte aligned, 4 byte elements
+           0x3                       Byte aligned, 8 byte elements
+           0x4                       16 byte aligned, 16 byte elements
+           0x5                       Reserved
+           0x6                       Reserved
+           0x7                       Reserved
+           0x8                       Packed vector of single bit elements
+           0x9                       Reserved
+           0xA                       Reserved
+           0xB                       Reserved
+           0xC                       Reserved
+           0xD                       2 byte elements where each element is the index value of a bit,
+                                     from an bit vector, which was 1.
+           0xE                       4 byte elements where each element is the index value of a bit,
+                                     from an bit vector, which was 1.
+           0xF                       Reserved
+
+36.2.1.1.7. Application Data Integrity (ADI)
+
+          On platforms which support ADI, the ADI version number may be specified for each separate memory
+          access type used in the CCB command. ADI checking only occurs when reading data. When writing data,
+          the specified ADI version number overwrites any existing ADI value in memory.
+
+          An ADI version value of 0 or 0xF indicates the ADI checking is disabled for that data access, even if it is
+          enabled in memory. By setting the appropriate flag in CCB_SUBMIT (Section 36.3.1, “ccb_submit”) it is
+          also an option to disable ADI checking for all inputs accessed via virtual address for all CCBs submitted
+          during that hypercall invocation.
+
+          The ADI value is only guaranteed to be checked on the first 64 bytes of each data access. Mismatches on
+          subsequent data chunks may not be detected, so guest software should be careful to use page size checking
+          to protect against buffer overruns.
+
+36.2.1.1.8. Page size checking
+
+          All data accesses used in CCB commands must be bounded within a single memory page. When addresses
+          are provided using a virtual address, the page size for checking is extracted from the TTE for that virtual
+          address. When using real addresses, the guest must supply the page size in the same field as the address
+          value. The page size must be one of the sizes supported by the underlying virtual machine. Using a value
+          that is not supported may result in the CCB submission being rejected or the generation of a CCB parsing
+          error in the completion area.
+
+
+                                                       515
+                                               Coprocessor services
+
+
+36.2.1.2. Extract command
+
+        Converts an input vector in one format to an output vector in another format. All input format types are
+        supported.
+
+        The only supported output format is a padded, byte-aligned output stream, using output codes 0x0 - 0x4.
+        When the specified output element size is larger than the extracted input element size, zeros are padded to
+        the extracted input element. First, if the decompressed input size is not a whole number of bytes, 0 bits are
+        padded to the most significant bit side till the next byte boundary. Next, if the output element size is larger
+        than the byte padded input element, bytes of value 0 are added based on the Padding Direction bit in the
+        CCB. If the output element size is smaller than the byte-padded input element size, the input element is
+        truncated by dropped from the least significant byte side until the selected output size is reached.
+
+        The return value of the CCB completion area is invalid. The “number of elements processed” field in the
+        CCB completion area will be valid.
+
+        The extract CCB is a 64-byte “short format” CCB.
+
+        The extract CCB command format can be specified by the following packed C structure for a big-endian
+        machine:
+
+
+                  struct extract_ccb {
+                         uint32_t header;
+                         uint32_t control;
+                         uint64_t completion;
+                         uint64_t primary_input;
+                         uint64_t data_access_control;
+                         uint64_t secondary_input;
+                         uint64_t reserved;
+                         uint64_t output;
+                         uint64_t table;
+                  };
+
+
+        The exact field offsets, sizes, and composition are as follows:
+
+         Offset         Size            Field Description
+         0              4               CCB header (Table 36.1, “CCB Header Format”)
+         4              4               Command control
+                                        Bits        Field Description
+                                        [31:28]     Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                                                    Format”)
+                                        [27:23]     Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                                                    Input Element Size”)
+                                        [22:20]     Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                                    Element Offsets”)
+                                        [19]        Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                                                    Input Format”)
+                                        [18:16]     Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                                    Element Offsets”)
+
+
+                                                       516
+                        Coprocessor services
+
+
+Offset   Size   Field Description
+                Bits         Field Description
+                [15:14]      Secondary Input Element Size (see Section 36.2.1.1.4,
+                             “Secondary Input Element Size”
+                [13:10]      Output Format (see Section 36.2.1.1.6, “Output Format”)
+                [9]          Padding Direction selector: A value of 1 causes padding bytes
+                             to be added to the left side of output elements. A value of 0
+                             causes padding bytes to be added to the right side of output
+                             elements.
+                [8:0]        Reserved
+8        8      Completion
+                Bits         Field Description
+                [63:60]      ADI version (see Section 36.2.1.1.7, “Application Data
+                             Integrity (ADI)”)
+                [59]         If set to 1, a virtual device interrupt will be generated using
+                             the device interrupt number specified in the lower bits of this
+                             completion word. If 0, the lower bits of this completion word
+                             are ignored.
+                [58:6]       Completion area address bits [58:6]. Address type is
+                             determined by CCB header.
+                [5:0]        Virtual device interrupt number for completion interrupt, if
+                             enabled.
+16       8      Primary Input
+                Bits         Field Description
+                [63:60]      ADI version (see Section 36.2.1.1.7, “Application Data
+                             Integrity (ADI)”)
+                [59:56]      If using real address, these bits should be filled in with the
+                             page size code for the page boundary checking the guest wants
+                             the virtual machine to use when accessing this data stream
+                             (checking is only guaranteed to be performed when using API
+                             version 1.1 and later). If using a virtual address, this field will
+                             be used as as primary input address bits [59:56].
+                [55:0]       Primary input address bits [55:0]. Address type is determined
+                             by CCB header.
+24       8      Data Access Control
+                Bits         Field Description
+                [63:62]      Flow Control
+                             Value      Description
+                             0b'00      Disable flow control
+                             0b'01      Enable flow control (only valid with "ORCL,sun4v-
+                                        dax-fc" compatible virtual device variants)
+                             0b'10      Reserved
+                             0b'11      Reserved
+                [61:60]      Reserved (API 1.0)
+
+
+                                517
+                       Coprocessor services
+
+
+Offset   Size   Field Description
+                Bits        Field Description
+                            Pipeline target (API 2.0)
+                            Value      Description
+                            0b'00      Connect to primary input
+                            0b'01      Connect to secondary input
+                            0b'10      Reserved
+                            0b'11      Reserved
+                [59:40]     Output buffer size given in units of 64 bytes, minus 1. Value of
+                            0 means 64 bytes, value of 1 means 128 bytes, etc. Buffer size is
+                            only enforced if flow control is enabled in Flow Control field.
+                [39:32]     Reserved
+                [31:30]     Output Data Cache Allocation
+                            Value      Description
+                            0b'00      Do not allocate cache lines for output data stream.
+                            0b'01      Force cache lines for output data stream to be
+                                       allocated in the cache that is local to the submitting
+                                       virtual cpu.
+                            0b'10      Allocate cache lines for output data stream, but allow
+                                       existing cache lines associated with the data to remain
+                                       in their current cache instance. Any memory not
+                                       already in cache will be allocated in the cache local
+                                       to the submitting virtual cpu.
+                            0b'11      Reserved
+                [29:26]     Reserved
+                [25:24]     Primary Input Length Format
+                            Value      Description
+                            0b'00      Number of primary symbols
+                            0b'01      Number of primary bytes
+                            0b'10      Number of primary bits
+                            0b'11      Reserved
+                [23:0]      Primary Input Length
+                            Format                      Field Value
+                            # of primary symbols        Number of input elements to process,
+                                                        minus 1. Command execution stops
+                                                        once count is reached.
+                            # of primary bytes          Number of input bytes to process,
+                                                        minus 1. Command execution stops
+                                                        once count is reached. The count is
+                                                        done before any decompression or
+                                                        decoding.
+                            # of primary bits           Number of input bits to process,
+                                                        minus 1. Command execution stops
+
+
+
+                               518
+                                                Coprocessor services
+
+
+        Offset          Size           Field Description
+                                        Bits         Field Description
+                                                     Format                     Field Value
+                                                                                once count is reached. The count is
+                                                                                done before any decompression or
+                                                                                decoding, and does not include any
+                                                                                bits skipped by the Primary Input
+                                                                                Offset field value of the command
+                                                                                control word.
+        32              8              Secondary Input, if used by Primary Input Format. Same fields as Primary
+                                       Input.
+        40              8              Reserved
+        48              8              Output (same fields as Primary Input)
+        56              8              Symbol Table (if used by Primary Input)
+                                        Bits         Field Description
+                                        [63:60]      ADI version (see Section 36.2.1.1.7, “Application Data
+                                                     Integrity (ADI)”)
+                                        [59:56]      If using real address, these bits should be filled in with the
+                                                     page size code for the page boundary checking the guest wants
+                                                     the virtual machine to use when accessing this data stream
+                                                     (checking is only guaranteed to be performed when using API
+                                                     version 1.1 and later). If using a virtual address, this field will
+                                                     be used as as symbol table address bits [59:56].
+                                        [55:4]       Symbol table address bits [55:4]. Address type is determined
+                                                     by CCB header.
+                                        [3:0]        Symbol table version
+                                                     Value     Description
+                                                     0         Huffman encoding. Must use 64 byte aligned table
+                                                               address. (Only available when using version 0 CCBs)
+                                                     1         OZIP encoding. Must use 16 byte aligned table
+                                                               address. (Only available when using version 1 CCBs)
+
+
+36.2.1.3. Scan commands
+
+        The scan commands search a stream of input data elements for values which match the selection criteria.
+        All the input format types are supported. There are multiple formats for the scan commands, allowing the
+        scan to search for exact matches to one value, exact matches to either of two values, or any value within
+        a specified range. The specific type of scan is indicated by the command code in the CCB header. For the
+        scan range commands, the boundary conditions can be specified as greater-than-or-equal-to a value, less-
+        than-or-equal-to a value, or both by using two boundary values.
+
+        There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
+        0xD, and 0xE). For the standard scan command using the bit vector output, for each input element there
+        exists one bit in the vector that is set if the input element matched the scan criteria, or clear if not. The
+        inverted scan command inverts the polarity of the bits in the output. The most significant bit of the first
+        byte of the output stream corresponds to the first element in the input stream. The standard index array
+        output format contains one array entry for each input element that matched the scan criteria. Each array
+
+
+
+                                                         519
+                                       Coprocessor services
+
+
+entry is the index of an input element that matched the scan criteria. An inverted scan command produces
+a similar array, but of all the input elements which did NOT match the scan criteria.
+
+The return value of the CCB completion area contains the number of input elements found which match
+the scan criteria (or number that did not match for the inverted scans). The “number of elements processed”
+field in the CCB completion area will be valid, indicating the number of input elements processed.
+
+These commands are 128-byte “long format” CCBs.
+
+The scan CCB command format can be specified by the following packed C structure for a big-endian
+machine:
+
+
+         struct scan_ccb         {
+                uint32_t         header;
+                uint32_t         control;
+                uint64_t         completion;
+                uint64_t         primary_input;
+                uint64_t         data_access_control;
+                uint64_t         secondary_input;
+                uint64_t         match_criteria0;
+                uint64_t         output;
+                uint64_t         table;
+                uint64_t         match_criteria1;
+                uint64_t         match_criteria2;
+                uint64_t         match_criteria3;
+                uint64_t         reserved[5];
+         };
+
+
+The exact field offsets, sizes, and composition are as follows:
+
+Offset         Size            Field Description
+0              4               CCB header (Table 36.1, “CCB Header Format”)
+4              4               Command control
+                               Bits         Field Description
+                               [31:28]      Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                                            Format”)
+                               [27:23]      Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                                            Input Element Size”)
+                               [22:20]      Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                            Element Offsets”)
+                               [19]         Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                                            Input Format”)
+                               [18:16]      Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                            Element Offsets”)
+                               [15:14]      Secondary Input Element Size (see Section 36.2.1.1.4,
+                                            “Secondary Input Element Size”
+                               [13:10]      Output Format (see Section 36.2.1.1.6, “Output Format”)
+                               [9:5]        Operand size for first scan criteria value. In a scan value
+                                            operation, this is one of two potential exact match values.
+                                            In a scan range operation, this is the size of the upper range
+
+
+                                               520
+                        Coprocessor services
+
+
+Offset   Size   Field Description
+                Bits         Field Description
+                             boundary. The value of this field is the number of bytes in the
+                             operand, minus 1. Values 0xF-0x1E are reserved. A value of
+                             0x1F indicates this operand is not in use for this scan operation.
+                [4:0]        Operand size for second scan criteria value. In a scan value
+                             operation, this is one of two potential exact match values.
+                             In a scan range operation, this is the size of the lower range
+                             boundary. The value of this field is the number of bytes in the
+                             operand, minus 1. Values 0xF-0x1E are reserved. A value of
+                             0x1F indicates this operand is not in use for this scan operation.
+8        8      Completion (same fields as Section 36.2.1.2, “Extract command”)
+16       8      Primary Input (same fields as Section 36.2.1.2, “Extract command”)
+24       8      Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
+32       8      Secondary Input, if used by Primary Input Format. Same fields as Primary
+                Input.
+40       4      Most significant 4 bytes of first scan criteria operand. If first operand is less
+                than 4 bytes, the value is left-aligned to the lowest address bytes.
+44       4      Most significant 4 bytes of second scan criteria operand. If second operand
+                is less than 4 bytes, the value is left-aligned to the lowest address bytes.
+48       8      Output (same fields as Primary Input)
+56       8      Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
+                “Extract command”
+64       4      Next 4 most significant bytes of first scan criteria operand occurring after the
+                bytes specified at offset 40, if needed by the operand size. If first operand
+                is less than 8 bytes, the valid bytes are left-aligned to the lowest address.
+68       4      Next 4 most significant bytes of second scan criteria operand occurring after
+                the bytes specified at offset 44, if needed by the operand size. If second
+                operand is less than 8 bytes, the valid bytes are left-aligned to the lowest
+                address.
+72       4      Next 4 most significant bytes of first scan criteria operand occurring after the
+                bytes specified at offset 64, if needed by the operand size. If first operand
+                is less than 12 bytes, the valid bytes are left-aligned to the lowest address.
+76       4      Next 4 most significant bytes of second scan criteria operand occurring after
+                the bytes specified at offset 68, if needed by the operand size. If second
+                operand is less than 12 bytes, the valid bytes are left-aligned to the lowest
+                address.
+80       4      Next 4 most significant bytes of first scan criteria operand occurring after the
+                bytes specified at offset 72, if needed by the operand size. If first operand
+                is less than 16 bytes, the valid bytes are left-aligned to the lowest address.
+84       4      Next 4 most significant bytes of second scan criteria operand occurring after
+                the bytes specified at offset 76, if needed by the operand size. If second
+                operand is less than 16 bytes, the valid bytes are left-aligned to the lowest
+                address.
+
+
+
+
+                                521
+                                               Coprocessor services
+
+
+36.2.1.4. Translate commands
+
+        The translate commands takes an input array of indices, and a table of single bit values indexed by those
+        indices, and outputs a bit vector or index array created by reading the tables bit value at each index in
+        the input array. The output should therefore contain exactly one bit per index in the input data stream,
+        when outputting as a bit vector. When outputting as an index array, the number of elements depends on the
+        values read in the bit table, but will always be less than, or equal to, the number of input elements. Only
+        a restricted subset of the possible input format types are supported. No variable width or Huffman/OZIP
+        encoded input streams are allowed. The primary input data element size must be 3 bytes or less.
+
+        The maximum table index size allowed is 15 bits, however, larger input elements may be used to provide
+        additional processing of the output values. If 2 or 3 byte values are used, the least significant 15 bits are
+        used as an index into the bit table. The most significant 9 bits (when using 3-byte input elements) or single
+        bit (when using 2-byte input elements) are compared against a fixed 9-bit test value provided in the CCB.
+        If the values match, the value from the bit table is used as the output element value. If the values do not
+        match, the output data element value is forced to 0.
+
+        In the inverted translate operation, the bit value read from bit table is inverted prior to its use. The additional
+        additional processing based on any additional non-index bits remains unchanged, and still forces the output
+        element value to 0 on a mismatch. The specific type of translate command is indicated by the command
+        code in the CCB header.
+
+        There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
+        0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the
+        output format was a bit array.
+
+        The return value of the CCB completion area contains the number of bits set in the output bit vector,
+        or number of elements in the output index array. The “number of elements processed” field in the CCB
+        completion area will be valid, indicating the number of input elements processed.
+
+        These commands are 64-byte “short format” CCBs.
+
+        The translate CCB command format can be specified by the following packed C structure for a big-endian
+        machine:
+
+
+                 struct translate_ccb {
+                        uint32_t header;
+                        uint32_t control;
+                        uint64_t completion;
+                        uint64_t primary_input;
+                        uint64_t data_access_control;
+                        uint64_t secondary_input;
+                        uint64_t reserved;
+                        uint64_t output;
+                        uint64_t table;
+                 };
+
+
+        The exact field offsets, sizes, and composition are as follows:
+
+
+        Offset          Size             Field Description
+        0               4                CCB header (Table 36.1, “CCB Header Format”)
+
+
+                                                        522
+                        Coprocessor services
+
+
+Offset   Size   Field Description
+4        4      Command control
+                Bits         Field Description
+                [31:28]      Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                             Format”)
+                [27:23]      Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                             Input Element Size”)
+                [22:20]      Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                             Element Offsets”)
+                [19]         Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                             Input Format”)
+                [18:16]      Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                             Element Offsets”)
+                [15:14]      Secondary Input Element Size (see Section 36.2.1.1.4,
+                             “Secondary Input Element Size”
+                [13:10]      Output Format (see Section 36.2.1.1.6, “Output Format”)
+                [9]          Reserved
+                [8:0]        Test value used for comparison against the most significant bits
+                             in the input values, when using 2 or 3 byte input elements.
+8        8      Completion (same fields as Section 36.2.1.2, “Extract command”
+16       8      Primary Input (same fields as Section 36.2.1.2, “Extract command”
+24       8      Data Access Control (same fields as Section 36.2.1.2, “Extract command”,
+                except Primary Input Length Format may not use the 0x0 value)
+32       8      Secondary Input, if used by Primary Input Format. Same fields as Primary
+                Input.
+40       8      Reserved
+48       8      Output (same fields as Primary Input)
+56       8      Bit Table
+                Bits         Field Description
+                [63:60]      ADI version (see Section 36.2.1.1.7, “Application Data
+                             Integrity (ADI)”)
+                [59:56]      If using real address, these bits should be filled in with the
+                             page size code for the page boundary checking the guest wants
+                             the virtual machine to use when accessing this data stream
+                             (checking is only guaranteed to be performed when using API
+                             version 1.1 and later). If using a virtual address, this field will
+                             be used as as bit table address bits [59:56]
+                [55:4]       Bit table address bits [55:4]. Address type is determined by
+                             CCB header. Address must be 64-byte aligned (CCB version
+                             0) or 16-byte aligned (CCB version 1).
+                [3:0]        Bit table version
+                             Value      Description
+                             0          4KB table size
+                             1          8KB table size
+
+
+
+                                 523
+                                              Coprocessor services
+
+
+36.2.1.5. Select command
+        The select command filters the primary input data stream by using a secondary input bit vector to determine
+        which input elements to include in the output. For each bit set at a given index N within the bit vector,
+        the Nth input element is included in the output. If the bit is not set, the element is not included. Only a
+        restricted subset of the possible input format types are supported. No variable width or run length encoded
+        input streams are allowed, since the secondary input stream is used for the filtering bit vector.
+
+        The only supported output format is a padded, byte-aligned output stream. The stream follows the same
+        rules and restrictions as padded output stream described in Section 36.2.1.2, “Extract command”.
+
+        The return value of the CCB completion area contains the number of bits set in the input bit vector. The
+        "number of elements processed" field in the CCB completion area will be valid, indicating the number
+        of input elements processed.
+
+        The select CCB is a 64-byte “short format” CCB.
+
+        The select CCB command format can be specified by the following packed C structure for a big-endian
+        machine:
+
+
+                  struct select_ccb {
+                         uint32_t header;
+                         uint32_t control;
+                         uint64_t completion;
+                         uint64_t primary_input;
+                         uint64_t data_access_control;
+                         uint64_t secondary_input;
+                         uint64_t reserved;
+                         uint64_t output;
+                         uint64_t table;
+                  };
+
+
+        The exact field offsets, sizes, and composition are as follows:
+
+         Offset        Size            Field Description
+         0             4               CCB header (Table 36.1, “CCB Header Format”)
+         4             4               Command control
+                                       Bits        Field Description
+                                       [31:28]     Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                                                   Format”)
+                                       [27:23]     Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                                                   Input Element Size”)
+                                       [22:20]     Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                                   Element Offsets”)
+                                       [19]        Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                                                   Input Format”)
+                                       [18:16]     Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                                                   Element Offsets”)
+                                       [15:14]     Secondary Input Element Size (see Section 36.2.1.1.4,
+                                                   “Secondary Input Element Size”
+
+
+                                                      524
+                                               Coprocessor services
+
+
+        Offset         Size            Field Description
+                                       Bits         Field Description
+                                       [13:10]      Output Format (see Section 36.2.1.1.6, “Output Format”)
+                                       [9]          Padding Direction selector: A value of 1 causes padding bytes
+                                                    to be added to the left side of output elements. A value of 0
+                                                    causes padding bytes to be added to the right side of output
+                                                    elements.
+                                       [8:0]        Reserved
+        8              8               Completion (same fields as Section 36.2.1.2, “Extract command”
+        16             8               Primary Input (same fields as Section 36.2.1.2, “Extract command”
+        24             8               Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
+        32             8               Secondary Bit Vector Input. Same fields as Primary Input.
+        40             8               Reserved
+        48             8               Output (same fields as Primary Input)
+        56             8               Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
+                                       “Extract command”
+
+36.2.1.6. No-op and Sync commands
+        The no-op (no operation) command is a CCB which has no processing effect. The CCB, when processed
+        by the virtual machine, simply updates the completion area with its execution status. The CCB may have
+        the serial-conditional flags set in order to restrict when it executes.
+
+        The sync command is a variant of the no-op command which with restricted execution timing. A sync
+        command CCB will only execute when all previous commands submitted in the same request have
+        completed. This is stronger than the conditional flag sequencing, which is only dependent on a single
+        previous serial CCB. While the relative ordering is guaranteed, virtual machine implementations with
+        shared hardware resources may cause the sync command to wait for longer than the minimum required
+        time.
+
+        The return value of the CCB completion area is invalid for these CCBs. The “number of elements
+        processed” field is also invalid for these CCBs.
+
+        These commands are 64-byte “short format” CCBs.
+
+        The no-op CCB command format can be specified by the following packed C structure for a big-endian
+        machine:
+
+
+                 struct nop_ccb {
+                        uint32_t header;
+                        uint32_t control;
+                        uint64_t completion;
+                        uint64_t reserved[6];
+                 };
+
+
+        The exact field offsets, sizes, and composition are as follows:
+
+        Offset         Size            Field Description
+        0              4               CCB header (Table 36.1, “CCB Header Format”)
+
+
+                                                       525
+                                          Coprocessor services
+
+
+       Offset        Size          Field Description
+       4             4             Command control
+                                   Bits        Field Description
+                                   [31]        If set, this CCB functions as a Sync command. If clear, this
+                                               CCB functions as a No-op command.
+                                   [30:0]      Reserved
+       8             8             Completion (same fields as Section 36.2.1.2, “Extract command”
+       16            46            Reserved
+
+36.2.2. CCB Completion Area
+       All CCB commands use a common 128-byte Completion Area format, which can be specified by the
+       following packed C structure for a big-endian machine:
+
+
+                struct completion_area {
+                       uint8_t status_flag;
+                       uint8_t error_note;
+                       uint8_t rsvd0[2];
+                       uint32_t error_values;
+                       uint32_t output_size;
+                       uint32_t rsvd1;
+                       uint64_t run_time;
+                       uint64_t run_stats;
+                       uint32_t elements;
+                       uint8_t rsvd2[20];
+                       uint64_t return_value;
+                       uint64_t extra_return_value[8];
+                };
+
+
+       The Completion Area must be a 128-byte aligned memory location. The exact layout can be described
+       using byte offsets and sizes relative to the memory base:
+
+       Offset        Size          Field Description
+       0             1             CCB execution status
+                                   0x0                  Command not yet completed
+                                   0x1                  Command ran and succeeded
+                                   0x2                  Command ran and failed (partial results may be been
+                                                        produced)
+                                   0x3                  Command ran and was killed (partial execution may
+                                                        have occurred)
+                                   0x4                  Command was not run
+                                   0x5-0xF              Reserved
+       1             1             Error reason code
+                                   0x0                  Reserved
+                                   0x1                  Buffer overflow
+
+
+                                                  526
+                                      Coprocessor services
+
+
+Offset          Size           Field Description
+                                0x2                 CCB decoding error
+                                0x3                 Page overflow
+                                0x4-0x6             Reserved
+                                0x7                 Command was killed
+                                0x8                 Command execution timeout
+                                0x9                 ADI miscompare error
+                                0xA                 Data format error
+                                0xB-0xD             Reserved
+                                0xE                 Unexpected hardware error (Do not retry)
+                                0xF                 Unexpected hardware error (Retry is ok)
+                                0x10-0x7F           Reserved
+                                0x80                Partial Symbol Warning
+                                0x81-0xFF           Reserved
+2               2              Reserved
+4               4              If a partial symbol warning was generated, this field contains the number
+                               of remaining bits which were not decoded.
+8               4              Number of bytes of output produced
+12              4              Reserved
+16              8              Runtime of command (unspecified time units)
+24              8              Reserved
+32              4              Number of elements processed
+36              20             Reserved
+56              8              Return value
+64              64             Extended return value
+
+The CCB completion area should be treated as read-only by guest software. The CCB execution status
+byte will be cleared by the Hypervisor to reflect the pending execution status when the CCB is submitted
+successfully. All other fields are considered invalid upon CCB submission until the CCB execution status
+byte becomes non-zero.
+
+CCBs which complete with status 0x2 or 0x3 may produce partial results and/or side effects due to partial
+execution of the CCB command. Some valid data may be accessible depending on the fault type, however,
+it is recommended that guest software treat the destination buffer as being in an unknown state. If a CCB
+completes with a status byte of 0x2, the error reason code byte can be read to determine what corrective
+action should be taken.
+
+A buffer overflow indicates that the results of the operation exceeded the size of the output buffer indicated
+in the CCB. The operation can be retried by resubmitting the CCB with a larger output buffer.
+
+A CCB decoding error indicates that the CCB contained some invalid field values. It may be also be
+triggered if the CCB output is directed at a non-existent secondary input and the pipelining hint is followed.
+
+A page overflow error indicates that the operation required accessing a memory location beyond the page
+size associated with a given address. No data will have been read or written past the page boundary, but
+partial results may have been written to the destination buffer. The CCB can be resubmitted with a larger
+page size memory allocation to complete the operation.
+
+
+                                              527
+                                            Coprocessor services
+
+
+       In the case of pipelined CCBs, a page overflow error will be triggered if the output from the pipeline source
+       CCB ends before the input of the pipeline target CCB. Page boundaries are ignored when the pipeline
+       hint is followed.
+
+       Command kill indicates that the CCB execution was halted or prevented by use of the ccb_kill API call.
+
+       Command timeout indicates that the CCB execution began, but did not complete within a pre-determined
+       limit set by the virtual machine. The command may have produced some or no output. The CCB may be
+       resubmitted with no alterations.
+
+       ADI miscompare indicates that the memory buffer version specified in the CCB did not match the value
+       in memory when accessed by the virtual machine. Guest software should not attempt to resubmit the CCB
+       without determining the cause of the version mismatch.
+
+       A data format error indicates that the input data stream did not follow the specified data input formatting
+       selected in the CCB.
+
+       Some CCBs which encounter hardware errors may be resubmitted without change. Persistent hardware
+       errors may result in multiple failures until RAS software can identify and isolate the faulty component.
+
+       The output size field indicates the number of bytes of valid output in the destination buffer. This field is
+       not valid for all possible CCB commands.
+
+       The runtime field indicates the execution time of the CCB command once it leaves the internal virtual
+       machine queue. The time units are fixed, but unspecified, allowing only relative timing comparisons
+       by guest software. The time units may also vary by hardware platform, and should not be construed to
+       represent any absolute time value.
+
+       Some data query commands process data in units of elements. If applicable to the command, the number of
+       elements processed is indicated in the listed field. This field is not valid for all possible CCB commands.
+
+       The return value and extended return value fields are output locations for commands which do not use
+       a destination output buffer, or have secondary return results. The field is not valid for all possible CCB
+       commands.
+
+36.3. Hypervisor API Functions
+36.3.1. ccb_submit
+       trap#             FAST_TRAP
+       function#         CCB_SUBMIT
+       arg0              address
+       arg1              length
+       arg2              flags
+       arg3              reserved
+       ret0              status
+       ret1              length
+       ret2              status data
+       ret3              reserved
+
+       Submit one or more coprocessor control blocks (CCBs) for evaluation and processing by the virtual
+       machine. The CCBs are passed in a linear array indicated by address. length indicates the size of
+       the array in bytes.
+
+
+                                                     528
+                                      Coprocessor services
+
+
+The address should be aligned to the size indicated by length, rounded up to the nearest power of
+two. Virtual machines implementations may reject submissions which do not adhere to that alignment.
+length must be a multiple of 64 bytes. If length is zero, the maximum supported array length will be
+returned as length in ret1. In all other cases, the length value in ret1 will reflect the number of bytes
+successfully consumed from the input CCB array.
+
+      Implementation note
+      Virtual machines should never reject submissions based on the alignment of address if the
+      entire array is contained within a single memory page of the smallest page size supported by the
+      virtual machine.
+
+A guest may choose to submit addresses used in this API function, including the CCB array address,
+as either a real or virtual addresses, with the type of each address indicated in flags. Virtual addresses
+must be present in either the TLB or an active TSB to be processed. The translation context for virtual
+addresses is determined by a combination of CCB contents and the flags argument.
+
+The flags argument is divided into multiple fields defined as follows:
+
+
+Bits            Field Description
+[63:16]         Reserved
+[15]            Disable ADI for VA reads (in API 2.0)
+                Reserved (in API 1.0)
+[14]            Virtual addresses within CCBs are translated in privileged context
+[13:12]         Alternate translation context for virtual addresses within CCBs:
+                 0b'00        CCBs requesting alternate context are rejected
+                 0b'01        Reserved
+                 0b'10        CCBs requesting alternate context use secondary context
+                 0b'11        CCBs requesting alternate context use nucleus context
+[11:9]          Reserved
+[8]             Queue info flag
+[7]             All-or-nothing flag
+[6]             If address is a virtual address, treat its translation context as privileged
+[5:4]           Address type of address:
+                 0b'00        Real address
+                 0b'01        Virtual address in primary context
+                 0b'10        Virtual address in secondary context
+                 0b'11        Virtual address in nucleus context
+[3:2]           Reserved
+[1:0]           CCB command type:
+                 0b'00        Reserved
+                 0b'01        Reserved
+                 0b'10        Query command
+                 0b'11        Reserved
+
+
+
+                                              529
+                                             Coprocessor services
+
+
+         The CCB submission type and address type for the CCB array must be provided in the flags argument.
+         All other fields are optional values which change the default behavior of the CCB processing.
+
+         When set to one, the "Disable ADI for VA reads" bit will turn off ADI checking when using a virtual
+         address to load data. ADI checking will still be done when loading real-addressed memory. This bit is only
+         available when using major version 2 of the coprocessor API group; at major version 1 it is reserved. For
+         more information about using ADI and DAX, see Section 36.2.1.1.7, “Application Data Integrity (ADI)”.
+
+         By default, all virtual addresses are treated as user addresses. If the virtual address translations are
+         privileged, they must be marked as such in the appropriate flags field. The virtual addresses used within
+         the submitted CCBs must all be translated with the same privilege level.
+
+         By default, all virtual addresses used within the submitted CCBs are translated using the primary context
+         active at the time of the submission. The address type field within a CCB allows each address to request
+         translation in an alternate address context. The address context used when the alternate address context is
+         requested is selected in the flags argument.
+
+         The all-or-nothing flag specifies whether the virtual machine should allow partial submissions of the
+         input CCB array. When using CCBs with serial-conditional flags, it is strongly recommended to use
+         the all-or-nothing flag to avoid broken conditional chains. Using long CCB chains on a machine under
+         high coprocessor load may make this impractical, however, and require submitting without the flag.
+         When submitting serial-conditional CCBs without the all-or-nothing flag, guest software must manually
+         implement the serial-conditional behavior at any point where the chain was not submitted in a single API
+         call, and resubmission of the remaining CCBs should clear any conditional flag that might be set in the
+         first remaining CCB. Failure to do so will produce indeterminate CCB execution status and ordering.
+
+         When the all-or-nothing flag is not specified, callers should check the value of length in ret1 to determine
+         how many CCBs from the array were successfully submitted. Any remaining CCBs can be resubmitted
+         without modifications.
+
+         The value of length in ret1 is also valid when the API call returns an error, and callers should always
+         check its value to determine which CCBs in the array were already processed. This will additionally
+         identify which CCB encountered the processing error, and was not submitted successfully.
+
+         If the queue info flag is used during submission, and at least one CCB was successfully submitted, the
+         length value in ret1 will be a multi-field value defined as follows:
+          Bits          Field Description
+          [63:48]       DAX unit instance identifier
+          [47:32]       DAX queue instance identifier
+          [31:16]       Reserved
+          [15:0]        Number of CCB bytes successfully submitted
+
+         The value of status data depends on the status value. See error status code descriptions for details.
+         The value is undefined for status values that do not specifically list a value for the status data.
+
+         The API has a reserved input and output register which will be used in subsequent minor versions of this
+         API function. Guest software implementations should treat that register as voltile across the function call
+         in order to maintain forward compatibility.
+
+36.3.1.1. Errors
+          EOK                       One or more CCBs have been accepted and enqueued in the virtual machine
+                                    and no errors were been encountered during submission. Some submitted
+                                    CCBs may not have been enqueued due to internal virtual machine limitations,
+                                    and may be resubmitted without changes.
+
+
+                                                       530
+                        Coprocessor services
+
+
+EWOULDBLOCK    An internal resource conflict within the virtual machine has prevented it from
+               being able to complete the CCB submissions sufficiently quickly, requiring
+               it to abandon processing before it was complete. Some CCBs may have been
+               successfully enqueued prior to the block, and all remaining CCBs may be
+               resubmitted without changes.
+EBADALIGN      CCB array is not on a 64-byte boundary, or the array length is not a multiple
+               of 64 bytes.
+ENORADDR       A real address used either for the CCB array, or within one of the submitted
+               CCBs, is not valid for the guest. Some CCBs may have been enqueued prior
+               to the error being detected.
+ENOMAP         A virtual address used either for the CCB array, or within one of the submitted
+               CCBs, could not be translated by the virtual machine using either the TLB
+               or TSB contents. The submission may be retried after adding the required
+               mapping, or by converting the virtual address into a real address. Due to the
+               shared nature of address translation resources, there is no theoretical limit on
+               the number of times the translation may fail, and it is recommended all guests
+               implement some real address based backup. The virtual address which failed
+               translation is returned as status data in ret2. Some CCBs may have been
+               enqueued prior to the error being detected.
+EINVAL         The virtual machine detected an invalid CCB during submission, or invalid
+               input arguments, such as bad flag values. Note that not all invalid CCB values
+               will be detected during submission, and some may be reported as errors in the
+               completion area instead. Some CCBs may have been enqueued prior to the
+               error being detected. This error may be returned if the CCB version is invalid.
+ETOOMANY       The request was submitted with the all-or-nothing flag set, and the array size is
+               greater than the virtual machine can support in a single request. The maximum
+               supported size for the current virtual machine can be queried by submitting a
+               request with a zero length array, as described above.
+ENOACCESS      The guest does not have permission to submit CCBs, or an address used in a
+               CCBs lacks sufficient permissions to perform the required operation (no write
+               permission on the destination buffer address, for example). A virtual address
+               which fails permission checking is returned as status data in ret2. Some
+               CCBs may have been enqueued prior to the error being detected.
+EUNAVAILABLE   The requested CCB operation could not be performed at this time. The
+               restricted operation availability may apply only to the first unsuccessfully
+               submitted CCB, or may apply to a larger scope. The status should not be
+               interpreted as permanent, and the guest should attempt to submit CCBs in
+               the future which had previously been unable to be performed. The status
+               data provides additional information about scope of the restricted availability
+               as follows:
+               Value       Description
+               0           Processing for the exact CCB instance submitted was unavailable,
+                           and it is recommended the guest emulate the operation. The
+                           guest should continue to submit all other CCBs, and assume no
+                           restrictions beyond this exact CCB instance.
+               1           Processing is unavailable for all CCBs using the requested opcode,
+                           and it is recommended the guest emulate the operation. The
+                           guest should continue to submit all other CCBs that use different
+                           opcodes, but can expect continued rejections of CCBs using the
+                           same opcode in the near future.
+
+
+                                 531
+                                              Coprocessor services
+
+
+                                      Value     Description
+                                      2         Processing is unavailable for all CCBs using the requested CCB
+                                                version, and it is recommended the guest emulate the operation.
+                                                The guest should continue to submit all other CCBs that use
+                                                different CCB versions, but can expect continued rejections of
+                                                CCBs using the same CCB version in the near future.
+                                      3         Processing is unavailable for all CCBs on the submitting vcpu,
+                                                and it is recommended the guest emulate the operation or resubmit
+                                                the CCB on a different vcpu. The guest should continue to submit
+                                                CCBs on all other vcpus but can expect continued rejections of all
+                                                CCBs on this vcpu in the near future.
+                                      4         Processing is unavailable for all CCBs, and it is recommended
+                                                the guest emulate the operation. The guest should expect all CCB
+                                                submissions to be similarly rejected in the near future.
+
+
+36.3.2. ccb_info
+
+        trap#               FAST_TRAP
+        function#           CCB_INFO
+        arg0                address
+        ret0                status
+        ret1                CCB state
+        ret2                position
+        ret3                dax
+        ret4                queue
+
+       Requests status information on a previously submitted CCB. The previously submitted CCB is identified
+       by the 64-byte aligned real address of the CCBs completion area.
+
+       A CCB can be in one of 4 states:
+
+
+        State                     Value       Description
+        COMPLETED                 0           The CCB has been fetched and executed, and is no longer active in
+                                              the virtual machine.
+        ENQUEUED                  1           The requested CCB is current in a queue awaiting execution.
+        INPROGRESS                2           The CCB has been fetched and is currently being executed. It may still
+                                              be possible to stop the execution using the ccb_kill hypercall.
+        NOTFOUND                  3           The CCB could not be located in the virtual machine, and does not
+                                              appear to have been executed. This may occur if the CCB was lost
+                                              due to a hardware error, or the CCB may not have been successfully
+                                              submitted to the virtual machine in the first place.
+
+               Implementation note
+               Some platforms may not be able to report CCBs that are currently being processed, and therefore
+               guest software should invoke the ccb_kill hypercall prior to assuming the request CCB will never
+               be executed because it was in the NOTFOUND state.
+
+
+                                                       532
+                                             Coprocessor services
+
+
+         The position return value is only valid when the state is ENQUEUED. The value returned is the number
+         of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute.
+
+         The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit
+         instance identifier for the DAX unit processing the queue where the requested CCB is located. The value
+         matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
+
+         The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX
+         queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The
+         value matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
+
+36.3.2.1. Errors
+
+          EOK                       The request was processed and the CCB state is valid.
+          EBADALIGN                 address is not on a 64-byte aligned.
+          ENORADDR                  The real address provided for address is not valid.
+          EINVAL                    The CCB completion area contents are not valid.
+          EWOULDBLOCK               Internal resource constraints prevented the CCB state from being queried at this
+                                    time. The guest should retry the request.
+          ENOACCESS                 The guest does not have permission to access the coprocessor virtual device
+                                    functionality.
+
+36.3.3. ccb_kill
+
+          trap#           FAST_TRAP
+          function#       CCB_KILL
+          arg0            address
+          ret0            status
+          ret1            result
+
+         Request to stop execution of a previously submitted CCB. The previously submitted CCB is identified by
+         the 64-byte aligned real address of the CCBs completion area.
+
+         The kill attempt can produce one of several values in the result return value, reflecting the CCB state
+         and actions taken by the Hypervisor:
+
+          Result                Value       Description
+          COMPLETED             0           The CCB has been fetched and executed, and is no longer active in
+                                            the virtual machine. It could not be killed and no action was taken.
+          DEQUEUED              1           The requested CCB was still enqueued when the kill request was
+                                            submitted, and has been removed from the queue. Since the CCB
+                                            never began execution, no memory modifications were produced by
+                                            it, and the completion area will never be updated. The same CCB may
+                                            be submitted again, if desired, with no modifications required.
+          KILLED                2           The CCB had been fetched and was being executed when the kill
+                                            request was submitted. The CCB execution was stopped, and the CCB
+                                            is no longer active in the virtual machine. The CCB completion area
+                                            will reflect the killed status, with the subsequent implications that
+                                            partial results may have been produced. Partial results may include full
+
+
+                                                      533
+                                              Coprocessor services
+
+
+          Result                 Value       Description
+                                             command execution if the command was stopped just prior to writing
+                                             to the completion area.
+          NOTFOUND               3           The CCB could not be located in the virtual machine, and does not
+                                             appear to have been executed. This may occur if the CCB was lost
+                                             due to a hardware error, or the CCB may not have been successfully
+                                             submitted to the virtual machine in the first place. CCBs in the state
+                                             are guaranteed to never execute in the future unless resubmitted.
+
+36.3.3.1. Interactions with Pipelined CCBs
+
+         If the pipeline target CCB is killed but the pipeline source CCB was skipped, the completion area of the
+         target CCB may contain status (4,0) "Command was skipped" instead of (3,7) "Command was killed".
+
+         If the pipeline source CCB is killed, the pipeline target CCB's completion status may read (1,0) "Success".
+         This does not mean the target CCB was processed; since the source CCB was killed, there was no
+         meaningful output on which the target CCB could operate.
+
+36.3.3.2. Errors
+
+          EOK                        The request was processed and the result is valid.
+          EBADALIGN                  address is not on a 64-byte aligned.
+          ENORADDR                   The real address provided for address is not valid.
+          EINVAL                     The CCB completion area contents are not valid.
+          EWOULDBLOCK                Internal resource constraints prevented the CCB from being killed at this time.
+                                     The guest should retry the request.
+          ENOACCESS                  The guest does not have permission to access the coprocessor virtual device
+                                     functionality.
+
+36.3.4. dax_info
+          trap#            FAST_TRAP
+          function#        DAX_INFO
+          ret0             status
+          ret1             Number of enabled DAX units
+          ret2             Number of disabled DAX units
+
+         Returns the number of DAX units that are enabled for the calling guest to submit CCBs. The number of
+         DAX units that are disabled for the calling guest are also returned. A disabled DAX unit would have been
+         available for CCB submission to the calling guest had it not been offlined.
+
+36.3.4.1. Errors
+
+          EOK                        The request was processed and the number of enabled/disabled DAX units
+                                     are valid.
+
+
+
+
+                                                       534
+
diff --git a/Documentation/arch/sparc/oradax/oracle-dax.rst b/Documentation/arch/sparc/oradax/oracle-dax.rst
new file mode 100644
index 000000000000..d1e14d572918
--- /dev/null
+++ b/Documentation/arch/sparc/oradax/oracle-dax.rst
@@ -0,0 +1,445 @@
+=======================================
+Oracle Data Analytics Accelerator (DAX)
+=======================================
+
+DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
+(DAX2) processor chips, and has direct access to the CPU's L3 caches
+as well as physical memory. It can perform several operations on data
+streams with various input and output formats.  A driver provides a
+transport mechanism and has limited knowledge of the various opcodes
+and data formats. A user space library provides high level services
+and translates these into low level commands which are then passed
+into the driver and subsequently the Hypervisor and the coprocessor.
+The library is the recommended way for applications to use the
+coprocessor, and the driver interface is not intended for general use.
+This document describes the general flow of the driver, its
+structures, and its programmatic interface. It also provides example
+code sufficient to write user or kernel applications that use DAX
+functionality.
+
+The user library is open source and available at:
+
+    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
+
+The Hypervisor interface to the coprocessor is described in detail in
+the accompanying document, dax-hv-api.txt, which is a plain text
+excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
+Specification" version 3.0.20+15, dated 2017-09-25.
+
+
+High Level Overview
+===================
+
+A coprocessor request is described by a Command Control Block
+(CCB). The CCB contains an opcode and various parameters. The opcode
+specifies what operation is to be done, and the parameters specify
+options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
+is passed to the Hypervisor, which handles queueing and scheduling of
+requests to the available coprocessor execution units. A status code
+returned indicates if the request was submitted successfully or if
+there was an error.  One of the addresses given in each CCB is a
+pointer to a "completion area", which is a 128 byte memory block that
+is written by the coprocessor to provide execution status. No
+interrupt is generated upon completion; the completion area must be
+polled by software to find out when a transaction has finished, but
+the M7 and later processors provide a mechanism to pause the virtual
+processor until the completion status has been updated by the
+coprocessor. This is done using the monitored load and mwait
+instructions, which are described in more detail later.  The DAX
+coprocessor was designed so that after a request is submitted, the
+kernel is no longer involved in the processing of it.  The polling is
+done at the user level, which results in almost zero latency between
+completion of a request and resumption of execution of the requesting
+thread.
+
+
+Addressing Memory
+=================
+
+The kernel does not have access to physical memory in the Sun4v
+architecture, as there is an additional level of memory virtualization
+present. This intermediate level is called "real" memory, and the
+kernel treats this as if it were physical.  The Hypervisor handles the
+translations between real memory and physical so that each logical
+domain (LDOM) can have a partition of physical memory that is isolated
+from that of other LDOMs.  When the kernel sets up a virtual mapping,
+it specifies a virtual address and the real address to which it should
+be mapped.
+
+The DAX coprocessor can only operate on physical memory, so before a
+request can be fed to the coprocessor, all the addresses in a CCB must
+be converted into physical addresses. The kernel cannot do this since
+it has no visibility into physical addresses. So a CCB may contain
+either the virtual or real addresses of the buffers or a combination
+of them. An "address type" field is available for each address that
+may be given in the CCB. In all cases, the Hypervisor will translate
+all the addresses to physical before dispatching to hardware. Address
+translations are performed using the context of the process initiating
+the request.
+
+
+The Driver API
+==============
+
+An application makes requests to the driver via the write() system
+call, and gets results (if any) via read(). The completion areas are
+made accessible via mmap(), and are read-only for the application.
+
+The request may either be an immediate command or an array of CCBs to
+be submitted to the hardware.
+
+Each open instance of the device is exclusive to the thread that
+opened it, and must be used by that thread for all subsequent
+operations. The driver open function creates a new context for the
+thread and initializes it for use.  This context contains pointers and
+values used internally by the driver to keep track of submitted
+requests. The completion area buffer is also allocated, and this is
+large enough to contain the completion areas for many concurrent
+requests.  When the device is closed, any outstanding transactions are
+flushed and the context is cleaned up.
+
+On a DAX1 system (M7), the device will be called "oradax1", while on a
+DAX2 system (M8) it will be "oradax2". If an application requires one
+or the other, it should simply attempt to open the appropriate
+device. Only one of the devices will exist on any given system, so the
+name can be used to determine what the platform supports.
+
+The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
+all of these, success is indicated by a return value from write()
+equal to the number of bytes given in the call. Otherwise -1 is
+returned and errno is set.
+
+CCB_DEQUEUE
+-----------
+
+Tells the driver to clean up resources associated with past
+requests. Since no interrupt is generated upon the completion of a
+request, the driver must be told when it may reclaim resources.  No
+further status information is returned, so the user should not
+subsequently call read().
+
+CCB_KILL
+--------
+
+Kills a CCB during execution. The CCB is guaranteed to not continue
+executing once this call returns successfully. On success, read() must
+be called to retrieve the result of the action.
+
+CCB_INFO
+--------
+
+Retrieves information about a currently executing CCB. Note that some
+Hypervisors might return 'notfound' when the CCB is in 'inprogress'
+state. To ensure a CCB in the 'notfound' state will never be executed,
+CCB_KILL must be invoked on that CCB. Upon success, read() must be
+called to retrieve the details of the action.
+
+Submission of an array of CCBs for execution
+---------------------------------------------
+
+A write() whose length is a multiple of the CCB size is treated as a
+submit operation. The file offset is treated as the index of the
+completion area to use, and may be set via lseek() or using the
+pwrite() system call. If -1 is returned then errno is set to indicate
+the error. Otherwise, the return value is the length of the array that
+was actually accepted by the coprocessor. If the accepted length is
+equal to the requested length, then the submission was completely
+successful and there is no further status needed; hence, the user
+should not subsequently call read(). Partial acceptance of the CCB
+array is indicated by a return value less than the requested length,
+and read() must be called to retrieve further status information.  The
+status will reflect the error caused by the first CCB that was not
+accepted, and status_data will provide additional data in some cases.
+
+MMAP
+----
+
+The mmap() function provides access to the completion area allocated
+in the driver.  Note that the completion area is not writeable by the
+user process, and the mmap call must not specify PROT_WRITE.
+
+
+Completion of a Request
+=======================
+
+The first byte in each completion area is the command status which is
+updated by the coprocessor hardware. Software may take advantage of
+new M7/M8 processor capabilities to efficiently poll this status byte.
+First, a "monitored load" is achieved via a Load from Alternate Space
+(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
+"monitored wait" is achieved via the mwait instruction (a write to
+%asr28). This instruction is like pause in that it suspends execution
+of the virtual processor for the given number of nanoseconds, but in
+addition will terminate early when one of several events occur. If the
+block of data containing the monitored location is modified, then the
+mwait terminates. This causes software to resume execution immediately
+(without a context switch or kernel to user transition) after a
+transaction completes. Thus the latency between transaction completion
+and resumption of execution may be just a few nanoseconds.
+
+
+Application Life Cycle of a DAX Submission
+==========================================
+
+ - open dax device
+ - call mmap() to get the completion area address
+ - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
+ - submit CCB via write() or pwrite()
+ - go into a loop executing monitored load + monitored wait and
+   terminate when the command status indicates the request is complete
+   (CCB_KILL or CCB_INFO may be used any time as necessary)
+ - perform a CCB_DEQUEUE
+ - call munmap() for completion area
+ - close the dax device
+
+
+Memory Constraints
+==================
+
+The DAX hardware operates only on physical addresses. Therefore, it is
+not aware of virtual memory mappings and the discontiguities that may
+exist in the physical memory that a virtual buffer maps to. There is
+no I/O TLB or any scatter/gather mechanism. All buffers, whether input
+or output, must reside in a physically contiguous region of memory.
+
+The Hypervisor translates all addresses within a CCB to physical
+before handing off the CCB to DAX. The Hypervisor determines the
+virtual page size for each virtual address given, and uses this to
+program a size limit for each address. This prevents the coprocessor
+from reading or writing beyond the bound of the virtual page, even
+though it is accessing physical memory directly. A simpler way of
+saying this is that a DAX operation will never "cross" a virtual page
+boundary. If an 8k virtual page is used, then the data is strictly
+limited to 8k. If a user's buffer is larger than 8k, then a larger
+page size must be used, or the transaction size will be truncated to
+8k.
+
+Huge pages. A user may allocate huge pages using standard interfaces.
+Memory buffers residing on huge pages may be used to achieve much
+larger DAX transaction sizes, but the rules must still be followed,
+and no transaction will cross a page boundary, even a huge page.  A
+major caveat is that Linux on Sparc presents 8Mb as one of the huge
+page sizes. Sparc does not actually provide a 8Mb hardware page size,
+and this size is synthesized by pasting together two 4Mb pages. The
+reasons for this are historical, and it creates an issue because only
+half of this 8Mb page can actually be used for any given buffer in a
+DAX request, and it must be either the first half or the second half;
+it cannot be a 4Mb chunk in the middle, since that crosses a
+(hardware) page boundary. Note that this entire issue may be hidden by
+higher level libraries.
+
+
+CCB Structure
+-------------
+A CCB is an array of 8 64-bit words. Several of these words provide
+command opcodes, parameters, flags, etc., and the rest are addresses
+for the completion area, output buffer, and various inputs::
+
+   struct ccb {
+       u64   control;
+       u64   completion;
+       u64   input0;
+       u64   access;
+       u64   input1;
+       u64   op_data;
+       u64   output;
+       u64   table;
+   };
+
+See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
+each of these fields, and see dax-hv-api.txt for a complete description
+of the Hypervisor API available to the guest OS (ie, Linux kernel).
+
+The first word (control) is examined by the driver for the following:
+ - CCB version, which must be consistent with hardware version
+ - Opcode, which must be one of the documented allowable commands
+ - Address types, which must be set to "virtual" for all the addresses
+   given by the user, thereby ensuring that the application can
+   only access memory that it owns
+
+
+Example Code
+============
+
+The DAX is accessible to both user and kernel code.  The kernel code
+can make hypercalls directly while the user code must use wrappers
+provided by the driver. The setup of the CCB is nearly identical for
+both; the only difference is in preparation of the completion area. An
+example of user code is given now, with kernel code afterwards.
+
+In order to program using the driver API, the file
+arch/sparc/include/uapi/asm/oradax.h must be included.
+
+First, the proper device must be opened. For M7 it will be
+/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
+procedure is to attempt to open both, as only one will succeed::
+
+	fd = open("/dev/oradax1", O_RDWR);
+	if (fd < 0)
+		fd = open("/dev/oradax2", O_RDWR);
+	if (fd < 0)
+	       /* No DAX found */
+
+Next, the completion area must be mapped::
+
+      completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
+
+All input and output buffers must be fully contained in one hardware
+page, since as explained above, the DAX is strictly constrained by
+virtual page boundaries.  In addition, the output buffer must be
+64-byte aligned and its size must be a multiple of 64 bytes because
+the coprocessor writes in units of cache lines.
+
+This example demonstrates the DAX Scan command, which takes as input a
+vector and a match value, and produces a bitmap as the output. For
+each input element that matches the value, the corresponding bit is
+set in the output.
+
+In this example, the input vector consists of a series of single bits,
+and the match value is 0. So each 0 bit in the input will produce a 1
+in the output, and vice versa, which produces an output bitmap which
+is the input bitmap inverted.
+
+For details of all the parameters and bits used in this CCB, please
+refer to section 36.2.1.3 of the DAX Hypervisor API document, which
+describes the Scan command in detail::
+
+	ccb->control =       /* Table 36.1, CCB Header Format */
+		  (2L << 48)     /* command = Scan Value */
+		| (3L << 40)     /* output address type = primary virtual */
+		| (3L << 34)     /* primary input address type = primary virtual */
+		             /* Section 36.2.1, Query CCB Command Formats */
+		| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
+		| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
+		| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
+		| (0 <<  5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */
+		| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */
+
+	ccb->completion = 0;    /* Completion area address, to be filled in by driver */
+
+	ccb->input0 = (unsigned long) input; /* primary input address */
+
+	ccb->access =       /* Section 36.2.1.2, Data Access Control */
+		  (2 << 24)    /* Primary input length format = bits */
+		| (nbits - 1); /* number of bits in primary input stream, minus 1 */
+
+	ccb->input1 = 0;       /* secondary input address, unused */
+
+	ccb->op_data = 0;      /* scan criteria (value to be matched) */
+
+	ccb->output = (unsigned long) output;	/* output address */
+
+	ccb->table = 0;	       /* table address, unused */
+
+The CCB submission is a write() or pwrite() system call to the
+driver. If the call fails, then a read() must be used to retrieve the
+status::
+
+	if (pwrite(fd, ccb, 64, 0) != 64) {
+		struct ccb_exec_result status;
+		read(fd, &status, sizeof(status));
+		/* bail out */
+	}
+
+After a successful submission of the CCB, the completion area may be
+polled to determine when the DAX is finished. Detailed information on
+the contents of the completion area can be found in section 36.2.2 of
+the DAX HV API document::
+
+	while (1) {
+		/* Monitored Load */
+		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
+				     : "=r" (status)
+				     : "r"  (completion_area));
+
+		if (status)	     /* 0 indicates command in progress */
+			break;
+
+		/* MWAIT */
+		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
+	}
+
+A completion area status of 1 indicates successful completion of the
+CCB and validity of the output bitmap, which may be used immediately.
+All other non-zero values indicate error conditions which are
+described in section 36.2.2::
+
+	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
+		/* completion_area[0] contains the completion status */
+		/* completion_area[1] contains an error code, see 36.2.2 */
+	}
+
+After the completion area has been processed, the driver must be
+notified that it can release any resources associated with the
+request. This is done via the dequeue operation::
+
+	struct dax_command cmd;
+	cmd.command = CCB_DEQUEUE;
+	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
+		/* bail out */
+	}
+
+Finally, normal program cleanup should be done, i.e., unmapping
+completion area, closing the dax device, freeing memory etc.
+
+Kernel example
+--------------
+
+The only difference in using the DAX in kernel code is the treatment
+of the completion area. Unlike user applications which mmap the
+completion area allocated by the driver, kernel code must allocate its
+own memory to use for the completion area, and this address and its
+type must be given in the CCB::
+
+	ccb->control |=      /* Table 36.1, CCB Header Format */
+	        (3L << 32);     /* completion area address type = primary virtual */
+
+	ccb->completion = (unsigned long) completion_area;   /* Completion area address */
+
+The dax submit hypercall is made directly. The flags used in the
+ccb_submit call are documented in the DAX HV API in section 36.3.1/
+
+::
+
+  #include <asm/hypervisor.h>
+
+	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
+				 HV_CCB_QUERY_CMD |
+				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
+				 HV_CCB_VA_PRIVILEGED,
+				 0, &bytes_accepted, &status_data);
+
+	if (hv_rv != HV_EOK) {
+		/* hv_rv is an error code, status_data contains */
+		/* potential additional status, see 36.3.1.1 */
+	}
+
+After the submission, the completion area polling code is identical to
+that in user land::
+
+	while (1) {
+		/* Monitored Load */
+		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
+				     : "=r" (status)
+				     : "r"  (completion_area));
+
+		if (status)	     /* 0 indicates command in progress */
+			break;
+
+		/* MWAIT */
+		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
+	}
+
+	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
+		/* completion_area[0] contains the completion status */
+		/* completion_area[1] contains an error code, see 36.2.2 */
+	}
+
+The output bitmap is ready for consumption immediately after the
+completion status indicates success.
+
+Excer[t from UltraSPARC Virtual Machine Specification
+=====================================================
+
+ .. include:: dax-hv-api.txt
+    :literal:
diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst
new file mode 100644
index 000000000000..934310ce7258
--- /dev/null
+++ b/Documentation/arch/x86/amd-memory-encryption.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+AMD Memory Encryption
+=====================
+
+Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are
+features found on AMD processors.
+
+SME provides the ability to mark individual pages of memory as encrypted using
+the standard x86 page tables.  A page that is marked encrypted will be
+automatically decrypted when read from DRAM and encrypted when written to
+DRAM.  SME can therefore be used to protect the contents of DRAM from physical
+attacks on the system.
+
+SEV enables running encrypted virtual machines (VMs) in which the code and data
+of the guest VM are secured so that a decrypted version is available only
+within the VM itself. SEV guest VMs have the concept of private and shared
+memory. Private memory is encrypted with the guest-specific key, while shared
+memory may be encrypted with hypervisor key. When SME is enabled, the hypervisor
+key is the same key which is used in SME.
+
+A page is encrypted when a page table entry has the encryption bit set (see
+below on how to determine its position).  The encryption bit can also be
+specified in the cr3 register, allowing the PGD table to be encrypted. Each
+successive level of page tables can also be encrypted by setting the encryption
+bit in the page table entry that points to the next table. This allows the full
+page table hierarchy to be encrypted. Note, this means that just because the
+encryption bit is set in cr3, doesn't imply the full hierarchy is encrypted.
+Each page table entry in the hierarchy needs to have the encryption bit set to
+achieve that. So, theoretically, you could have the encryption bit set in cr3
+so that the PGD is encrypted, but not set the encryption bit in the PGD entry
+for a PUD which results in the PUD pointed to by that entry to not be
+encrypted.
+
+When SEV is enabled, instruction pages and guest page tables are always treated
+as private. All the DMA operations inside the guest must be performed on shared
+memory. Since the memory encryption bit is controlled by the guest OS when it
+is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware
+forces the memory encryption bit to 1.
+
+Support for SME and SEV can be determined through the CPUID instruction. The
+CPUID function 0x8000001f reports information related to SME::
+
+	0x8000001f[eax]:
+		Bit[0] indicates support for SME
+		Bit[1] indicates support for SEV
+	0x8000001f[ebx]:
+		Bits[5:0]  pagetable bit number used to activate memory
+			   encryption
+		Bits[11:6] reduction in physical address space, in bits, when
+			   memory encryption is enabled (this only affects
+			   system physical addresses, not guest physical
+			   addresses)
+
+If support for SME is present, MSR 0xc00100010 (MSR_AMD64_SYSCFG) can be used to
+determine if SME is enabled and/or to enable memory encryption::
+
+	0xc0010010:
+		Bit[23]   0 = memory encryption features are disabled
+			  1 = memory encryption features are enabled
+
+If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if
+SEV is active::
+
+	0xc0010131:
+		Bit[0]	  0 = memory encryption is not active
+			  1 = memory encryption is active
+
+Linux relies on BIOS to set this bit if BIOS has determined that the reduction
+in the physical address space as a result of enabling memory encryption (see
+CPUID information above) will not conflict with the address space resource
+requirements for the system.  If this bit is not set upon Linux startup then
+Linux itself will not set it and memory encryption will not be possible.
+
+The state of SME in the Linux kernel can be documented as follows:
+
+	- Supported:
+	  The CPU supports SME (determined through CPUID instruction).
+
+	- Enabled:
+	  Supported and bit 23 of MSR_AMD64_SYSCFG is set.
+
+	- Active:
+	  Supported, Enabled and the Linux kernel is actively applying
+	  the encryption bit to page table entries (the SME mask in the
+	  kernel is non-zero).
+
+SME can also be enabled and activated in the BIOS. If SME is enabled and
+activated in the BIOS, then all memory accesses will be encrypted and it will
+not be necessary to activate the Linux memory encryption support.  If the BIOS
+merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
+memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
+by supplying mem_encrypt=on on the kernel command line.  However, if BIOS does
+not enable SME, then Linux will not be able to activate memory encryption, even
+if configured to do so by default or the mem_encrypt=on command line parameter
+is specified.
+
+Secure Nested Paging (SNP)
+==========================
+
+SEV-SNP introduces new features (SEV_FEATURES[1:63]) which can be enabled
+by the hypervisor for security enhancements. Some of these features need
+guest side implementation to function correctly. The below table lists the
+expected guest behavior with various possible scenarios of guest/hypervisor
+SNP feature support.
+
++-----------------+---------------+---------------+------------------+
+| Feature Enabled | Guest needs   | Guest has     | Guest boot       |
+| by the HV       | implementation| implementation| behaviour        |
++=================+===============+===============+==================+
+|      No         |      No       |      No       |     Boot         |
+|                 |               |               |                  |
++-----------------+---------------+---------------+------------------+
+|      No         |      Yes      |      No       |     Boot         |
+|                 |               |               |                  |
++-----------------+---------------+---------------+------------------+
+|      No         |      Yes      |      Yes      |     Boot         |
+|                 |               |               |                  |
++-----------------+---------------+---------------+------------------+
+|      Yes        |      No       |      No       | Boot with        |
+|                 |               |               | feature enabled  |
++-----------------+---------------+---------------+------------------+
+|      Yes        |      Yes      |      No       | Graceful boot    |
+|                 |               |               | failure          |
++-----------------+---------------+---------------+------------------+
+|      Yes        |      Yes      |      Yes      | Boot with        |
+|                 |               |               | feature enabled  |
++-----------------+---------------+---------------+------------------+
+
+More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
+
+[1] https://www.amd.com/system/files/TechDocs/40332.pdf
diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst
new file mode 100644
index 000000000000..440e4b645a1c
--- /dev/null
+++ b/Documentation/arch/x86/amd_hsmp.rst
@@ -0,0 +1,86 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+AMD HSMP interface
+============================================
+
+Newer Fam19h EPYC server line of processors from AMD support system
+management functionality via HSMP (Host System Management Port).
+
+The Host System Management Port (HSMP) is an interface to provide
+OS-level software with access to system management functions via a
+set of mailbox registers.
+
+More details on the interface can be found in chapter
+"7 Host System Management Port (HSMP)" of the family/model PPR
+Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+
+HSMP interface is supported on EPYC server CPU models only.
+
+
+HSMP device
+============================================
+
+amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
+/dev/hsmp to let user space programs run hsmp mailbox commands.
+
+$ ls -al /dev/hsmp
+crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
+
+Characteristics of the dev node:
+ * Write mode is used for running set/configure commands
+ * Read mode is used for running get/status monitor commands
+
+Access restrictions:
+ * Only root user is allowed to open the file in write mode.
+ * The file can be opened in read mode by all the users.
+
+In-kernel integration:
+ * Other subsystems in the kernel can use the exported transport
+   function hsmp_send_message().
+ * Locking across callers is taken care by the driver.
+
+
+An example
+==========
+
+To access hsmp device from a C program.
+First, you need to include the headers::
+
+  #include <linux/amd_hsmp.h>
+
+Which defines the supported messages/message IDs.
+
+Next thing, open the device file, as follows::
+
+  int file;
+
+  file = open("/dev/hsmp", O_RDWR);
+  if (file < 0) {
+    /* ERROR HANDLING; you can check errno to see what went wrong */
+    exit(1);
+  }
+
+The following IOCTL is defined:
+
+``ioctl(file, HSMP_IOCTL_CMD, struct hsmp_message *msg)``
+  The argument is a pointer to a::
+
+    struct hsmp_message {
+    	__u32	msg_id;				/* Message ID */
+    	__u16	num_args;			/* Number of input argument words in message */
+    	__u16	response_sz;			/* Number of expected output/response words */
+    	__u32	args[HSMP_MAX_MSG_LEN];		/* argument/response buffer */
+    	__u16	sock_ind;			/* socket number */
+    };
+
+The ioctl would return a non-zero on failure; you can read errno to see
+what happened. The transaction returns 0 on success.
+
+More details on the interface and message definitions can be found in chapter
+"7 Host System Management Port (HSMP)" of the respective family/model PPR
+eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+
+User space C-APIs are made available by linking against the esmi library,
+which is provided by the E-SMS project https://developer.amd.com/e-sms/.
+See: https://github.com/amd/esmi_ib_library
diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
new file mode 100644
index 000000000000..33520ecdb37a
--- /dev/null
+++ b/Documentation/arch/x86/boot.rst
@@ -0,0 +1,1443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+The Linux/x86 Boot Protocol
+===========================
+
+On the x86 platform, the Linux kernel uses a rather complicated boot
+convention.  This has evolved partially due to historical aspects, as
+well as the desire in the early days to have the kernel itself be a
+bootable image, the complicated PC memory model and due to changed
+expectations in the PC industry caused by the effective demise of
+real-mode DOS as a mainstream operating system.
+
+Currently, the following versions of the Linux/x86 boot protocol exist.
+
+=============	============================================================
+Old kernels	zImage/Image support only.  Some very early kernels
+		may not even support a command line.
+
+Protocol 2.00	(Kernel 1.3.73) Added bzImage and initrd support, as
+		well as a formalized way to communicate between the
+		boot loader and the kernel.  setup.S made relocatable,
+		although the traditional setup area still assumed
+		writable.
+
+Protocol 2.01	(Kernel 1.3.76) Added a heap overrun warning.
+
+Protocol 2.02	(Kernel 2.4.0-test3-pre3) New command line protocol.
+		Lower the conventional memory ceiling.	No overwrite
+		of the traditional setup area, thus making booting
+		safe for systems which use the EBDA from SMM or 32-bit
+		BIOS entry points.  zImage deprecated but still
+		supported.
+
+Protocol 2.03	(Kernel 2.4.18-pre1) Explicitly makes the highest possible
+		initrd address available to the bootloader.
+
+Protocol 2.04	(Kernel 2.6.14) Extend the syssize field to four bytes.
+
+Protocol 2.05	(Kernel 2.6.20) Make protected mode kernel relocatable.
+		Introduce relocatable_kernel and kernel_alignment fields.
+
+Protocol 2.06	(Kernel 2.6.22) Added a field that contains the size of
+		the boot command line.
+
+Protocol 2.07	(Kernel 2.6.24) Added paravirtualised boot protocol.
+		Introduced hardware_subarch and hardware_subarch_data
+		and KEEP_SEGMENTS flag in load_flags.
+
+Protocol 2.08	(Kernel 2.6.26) Added crc32 checksum and ELF format
+		payload. Introduced payload_offset and payload_length
+		fields to aid in locating the payload.
+
+Protocol 2.09	(Kernel 2.6.26) Added a field of 64-bit physical
+		pointer to single linked list of struct	setup_data.
+
+Protocol 2.10	(Kernel 2.6.31) Added a protocol for relaxed alignment
+		beyond the kernel_alignment added, new init_size and
+		pref_address fields.  Added extended boot loader IDs.
+
+Protocol 2.11	(Kernel 3.6) Added a field for offset of EFI handover
+		protocol entry point.
+
+Protocol 2.12	(Kernel 3.8) Added the xloadflags field and extension fields
+		to struct boot_params for loading bzImage and ramdisk
+		above 4G in 64bit.
+
+Protocol 2.13	(Kernel 3.14) Support 32- and 64-bit flags being set in
+		xloadflags to support booting a 64-bit kernel from 32-bit
+		EFI
+
+Protocol 2.14	BURNT BY INCORRECT COMMIT
+                ae7e1238e68f2a472a125673ab506d49158c1889
+		(x86/boot: Add ACPI RSDP address to setup_header)
+		DO NOT USE!!! ASSUME SAME AS 2.13.
+
+Protocol 2.15	(Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
+=============	============================================================
+
+.. note::
+     The protocol version number should be changed only if the setup header
+     is changed. There is no need to update the version number if boot_params
+     or kernel_info are changed. Additionally, it is recommended to use
+     xloadflags (in this case the protocol version number should not be
+     updated either) or kernel_info to communicate supported Linux kernel
+     features to the boot loader. Due to very limited space available in
+     the original setup header every update to it should be considered
+     with great care. Starting from the protocol 2.15 the primary way to
+     communicate things to the boot loader is the kernel_info.
+
+
+Memory Layout
+=============
+
+The traditional memory map for the kernel loader, used for Image or
+zImage kernels, typically looks like::
+
+		|			 |
+	0A0000	+------------------------+
+		|  Reserved for BIOS	 |	Do not use.  Reserved for BIOS EBDA.
+	09A000	+------------------------+
+		|  Command line		 |
+		|  Stack/heap		 |	For use by the kernel real-mode code.
+	098000	+------------------------+
+		|  Kernel setup		 |	The kernel real-mode code.
+	090200	+------------------------+
+		|  Kernel boot sector	 |	The kernel legacy boot sector.
+	090000	+------------------------+
+		|  Protected-mode kernel |	The bulk of the kernel image.
+	010000	+------------------------+
+		|  Boot loader		 |	<- Boot sector entry point 0000:7C00
+	001000	+------------------------+
+		|  Reserved for MBR/BIOS |
+	000800	+------------------------+
+		|  Typically used by MBR |
+	000600	+------------------------+
+		|  BIOS use only	 |
+	000000	+------------------------+
+
+When using bzImage, the protected-mode kernel was relocated to
+0x100000 ("high memory"), and the kernel real-mode block (boot sector,
+setup, and stack/heap) was made relocatable to any address between
+0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
+2.01 the 0x90000+ memory range is still used internally by the kernel;
+the 2.02 protocol resolves that problem.
+
+It is desirable to keep the "memory ceiling" -- the highest point in
+low memory touched by the boot loader -- as low as possible, since
+some newer BIOSes have begun to allocate some rather large amounts of
+memory, called the Extended BIOS Data Area, near the top of low
+memory.	 The boot loader should use the "INT 12h" BIOS call to verify
+how much low memory is available.
+
+Unfortunately, if INT 12h reports that the amount of memory is too
+low, there is usually nothing the boot loader can do but to report an
+error to the user.  The boot loader should therefore be designed to
+take up as little space in low memory as it reasonably can.  For
+zImage or old bzImage kernels, which need data written into the
+0x90000 segment, the boot loader should make sure not to use memory
+above the 0x9A000 point; too many BIOSes will break above that point.
+
+For a modern bzImage kernel with boot protocol version >= 2.02, a
+memory layout like the following is suggested::
+
+		~                        ~
+		|  Protected-mode kernel |
+	100000  +------------------------+
+		|  I/O memory hole	 |
+	0A0000	+------------------------+
+		|  Reserved for BIOS	 |	Leave as much as possible unused
+		~                        ~
+		|  Command line		 |	(Can also be below the X+10000 mark)
+	X+10000	+------------------------+
+		|  Stack/heap		 |	For use by the kernel real-mode code.
+	X+08000	+------------------------+
+		|  Kernel setup		 |	The kernel real-mode code.
+		|  Kernel boot sector	 |	The kernel legacy boot sector.
+	X       +------------------------+
+		|  Boot loader		 |	<- Boot sector entry point 0000:7C00
+	001000	+------------------------+
+		|  Reserved for MBR/BIOS |
+	000800	+------------------------+
+		|  Typically used by MBR |
+	000600	+------------------------+
+		|  BIOS use only	 |
+	000000	+------------------------+
+
+  ... where the address X is as low as the design of the boot loader permits.
+
+
+The Real-Mode Kernel Header
+===========================
+
+In the following text, and anywhere in the kernel boot sequence, "a
+sector" refers to 512 bytes.  It is independent of the actual sector
+size of the underlying medium.
+
+The first step in loading a Linux kernel should be to load the
+real-mode code (boot sector and setup code) and then examine the
+following header at offset 0x01f1.  The real-mode code can total up to
+32K, although the boot loader may choose to load only the first two
+sectors (1K) and then examine the bootup sector size.
+
+The header looks like:
+
+===========	========	=====================	============================================
+Offset/Size	Proto		Name			Meaning
+===========	========	=====================	============================================
+01F1/1		ALL(1)		setup_sects		The size of the setup in sectors
+01F2/2		ALL		root_flags		If set, the root is mounted readonly
+01F4/4		2.04+(2)	syssize			The size of the 32-bit code in 16-byte paras
+01F8/2		ALL		ram_size		DO NOT USE - for bootsect.S use only
+01FA/2		ALL		vid_mode		Video mode control
+01FC/2		ALL		root_dev		Default root device number
+01FE/2		ALL		boot_flag		0xAA55 magic number
+0200/2		2.00+		jump			Jump instruction
+0202/4		2.00+		header			Magic signature "HdrS"
+0206/2		2.00+		version			Boot protocol version supported
+0208/4		2.00+		realmode_swtch		Boot loader hook (see below)
+020C/2		2.00+		start_sys_seg		The load-low segment (0x1000) (obsolete)
+020E/2		2.00+		kernel_version		Pointer to kernel version string
+0210/1		2.00+		type_of_loader		Boot loader identifier
+0211/1		2.00+		loadflags		Boot protocol option flags
+0212/2		2.00+		setup_move_size		Move to high memory size (used with hooks)
+0214/4		2.00+		code32_start		Boot loader hook (see below)
+0218/4		2.00+		ramdisk_image		initrd load address (set by boot loader)
+021C/4		2.00+		ramdisk_size		initrd size (set by boot loader)
+0220/4		2.00+		bootsect_kludge		DO NOT USE - for bootsect.S use only
+0224/2		2.01+		heap_end_ptr		Free memory after setup end
+0226/1		2.02+(3)	ext_loader_ver		Extended boot loader version
+0227/1		2.02+(3)	ext_loader_type		Extended boot loader ID
+0228/4		2.02+		cmd_line_ptr		32-bit pointer to the kernel command line
+022C/4		2.03+		initrd_addr_max		Highest legal initrd address
+0230/4		2.05+		kernel_alignment	Physical addr alignment required for kernel
+0234/1		2.05+		relocatable_kernel	Whether kernel is relocatable or not
+0235/1		2.10+		min_alignment		Minimum alignment, as a power of two
+0236/2		2.12+		xloadflags		Boot protocol option flags
+0238/4		2.06+		cmdline_size		Maximum size of the kernel command line
+023C/4		2.07+		hardware_subarch	Hardware subarchitecture
+0240/8		2.07+		hardware_subarch_data	Subarchitecture-specific data
+0248/4		2.08+		payload_offset		Offset of kernel payload
+024C/4		2.08+		payload_length		Length of kernel payload
+0250/8		2.09+		setup_data		64-bit physical pointer to linked list
+							of struct setup_data
+0258/8		2.10+		pref_address		Preferred loading address
+0260/4		2.10+		init_size		Linear memory required during initialization
+0264/4		2.11+		handover_offset		Offset of handover entry point
+0268/4		2.15+		kernel_info_offset	Offset of the kernel_info
+===========	========	=====================	============================================
+
+.. note::
+  (1) For backwards compatibility, if the setup_sects field contains 0, the
+      real value is 4.
+
+  (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
+      field are unusable, which means the size of a bzImage kernel
+      cannot be determined.
+
+  (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
+
+If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
+the boot protocol version is "old".  Loading an old kernel, the
+following parameters should be assumed::
+
+	Image type = zImage
+	initrd not supported
+	Real-mode kernel must be located at 0x90000.
+
+Otherwise, the "version" field contains the protocol version,
+e.g. protocol version 2.01 will contain 0x0201 in this field.  When
+setting fields in the header, you must make sure only to set fields
+supported by the protocol version in use.
+
+
+Details of Header Fields
+========================
+
+For each field, some are information from the kernel to the bootloader
+("read"), some are expected to be filled out by the bootloader
+("write"), and some are expected to be read and modified by the
+bootloader ("modify").
+
+All general purpose boot loaders should write the fields marked
+(obligatory).  Boot loaders who want to load the kernel at a
+nonstandard address should fill in the fields marked (reloc); other
+boot loaders can ignore those fields.
+
+The byte order of all fields is littleendian (this is x86, after all.)
+
+============	===========
+Field name:	setup_sects
+Type:		read
+Offset/size:	0x1f1/1
+Protocol:	ALL
+============	===========
+
+  The size of the setup code in 512-byte sectors.  If this field is
+  0, the real value is 4.  The real-mode code consists of the boot
+  sector (always one 512-byte sector) plus the setup code.
+
+============	=================
+Field name:	root_flags
+Type:		modify (optional)
+Offset/size:	0x1f2/2
+Protocol:	ALL
+============	=================
+
+  If this field is nonzero, the root defaults to readonly.  The use of
+  this field is deprecated; use the "ro" or "rw" options on the
+  command line instead.
+
+============	===============================================
+Field name:	syssize
+Type:		read
+Offset/size:	0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
+Protocol:	2.04+
+============	===============================================
+
+  The size of the protected-mode code in units of 16-byte paragraphs.
+  For protocol versions older than 2.04 this field is only two bytes
+  wide, and therefore cannot be trusted for the size of a kernel if
+  the LOAD_HIGH flag is set.
+
+============	===============
+Field name:	ram_size
+Type:		kernel internal
+Offset/size:	0x1f8/2
+Protocol:	ALL
+============	===============
+
+  This field is obsolete.
+
+============	===================
+Field name:	vid_mode
+Type:		modify (obligatory)
+Offset/size:	0x1fa/2
+============	===================
+
+  Please see the section on SPECIAL COMMAND LINE OPTIONS.
+
+============	=================
+Field name:	root_dev
+Type:		modify (optional)
+Offset/size:	0x1fc/2
+Protocol:	ALL
+============	=================
+
+  The default root device device number.  The use of this field is
+  deprecated, use the "root=" option on the command line instead.
+
+============	=========
+Field name:	boot_flag
+Type:		read
+Offset/size:	0x1fe/2
+Protocol:	ALL
+============	=========
+
+  Contains 0xAA55.  This is the closest thing old Linux kernels have
+  to a magic number.
+
+============	=======
+Field name:	jump
+Type:		read
+Offset/size:	0x200/2
+Protocol:	2.00+
+============	=======
+
+  Contains an x86 jump instruction, 0xEB followed by a signed offset
+  relative to byte 0x202.  This can be used to determine the size of
+  the header.
+
+============	=======
+Field name:	header
+Type:		read
+Offset/size:	0x202/4
+Protocol:	2.00+
+============	=======
+
+  Contains the magic number "HdrS" (0x53726448).
+
+============	=======
+Field name:	version
+Type:		read
+Offset/size:	0x206/2
+Protocol:	2.00+
+============	=======
+
+  Contains the boot protocol version, in (major << 8)+minor format,
+  e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
+  10.17.
+
+============	=================
+Field name:	realmode_swtch
+Type:		modify (optional)
+Offset/size:	0x208/4
+Protocol:	2.00+
+============	=================
+
+  Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
+
+============	=============
+Field name:	start_sys_seg
+Type:		read
+Offset/size:	0x20c/2
+Protocol:	2.00+
+============	=============
+
+  The load low segment (0x1000).  Obsolete.
+
+============	==============
+Field name:	kernel_version
+Type:		read
+Offset/size:	0x20e/2
+Protocol:	2.00+
+============	==============
+
+  If set to a nonzero value, contains a pointer to a NUL-terminated
+  human-readable kernel version number string, less 0x200.  This can
+  be used to display the kernel version to the user.  This value
+  should be less than (0x200*setup_sects).
+
+  For example, if this value is set to 0x1c00, the kernel version
+  number string can be found at offset 0x1e00 in the kernel file.
+  This is a valid value if and only if the "setup_sects" field
+  contains the value 15 or higher, as::
+
+	0x1c00  < 15*0x200 (= 0x1e00) but
+	0x1c00 >= 14*0x200 (= 0x1c00)
+
+	0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15.
+
+============	==================
+Field name:	type_of_loader
+Type:		write (obligatory)
+Offset/size:	0x210/1
+Protocol:	2.00+
+============	==================
+
+  If your boot loader has an assigned id (see table below), enter
+  0xTV here, where T is an identifier for the boot loader and V is
+  a version number.  Otherwise, enter 0xFF here.
+
+  For boot loader IDs above T = 0xD, write T = 0xE to this field and
+  write the extended ID minus 0x10 to the ext_loader_type field.
+  Similarly, the ext_loader_ver field can be used to provide more than
+  four bits for the bootloader version.
+
+  For example, for T = 0x15, V = 0x234, write::
+
+	type_of_loader  <- 0xE4
+	ext_loader_type <- 0x05
+	ext_loader_ver  <- 0x23
+
+  Assigned boot loader ids (hexadecimal):
+
+	== =======================================
+	0  LILO
+	   (0x00 reserved for pre-2.00 bootloader)
+	1  Loadlin
+	2  bootsect-loader
+	   (0x20, all other values reserved)
+	3  Syslinux
+	4  Etherboot/gPXE/iPXE
+	5  ELILO
+	7  GRUB
+	8  U-Boot
+	9  Xen
+	A  Gujin
+	B  Qemu
+	C  Arcturus Networks uCbootloader
+	D  kexec-tools
+	E  Extended (see ext_loader_type)
+	F  Special (0xFF = undefined)
+	10 Reserved
+	11 Minimal Linux Bootloader
+	   <http://sebastian-plotz.blogspot.de>
+	12 OVMF UEFI virtualization stack
+	13 barebox
+	== =======================================
+
+  Please contact <hpa@zytor.com> if you need a bootloader ID value assigned.
+
+============	===================
+Field name:	loadflags
+Type:		modify (obligatory)
+Offset/size:	0x211/1
+Protocol:	2.00+
+============	===================
+
+  This field is a bitmask.
+
+  Bit 0 (read):	LOADED_HIGH
+
+	- If 0, the protected-mode code is loaded at 0x10000.
+	- If 1, the protected-mode code is loaded at 0x100000.
+
+  Bit 1 (kernel internal): KASLR_FLAG
+
+	- Used internally by the compressed kernel to communicate
+	  KASLR status to kernel proper.
+
+	    - If 1, KASLR enabled.
+	    - If 0, KASLR disabled.
+
+  Bit 5 (write): QUIET_FLAG
+
+	- If 0, print early messages.
+	- If 1, suppress early messages.
+
+		This requests to the kernel (decompressor and early
+		kernel) to not write early messages that require
+		accessing the display hardware directly.
+
+  Bit 6 (obsolete): KEEP_SEGMENTS
+
+	Protocol: 2.07+
+
+        - This flag is obsolete.
+
+  Bit 7 (write): CAN_USE_HEAP
+
+	Set this bit to 1 to indicate that the value entered in the
+	heap_end_ptr is valid.  If this field is clear, some setup code
+	functionality will be disabled.
+
+
+============	===================
+Field name:	setup_move_size
+Type:		modify (obligatory)
+Offset/size:	0x212/2
+Protocol:	2.00-2.01
+============	===================
+
+  When using protocol 2.00 or 2.01, if the real mode kernel is not
+  loaded at 0x90000, it gets moved there later in the loading
+  sequence.  Fill in this field if you want additional data (such as
+  the kernel command line) moved in addition to the real-mode kernel
+  itself.
+
+  The unit is bytes starting with the beginning of the boot sector.
+
+  This field is can be ignored when the protocol is 2.02 or higher, or
+  if the real-mode code is loaded at 0x90000.
+
+============	========================
+Field name:	code32_start
+Type:		modify (optional, reloc)
+Offset/size:	0x214/4
+Protocol:	2.00+
+============	========================
+
+  The address to jump to in protected mode.  This defaults to the load
+  address of the kernel, and can be used by the boot loader to
+  determine the proper load address.
+
+  This field can be modified for two purposes:
+
+    1. as a boot loader hook (see Advanced Boot Loader Hooks below.)
+
+    2. if a bootloader which does not install a hook loads a
+       relocatable kernel at a nonstandard address it will have to modify
+       this field to point to the load address.
+
+============	==================
+Field name:	ramdisk_image
+Type:		write (obligatory)
+Offset/size:	0x218/4
+Protocol:	2.00+
+============	==================
+
+  The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
+  zero if there is no initial ramdisk/ramfs.
+
+============	==================
+Field name:	ramdisk_size
+Type:		write (obligatory)
+Offset/size:	0x21c/4
+Protocol:	2.00+
+============	==================
+
+  Size of the initial ramdisk or ramfs.  Leave at zero if there is no
+  initial ramdisk/ramfs.
+
+============	===============
+Field name:	bootsect_kludge
+Type:		kernel internal
+Offset/size:	0x220/4
+Protocol:	2.00+
+============	===============
+
+  This field is obsolete.
+
+============	==================
+Field name:	heap_end_ptr
+Type:		write (obligatory)
+Offset/size:	0x224/2
+Protocol:	2.01+
+============	==================
+
+  Set this field to the offset (from the beginning of the real-mode
+  code) of the end of the setup stack/heap, minus 0x0200.
+
+============	================
+Field name:	ext_loader_ver
+Type:		write (optional)
+Offset/size:	0x226/1
+Protocol:	2.02+
+============	================
+
+  This field is used as an extension of the version number in the
+  type_of_loader field.  The total version number is considered to be
+  (type_of_loader & 0x0f) + (ext_loader_ver << 4).
+
+  The use of this field is boot loader specific.  If not written, it
+  is zero.
+
+  Kernels prior to 2.6.31 did not recognize this field, but it is safe
+  to write for protocol version 2.02 or higher.
+
+============	=====================================================
+Field name:	ext_loader_type
+Type:		write (obligatory if (type_of_loader & 0xf0) == 0xe0)
+Offset/size:	0x227/1
+Protocol:	2.02+
+============	=====================================================
+
+  This field is used as an extension of the type number in
+  type_of_loader field.  If the type in type_of_loader is 0xE, then
+  the actual type is (ext_loader_type + 0x10).
+
+  This field is ignored if the type in type_of_loader is not 0xE.
+
+  Kernels prior to 2.6.31 did not recognize this field, but it is safe
+  to write for protocol version 2.02 or higher.
+
+============	==================
+Field name:	cmd_line_ptr
+Type:		write (obligatory)
+Offset/size:	0x228/4
+Protocol:	2.02+
+============	==================
+
+  Set this field to the linear address of the kernel command line.
+  The kernel command line can be located anywhere between the end of
+  the setup heap and 0xA0000; it does not have to be located in the
+  same 64K segment as the real-mode code itself.
+
+  Fill in this field even if your boot loader does not support a
+  command line, in which case you can point this to an empty string
+  (or better yet, to the string "auto".)  If this field is left at
+  zero, the kernel will assume that your boot loader does not support
+  the 2.02+ protocol.
+
+============	===============
+Field name:	initrd_addr_max
+Type:		read
+Offset/size:	0x22c/4
+Protocol:	2.03+
+============	===============
+
+  The maximum address that may be occupied by the initial
+  ramdisk/ramfs contents.  For boot protocols 2.02 or earlier, this
+  field is not present, and the maximum address is 0x37FFFFFF.  (This
+  address is defined as the address of the highest safe byte, so if
+  your ramdisk is exactly 131072 bytes long and this field is
+  0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
+
+============	============================
+Field name:	kernel_alignment
+Type:		read/modify (reloc)
+Offset/size:	0x230/4
+Protocol:	2.05+ (read), 2.10+ (modify)
+============	============================
+
+  Alignment unit required by the kernel (if relocatable_kernel is
+  true.)  A relocatable kernel that is loaded at an alignment
+  incompatible with the value in this field will be realigned during
+  kernel initialization.
+
+  Starting with protocol version 2.10, this reflects the kernel
+  alignment preferred for optimal performance; it is possible for the
+  loader to modify this field to permit a lesser alignment.  See the
+  min_alignment and pref_address field below.
+
+============	==================
+Field name:	relocatable_kernel
+Type:		read (reloc)
+Offset/size:	0x234/1
+Protocol:	2.05+
+============	==================
+
+  If this field is nonzero, the protected-mode part of the kernel can
+  be loaded at any address that satisfies the kernel_alignment field.
+  After loading, the boot loader must set the code32_start field to
+  point to the loaded code, or to a boot loader hook.
+
+============	=============
+Field name:	min_alignment
+Type:		read (reloc)
+Offset/size:	0x235/1
+Protocol:	2.10+
+============	=============
+
+  This field, if nonzero, indicates as a power of two the minimum
+  alignment required, as opposed to preferred, by the kernel to boot.
+  If a boot loader makes use of this field, it should update the
+  kernel_alignment field with the alignment unit desired; typically::
+
+	kernel_alignment = 1 << min_alignment
+
+  There may be a considerable performance cost with an excessively
+  misaligned kernel.  Therefore, a loader should typically try each
+  power-of-two alignment from kernel_alignment down to this alignment.
+
+============	==========
+Field name:	xloadflags
+Type:		read
+Offset/size:	0x236/2
+Protocol:	2.12+
+============	==========
+
+  This field is a bitmask.
+
+  Bit 0 (read):	XLF_KERNEL_64
+
+	- If 1, this kernel has the legacy 64-bit entry point at 0x200.
+
+  Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G
+
+        - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G.
+
+  Bit 2 (read):	XLF_EFI_HANDOVER_32
+
+	- If 1, the kernel supports the 32-bit EFI handoff entry point
+          given at handover_offset.
+
+  Bit 3 (read): XLF_EFI_HANDOVER_64
+
+	- If 1, the kernel supports the 64-bit EFI handoff entry point
+          given at handover_offset + 0x200.
+
+  Bit 4 (read): XLF_EFI_KEXEC
+
+	- If 1, the kernel supports kexec EFI boot with EFI runtime support.
+
+
+============	============
+Field name:	cmdline_size
+Type:		read
+Offset/size:	0x238/4
+Protocol:	2.06+
+============	============
+
+  The maximum size of the command line without the terminating
+  zero. This means that the command line can contain at most
+  cmdline_size characters. With protocol version 2.05 and earlier, the
+  maximum size was 255.
+
+============	====================================
+Field name:	hardware_subarch
+Type:		write (optional, defaults to x86/PC)
+Offset/size:	0x23c/4
+Protocol:	2.07+
+============	====================================
+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers needs to be done differently.
+
+  This field allows the bootloader to inform the kernel we are in one
+  one of those environments.
+
+  ==========	==============================
+  0x00000000	The default x86/PC environment
+  0x00000001	lguest
+  0x00000002	Xen
+  0x00000003	Moorestown MID
+  0x00000004	CE4100 TV Platform
+  ==========	==============================
+
+============	=========================
+Field name:	hardware_subarch_data
+Type:		write (subarch-dependent)
+Offset/size:	0x240/8
+Protocol:	2.07+
+============	=========================
+
+  A pointer to data that is specific to hardware subarch
+  This field is currently unused for the default x86/PC environment,
+  do not modify.
+
+============	==============
+Field name:	payload_offset
+Type:		read
+Offset/size:	0x248/4
+Protocol:	2.08+
+============	==============
+
+  If non-zero then this field contains the offset from the beginning
+  of the protected-mode code to the payload.
+
+  The payload may be compressed. The format of both the compressed and
+  uncompressed data should be determined using the standard magic
+  numbers.  The currently supported compression formats are gzip
+  (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
+  (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number
+  02 21) and ZSTD (magic number 28 B5). The uncompressed payload is
+  currently always ELF (magic number 7F 45 4C 46).
+
+============	==============
+Field name:	payload_length
+Type:		read
+Offset/size:	0x24c/4
+Protocol:	2.08+
+============	==============
+
+  The length of the payload.
+
+============	===============
+Field name:	setup_data
+Type:		write (special)
+Offset/size:	0x250/8
+Protocol:	2.09+
+============	===============
+
+  The 64-bit physical pointer to NULL terminated single linked list of
+  struct setup_data. This is used to define a more extensible boot
+  parameters passing mechanism. The definition of struct setup_data is
+  as follow::
+
+	struct setup_data {
+		u64 next;
+		u32 type;
+		u32 len;
+		u8  data[0];
+	};
+
+  Where, the next is a 64-bit physical pointer to the next node of
+  linked list, the next field of the last node is 0; the type is used
+  to identify the contents of data; the len is the length of data
+  field; the data holds the real payload.
+
+  This list may be modified at a number of points during the bootup
+  process.  Therefore, when modifying this list one should always make
+  sure to consider the case where the linked list already contains
+  entries.
+
+  The setup_data is a bit awkward to use for extremely large data objects,
+  both because the setup_data header has to be adjacent to the data object
+  and because it has a 32-bit length field. However, it is important that
+  intermediate stages of the boot process have a way to identify which
+  chunks of memory are occupied by kernel data.
+
+  Thus setup_indirect struct and SETUP_INDIRECT type were introduced in
+  protocol 2.15::
+
+    struct setup_indirect {
+      __u32 type;
+      __u32 reserved;  /* Reserved, must be set to zero. */
+      __u64 len;
+      __u64 addr;
+    };
+
+  The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be
+  SETUP_INDIRECT itself since making the setup_indirect a tree structure
+  could require a lot of stack space in something that needs to parse it
+  and stack space can be limited in boot contexts.
+
+  Let's give an example how to point to SETUP_E820_EXT data using setup_indirect.
+  In this case setup_data and setup_indirect will look like this::
+
+    struct setup_data {
+      __u64 next = 0 or <addr_of_next_setup_data_struct>;
+      __u32 type = SETUP_INDIRECT;
+      __u32 len = sizeof(setup_indirect);
+      __u8 data[sizeof(setup_indirect)] = struct setup_indirect {
+        __u32 type = SETUP_INDIRECT | SETUP_E820_EXT;
+        __u32 reserved = 0;
+        __u64 len = <len_of_SETUP_E820_EXT_data>;
+        __u64 addr = <addr_of_SETUP_E820_EXT_data>;
+      }
+    }
+
+.. note::
+     SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished
+     from SETUP_INDIRECT itself. So, this kind of objects cannot be provided
+     by the bootloaders.
+
+============	============
+Field name:	pref_address
+Type:		read (reloc)
+Offset/size:	0x258/8
+Protocol:	2.10+
+============	============
+
+  This field, if nonzero, represents a preferred load address for the
+  kernel.  A relocating bootloader should attempt to load at this
+  address if possible.
+
+  A non-relocatable kernel will unconditionally move itself and to run
+  at this address.
+
+============	=======
+Field name:	init_size
+Type:		read
+Offset/size:	0x260/4
+============	=======
+
+  This field indicates the amount of linear contiguous memory starting
+  at the kernel runtime start address that the kernel needs before it
+  is capable of examining its memory map.  This is not the same thing
+  as the total amount of memory the kernel needs to boot, but it can
+  be used by a relocating boot loader to help select a safe load
+  address for the kernel.
+
+  The kernel runtime start address is determined by the following algorithm::
+
+	if (relocatable_kernel)
+	runtime_start = align_up(load_address, kernel_alignment)
+	else
+	runtime_start = pref_address
+
+============	===============
+Field name:	handover_offset
+Type:		read
+Offset/size:	0x264/4
+============	===============
+
+  This field is the offset from the beginning of the kernel image to
+  the EFI handover protocol entry point. Boot loaders using the EFI
+  handover protocol to boot the kernel should jump to this offset.
+
+  See EFI HANDOVER PROTOCOL below for more details.
+
+============	==================
+Field name:	kernel_info_offset
+Type:		read
+Offset/size:	0x268/4
+Protocol:	2.15+
+============	==================
+
+  This field is the offset from the beginning of the kernel image to the
+  kernel_info. The kernel_info structure is embedded in the Linux image
+  in the uncompressed protected mode region.
+
+
+The kernel_info
+===============
+
+The relationships between the headers are analogous to the various data
+sections:
+
+  setup_header = .data
+  boot_params/setup_data = .bss
+
+What is missing from the above list? That's right:
+
+  kernel_info = .rodata
+
+We have been (ab)using .data for things that could go into .rodata or .bss for
+a long time, for lack of alternatives and -- especially early on -- inertia.
+Also, the BIOS stub is responsible for creating boot_params, so it isn't
+available to a BIOS-based loader (setup_data is, though).
+
+setup_header is permanently limited to 144 bytes due to the reach of the
+2-byte jump field, which doubles as a length field for the structure, combined
+with the size of the "hole" in struct boot_params that a protected-mode loader
+or the BIOS stub has to copy it into. It is currently 119 bytes long, which
+leaves us with 25 very precious bytes. This isn't something that can be fixed
+without revising the boot protocol entirely, breaking backwards compatibility.
+
+boot_params proper is limited to 4096 bytes, but can be arbitrarily extended
+by adding setup_data entries. It cannot be used to communicate properties of
+the kernel image, because it is .bss and has no image-provided content.
+
+kernel_info solves this by providing an extensible place for information about
+the kernel image. It is readonly, because the kernel cannot rely on a
+bootloader copying its contents anywhere, but that is OK; if it becomes
+necessary it can still contain data items that an enabled bootloader would be
+expected to copy into a setup_data chunk.
+
+All kernel_info data should be part of this structure. Fixed size data have to
+be put before kernel_info_var_len_data label. Variable size data have to be put
+after kernel_info_var_len_data label. Each chunk of variable size data has to
+be prefixed with header/magic and its size, e.g.::
+
+  kernel_info:
+          .ascii  "LToP"          /* Header, Linux top (structure). */
+          .long   kernel_info_var_len_data - kernel_info
+          .long   kernel_info_end - kernel_info
+          .long   0x01234567      /* Some fixed size data for the bootloaders. */
+  kernel_info_var_len_data:
+  example_struct:                 /* Some variable size data for the bootloaders. */
+          .ascii  "0123"          /* Header/Magic. */
+          .long   example_struct_end - example_struct
+          .ascii  "Struct"
+          .long   0x89012345
+  example_struct_end:
+  example_strings:                /* Some variable size data for the bootloaders. */
+          .ascii  "ABCD"          /* Header/Magic. */
+          .long   example_strings_end - example_strings
+          .asciz  "String_0"
+          .asciz  "String_1"
+  example_strings_end:
+  kernel_info_end:
+
+This way the kernel_info is self-contained blob.
+
+.. note::
+     Each variable size data header/magic can be any 4-character string,
+     without \0 at the end of the string, which does not collide with
+     existing variable length data headers/magics.
+
+
+Details of the kernel_info Fields
+=================================
+
+============	========
+Field name:	header
+Offset/size:	0x0000/4
+============	========
+
+  Contains the magic number "LToP" (0x506f544c).
+
+============	========
+Field name:	size
+Offset/size:	0x0004/4
+============	========
+
+  This field contains the size of the kernel_info including kernel_info.header.
+  It does not count kernel_info.kernel_info_var_len_data size. This field should be
+  used by the bootloaders to detect supported fixed size fields in the kernel_info
+  and beginning of kernel_info.kernel_info_var_len_data.
+
+============	========
+Field name:	size_total
+Offset/size:	0x0008/4
+============	========
+
+  This field contains the size of the kernel_info including kernel_info.header
+  and kernel_info.kernel_info_var_len_data.
+
+============	==============
+Field name:	setup_type_max
+Offset/size:	0x000c/4
+============	==============
+
+  This field contains maximal allowed type for setup_data and setup_indirect structs.
+
+
+The Image Checksum
+==================
+
+From boot protocol version 2.08 onwards the CRC-32 is calculated over
+the entire file using the characteristic polynomial 0x04C11DB7 and an
+initial remainder of 0xffffffff.  The checksum is appended to the
+file; therefore the CRC of the file up to the limit specified in the
+syssize field of the header is always 0.
+
+
+The Kernel Command Line
+=======================
+
+The kernel command line has become an important way for the boot
+loader to communicate with the kernel.  Some of its options are also
+relevant to the boot loader itself, see "special command line options"
+below.
+
+The kernel command line is a null-terminated string. The maximum
+length can be retrieved from the field cmdline_size.  Before protocol
+version 2.06, the maximum was 255 characters.  A string that is too
+long will be automatically truncated by the kernel.
+
+If the boot protocol version is 2.02 or later, the address of the
+kernel command line is given by the header field cmd_line_ptr (see
+above.)  This address can be anywhere between the end of the setup
+heap and 0xA0000.
+
+If the protocol version is *not* 2.02 or higher, the kernel
+command line is entered using the following protocol:
+
+  - At offset 0x0020 (word), "cmd_line_magic", enter the magic
+    number 0xA33F.
+
+  - At offset 0x0022 (word), "cmd_line_offset", enter the offset
+    of the kernel command line (relative to the start of the
+    real-mode kernel).
+
+  - The kernel command line *must* be within the memory region
+    covered by setup_move_size, so you may need to adjust this
+    field.
+
+
+Memory Layout of The Real-Mode Code
+===================================
+
+The real-mode code requires a stack/heap to be set up, as well as
+memory allocated for the kernel command line.  This needs to be done
+in the real-mode accessible memory in bottom megabyte.
+
+It should be noted that modern machines often have a sizable Extended
+BIOS Data Area (EBDA).  As a result, it is advisable to use as little
+of the low megabyte as possible.
+
+Unfortunately, under the following circumstances the 0x90000 memory
+segment has to be used:
+
+	- When loading a zImage kernel ((loadflags & 0x01) == 0).
+	- When loading a 2.01 or earlier boot protocol kernel.
+
+.. note::
+     For the 2.00 and 2.01 boot protocols, the real-mode code
+     can be loaded at another address, but it is internally
+     relocated to 0x90000.  For the "old" protocol, the
+     real-mode code must be loaded at 0x90000.
+
+When loading at 0x90000, avoid using memory above 0x9a000.
+
+For boot protocol 2.02 or higher, the command line does not have to be
+located in the same 64K segment as the real-mode setup code; it is
+thus permitted to give the stack/heap the full 64K segment and locate
+the command line above it.
+
+The kernel command line should not be located below the real-mode
+code, nor should it be located in high memory.
+
+
+Sample Boot Configuartion
+=========================
+
+As a sample configuration, assume the following layout of the real
+mode segment.
+
+    When loading below 0x90000, use the entire segment:
+
+        =============	===================
+	0x0000-0x7fff	Real mode kernel
+	0x8000-0xdfff	Stack and heap
+	0xe000-0xffff	Kernel command line
+	=============	===================
+
+    When loading at 0x90000 OR the protocol version is 2.01 or earlier:
+
+	=============	===================
+	0x0000-0x7fff	Real mode kernel
+	0x8000-0x97ff	Stack and heap
+	0x9800-0x9fff	Kernel command line
+	=============	===================
+
+Such a boot loader should enter the following fields in the header::
+
+	unsigned long base_ptr;	/* base address for real-mode segment */
+
+	if ( setup_sects == 0 ) {
+		setup_sects = 4;
+	}
+
+	if ( protocol >= 0x0200 ) {
+		type_of_loader = <type code>;
+		if ( loading_initrd ) {
+			ramdisk_image = <initrd_address>;
+			ramdisk_size = <initrd_size>;
+		}
+
+		if ( protocol >= 0x0202 && loadflags & 0x01 )
+			heap_end = 0xe000;
+		else
+			heap_end = 0x9800;
+
+		if ( protocol >= 0x0201 ) {
+			heap_end_ptr = heap_end - 0x200;
+			loadflags |= 0x80; /* CAN_USE_HEAP */
+		}
+
+		if ( protocol >= 0x0202 ) {
+			cmd_line_ptr = base_ptr + heap_end;
+			strcpy(cmd_line_ptr, cmdline);
+		} else {
+			cmd_line_magic	= 0xA33F;
+			cmd_line_offset = heap_end;
+			setup_move_size = heap_end + strlen(cmdline)+1;
+			strcpy(base_ptr+cmd_line_offset, cmdline);
+		}
+	} else {
+		/* Very old kernel */
+
+		heap_end = 0x9800;
+
+		cmd_line_magic	= 0xA33F;
+		cmd_line_offset = heap_end;
+
+		/* A very old kernel MUST have its real-mode code
+		   loaded at 0x90000 */
+
+		if ( base_ptr != 0x90000 ) {
+			/* Copy the real-mode kernel */
+			memcpy(0x90000, base_ptr, (setup_sects+1)*512);
+			base_ptr = 0x90000;		 /* Relocated */
+		}
+
+		strcpy(0x90000+cmd_line_offset, cmdline);
+
+		/* It is recommended to clear memory up to the 32K mark */
+		memset(0x90000 + (setup_sects+1)*512, 0,
+		       (64-(setup_sects+1))*512);
+	}
+
+
+Loading The Rest of The Kernel
+==============================
+
+The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
+in the kernel file (again, if setup_sects == 0 the real value is 4.)
+It should be loaded at address 0x10000 for Image/zImage kernels and
+0x100000 for bzImage kernels.
+
+The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
+bit (LOAD_HIGH) in the loadflags field is set::
+
+	is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
+	load_address = is_bzImage ? 0x100000 : 0x10000;
+
+Note that Image/zImage kernels can be up to 512K in size, and thus use
+the entire 0x10000-0x90000 range of memory.  This means it is pretty
+much a requirement for these kernels to load the real-mode part at
+0x90000.  bzImage kernels allow much more flexibility.
+
+Special Command Line Options
+============================
+
+If the command line provided by the boot loader is entered by the
+user, the user may expect the following command line options to work.
+They should normally not be deleted from the kernel command line even
+though not all of them are actually meaningful to the kernel.  Boot
+loader authors who need additional command line options for the boot
+loader itself should get them registered in
+Documentation/admin-guide/kernel-parameters.rst to make sure they will not
+conflict with actual kernel options now or in the future.
+
+  vga=<mode>
+	<mode> here is either an integer (in C notation, either
+	decimal, octal, or hexadecimal) or one of the strings
+	"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
+	(meaning 0xFFFD).  This value should be entered into the
+	vid_mode field, as it is used by the kernel before the command
+	line is parsed.
+
+  mem=<size>
+	<size> is an integer in C notation optionally followed by
+	(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
+	<< 30, << 40, << 50 or << 60).  This specifies the end of
+	memory to the kernel. This affects the possible placement of
+	an initrd, since an initrd should be placed near end of
+	memory.  Note that this is an option to *both* the kernel and
+	the bootloader!
+
+  initrd=<file>
+	An initrd should be loaded.  The meaning of <file> is
+	obviously bootloader-dependent, and some boot loaders
+	(e.g. LILO) do not have such a command.
+
+In addition, some boot loaders add the following options to the
+user-specified command line:
+
+  BOOT_IMAGE=<file>
+	The boot image which was loaded.  Again, the meaning of <file>
+	is obviously bootloader-dependent.
+
+  auto
+	The kernel was booted without explicit user intervention.
+
+If these options are added by the boot loader, it is highly
+recommended that they are located *first*, before the user-specified
+or configuration-specified command line.  Otherwise, "init=/bin/sh"
+gets confused by the "auto" option.
+
+
+Running the Kernel
+==================
+
+The kernel is started by jumping to the kernel entry point, which is
+located at *segment* offset 0x20 from the start of the real mode
+kernel.  This means that if you loaded your real-mode kernel code at
+0x90000, the kernel entry point is 9020:0000.
+
+At entry, ds = es = ss should point to the start of the real-mode
+kernel code (0x9000 if the code is loaded at 0x90000), sp should be
+set up properly, normally pointing to the top of the heap, and
+interrupts should be disabled.  Furthermore, to guard against bugs in
+the kernel, it is recommended that the boot loader sets fs = gs = ds =
+es = ss.
+
+In our example from above, we would do::
+
+	/* Note: in the case of the "old" kernel protocol, base_ptr must
+	   be == 0x90000 at this point; see the previous sample code */
+
+	seg = base_ptr >> 4;
+
+	cli();	/* Enter with interrupts disabled! */
+
+	/* Set up the real-mode kernel stack */
+	_SS = seg;
+	_SP = heap_end;
+
+	_DS = _ES = _FS = _GS = seg;
+	jmp_far(seg+0x20, 0);	/* Run the kernel */
+
+If your boot sector accesses a floppy drive, it is recommended to
+switch off the floppy motor before running the kernel, since the
+kernel boot leaves interrupts off and thus the motor will not be
+switched off, especially if the loaded kernel has the floppy driver as
+a demand-loaded module!
+
+
+Advanced Boot Loader Hooks
+==========================
+
+If the boot loader runs in a particularly hostile environment (such as
+LOADLIN, which runs under DOS) it may be impossible to follow the
+standard memory location requirements.  Such a boot loader may use the
+following hooks that, if set, are invoked by the kernel at the
+appropriate time.  The use of these hooks should probably be
+considered an absolutely last resort!
+
+IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
+%edi across invocation.
+
+  realmode_swtch:
+	A 16-bit real mode far subroutine invoked immediately before
+	entering protected mode.  The default routine disables NMI, so
+	your routine should probably do so, too.
+
+  code32_start:
+	A 32-bit flat-mode routine *jumped* to immediately after the
+	transition to protected mode, but before the kernel is
+	uncompressed.  No segments, except CS, are guaranteed to be
+	set up (current kernels do, but older ones do not); you should
+	set them up to BOOT_DS (0x18) yourself.
+
+	After completing your hook, you should jump to the address
+	that was in this field before your boot loader overwrote it
+	(relocated, if appropriate.)
+
+
+32-bit Boot Protocol
+====================
+
+For machine with some new BIOS other than legacy BIOS, such as EFI,
+LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
+based on legacy BIOS can not be used, so a 32-bit boot protocol needs
+to be defined.
+
+In 32-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+should be allocated and initialized to all zero. Then the setup header
+from offset 0x01f1 of kernel image on should be loaded into struct
+boot_params and examined. The end of setup header can be calculated as
+follow::
+
+	0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as
+described in chapter Documentation/arch/x86/zero-page.rst.
+
+After setting up the struct boot_params, the boot loader can load the
+32/64-bit kernel in the same way as that of 16-bit boot protocol.
+
+In 32-bit boot protocol, the kernel is started by jumping to the
+32-bit kernel entry point, which is the start address of loaded
+32/64-bit kernel.
+
+At entry, the CPU must be in 32-bit protected mode with paging
+disabled; a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
+address of the struct boot_params; %ebp, %edi and %ebx must be zero.
+
+64-bit Boot Protocol
+====================
+
+For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader
+and we need a 64-bit boot protocol.
+
+In 64-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+could be allocated anywhere (even above 4G) and initialized to all zero.
+Then, the setup header at offset 0x01f1 of kernel image on should be
+loaded into struct boot_params and examined. The end of setup header
+can be calculated as follows::
+
+	0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as described
+in chapter Documentation/arch/x86/zero-page.rst.
+
+After setting up the struct boot_params, the boot loader can load
+64-bit kernel in the same way as that of 16-bit boot protocol, but
+kernel could be loaded above 4G.
+
+In 64-bit boot protocol, the kernel is started by jumping to the
+64-bit kernel entry point, which is the start address of loaded
+64-bit kernel plus 0x200.
+
+At entry, the CPU must be in 64-bit mode with paging enabled.
+The range with setup_header.init_size from start address of loaded
+kernel and zero page and command line buffer get ident mapping;
+a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base
+address of the struct boot_params.
+
+EFI Handover Protocol (deprecated)
+==================================
+
+This protocol allows boot loaders to defer initialisation to the EFI
+boot stub. The boot loader is required to load the kernel/initrd(s)
+from the boot media and jump to the EFI handover protocol entry point
+which is hdr->handover_offset bytes from the beginning of
+startup_{32,64}.
+
+The boot loader MUST respect the kernel's PE/COFF metadata when it comes
+to section alignment, the memory footprint of the executable image beyond
+the size of the file itself, and any other aspect of the PE/COFF header
+that may affect correct operation of the image as a PE/COFF binary in the
+execution context provided by the EFI firmware.
+
+The function prototype for the handover entry point looks like this::
+
+    efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
+
+'handle' is the EFI image handle passed to the boot loader by the EFI
+firmware, 'table' is the EFI system table - these are the first two
+arguments of the "handoff state" as described in section 2.3 of the
+UEFI specification. 'bp' is the boot loader-allocated boot params.
+
+The boot loader *must* fill out the following fields in bp::
+
+  - hdr.cmd_line_ptr
+  - hdr.ramdisk_image (if applicable)
+  - hdr.ramdisk_size  (if applicable)
+
+All other fields should be zero.
+
+NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
+      entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
+      loading protocol (refer to [0] for an example of the bootloader side of
+      this), which removes the need for any knowledge on the part of the EFI
+      bootloader regarding the internal representation of boot_params or any
+      requirements/limitations regarding the placement of the command line
+      and ramdisk in memory, or the placement of the kernel image itself.
+
+[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0
diff --git a/Documentation/arch/x86/booting-dt.rst b/Documentation/arch/x86/booting-dt.rst
new file mode 100644
index 000000000000..b089ffd56e6e
--- /dev/null
+++ b/Documentation/arch/x86/booting-dt.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+DeviceTree Booting
+------------------
+
+  There is one single 32bit entry point to the kernel at code32_start,
+  the decompressor (the real mode entry point goes to the same  32bit
+  entry point once it switched into protected mode). That entry point
+  supports one calling convention which is documented in
+  Documentation/arch/x86/boot.rst
+  The physical pointer to the device-tree block is passed via setup_data
+  which requires at least boot protocol 2.09.
+  The type filed is defined as
+
+  #define SETUP_DTB                      2
+
+  This device-tree is used as an extension to the "boot page". As such it
+  does not parse / consider data which is already covered by the boot
+  page. This includes memory size, reserved ranges, command line arguments
+  or initrd address. It simply holds information which can not be retrieved
+  otherwise like interrupt routing or a list of devices behind an I2C bus.
diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst
new file mode 100644
index 000000000000..31ec0ef78086
--- /dev/null
+++ b/Documentation/arch/x86/buslock.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+===============================
+Bus lock detection and handling
+===============================
+
+:Copyright: |copy| 2021 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+          - Tony Luck <tony.luck@intel.com>
+
+Problem
+=======
+
+A split lock is any atomic operation whose operand crosses two cache lines.
+Since the operand spans two cache lines and the operation must be atomic,
+the system locks the bus while the CPU accesses the two cache lines.
+
+A bus lock is acquired through either split locked access to writeback (WB)
+memory or any locked access to non-WB memory. This is typically thousands of
+cycles slower than an atomic operation within a cache line. It also disrupts
+performance on other cores and brings the whole system to its knees.
+
+Detection
+=========
+
+Intel processors may support either or both of the following hardware
+mechanisms to detect split locks and bus locks.
+
+#AC exception for split lock detection
+--------------------------------------
+
+Beginning with the Tremont Atom CPU split lock operations may raise an
+Alignment Check (#AC) exception when a split lock operation is attemped.
+
+#DB exception for bus lock detection
+------------------------------------
+
+Some CPUs have the ability to notify the kernel by an #DB trap after a user
+instruction acquires a bus lock and is executed. This allows the kernel to
+terminate the application or to enforce throttling.
+
+Software handling
+=================
+
+The kernel #AC and #DB handlers handle bus lock based on the kernel
+parameter "split_lock_detect". Here is a summary of different options:
+
++------------------+----------------------------+-----------------------+
+|split_lock_detect=|#AC for split lock		|#DB for bus lock	|
++------------------+----------------------------+-----------------------+
+|off	  	   |Do nothing			|Do nothing		|
++------------------+----------------------------+-----------------------+
+|warn		   |Kernel OOPs			|Warn once per task and |
+|(default)	   |Warn once per task, add a	|and continues to run.  |
+|		   |delay, add synchronization	|			|
+|		   |to prevent more than one	|			|
+|		   |core from executing a	|			|
+|		   |split lock in parallel.	|			|
+|		   |sysctl split_lock_mitigate	|			|
+|		   |can be used to avoid the	|			|
+|		   |delay and synchronization	|			|
+|		   |When both features are	|			|
+|		   |supported, warn in #AC	|			|
++------------------+----------------------------+-----------------------+
+|fatal		   |Kernel OOPs			|Send SIGBUS to user.	|
+|		   |Send SIGBUS to user		|			|
+|		   |When both features are	|			|
+|		   |supported, fatal in #AC	|			|
++------------------+----------------------------+-----------------------+
+|ratelimit:N	   |Do nothing			|Limit bus lock rate to	|
+|(0 < N <= 1000)   |				|N bus locks per second	|
+|		   |				|system wide and warn on|
+|		   |				|bus locks.		|
++------------------+----------------------------+-----------------------+
+
+Usages
+======
+
+Detecting and handling bus lock may find usages in various areas:
+
+It is critical for real time system designers who build consolidated real
+time systems. These systems run hard real time code on some cores and run
+"untrusted" user processes on other cores. The hard real time cannot afford
+to have any bus lock from the untrusted processes to hurt real time
+performance. To date the designers have been unable to deploy these
+solutions as they have no way to prevent the "untrusted" user code from
+generating split lock and bus lock to block the hard real time code to
+access memory during bus locking.
+
+It's also useful for general computing to prevent guests or user
+applications from slowing down the overall system by executing instructions
+with bus lock.
+
+
+Guidance
+========
+off
+---
+
+Disable checking for split lock and bus lock. This option can be useful if
+there are legacy applications that trigger these events at a low rate so
+that mitigation is not needed.
+
+warn
+----
+
+A warning is emitted when a bus lock is detected which allows to identify
+the offending application. This is the default behavior.
+
+fatal
+-----
+
+In this case, the bus lock is not tolerated and the process is killed.
+
+ratelimit
+---------
+
+A system wide bus lock rate limit N is specified where 0 < N <= 1000. This
+allows a bus lock rate up to N bus locks per second. When the bus lock rate
+is exceeded then any task which is caught via the buslock #DB exception is
+throttled by enforced sleeps until the rate goes under the limit again.
+
+This is an effective mitigation in cases where a minimal impact can be
+tolerated, but an eventual Denial of Service attack has to be prevented. It
+allows to identify the offending processes and analyze whether they are
+malicious or just badly written.
+
+Selecting a rate limit of 1000 allows the bus to be locked for up to about
+seven million cycles each second (assuming 7000 cycles for each bus
+lock). On a 2 GHz processor that would be about 0.35% system slowdown.
diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
new file mode 100644
index 000000000000..08246e8ac835
--- /dev/null
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -0,0 +1,154 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+x86 Feature Flags
+=================
+
+Introduction
+============
+
+On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition
+in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature
+or KVM want to expose the feature to a KVM guest, it can and should have
+an X86_FEATURE_* defined. These flags represent hardware features as
+well as software features.
+
+If users want to know if a feature is available on a given system, they
+try to find the flag in /proc/cpuinfo. If a given flag is present, it
+means that the kernel supports it and is currently making it available.
+If such flag represents a hardware feature, it also means that the
+hardware supports it.
+
+If the expected flag does not appear in /proc/cpuinfo, things are murkier.
+Users need to find out the reason why the flag is missing and find the way
+how to enable it, which is not always easy. There are several factors that
+can explain missing flags: the expected feature failed to enable, the feature
+is missing in hardware, platform firmware did not enable it, the feature is
+disabled at build or run time, an old kernel is in use, or the kernel does
+not support the feature and thus has not enabled it. In general, /proc/cpuinfo
+shows features which the kernel supports. For a full list of CPUID flags
+which the CPU supports, use tools/arch/x86/kcpuid.
+
+How are feature flags created?
+==============================
+
+a: Feature flags can be derived from the contents of CPUID leaves.
+------------------------------------------------------------------
+These feature definitions are organized mirroring the layout of CPUID
+leaves and grouped in words with offsets as mapped in enum cpuid_leafs
+in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details).
+If a feature is defined with a X86_FEATURE_<name> definition in
+cpufeatures.h, and if it is detected at run time, the flags will be
+displayed accordingly in /proc/cpuinfo. For example, the flag "avx2"
+comes from X86_FEATURE_AVX2 in cpufeatures.h.
+
+b: Flags can be from scattered CPUID-based features.
+----------------------------------------------------
+Hardware features enumerated in sparsely populated CPUID leaves get
+software-defined values. Still, CPUID needs to be queried to determine
+if a given feature is present. This is done in init_scattered_cpuid_features().
+For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is
+checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1].
+
+The intent of scattering CPUID leaves is to not bloat struct
+cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf
+[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1]
+has only one feature and would waste 31 bits of space in the x86_capability[]
+array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted
+memory is not trivial.
+
+c: Flags can be created synthetically under certain conditions for hardware features.
+-------------------------------------------------------------------------------------
+Examples of conditions include whether certain features are present in
+MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed
+conditions are met, the features are enabled by the set_cpu_cap or
+setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS,
+the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and
+"split_lock_detect" will be displayed. The flag "ring3mwait" will be
+displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors.
+
+d: Flags can represent purely software features.
+------------------------------------------------
+These flags do not represent hardware features. Instead, they represent a
+software feature implemented in the kernel. For example, Kernel Page Table
+Isolation is purely software feature and its feature flag X86_FEATURE_PTI is
+also defined in cpufeatures.h.
+
+Naming of Flags
+===============
+
+The script arch/x86/kernel/cpu/mkcapflags.sh processes the
+#define X86_FEATURE_<name> from cpufeatures.h and generates the
+x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the
+resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming
+of flags in the x86_cap/bug_flags[] are as follows:
+
+a: The name of the flag is from the string in X86_FEATURE_<name> by default.
+----------------------------------------------------------------------------
+By default, the flag <name> in /proc/cpuinfo is extracted from the respective
+X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from
+X86_FEATURE_AVX2.
+
+b: The naming can be overridden.
+--------------------------------
+If the comment on the line for the #define X86_FEATURE_* starts with a
+double-quote character (""), the string inside the double-quote characters
+will be the name of the flags. For example, the flag "sse4_1" comes from
+the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition.
+
+There are situations in which overriding the displayed name of the flag is
+needed. For instance, /proc/cpuinfo is a userspace interface and must remain
+constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one
+shall override the new naming with the name already used in /proc/cpuinfo.
+
+c: The naming override can be "", which means it will not appear in /proc/cpuinfo.
+----------------------------------------------------------------------------------
+The feature shall be omitted from /proc/cpuinfo if it does not make sense for
+the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is
+defined in cpufeatures.h but that flag is an internal kernel feature used
+in the alternative runtime patching functionality. So, its name is overridden
+with "". Its flag will not appear in /proc/cpuinfo.
+
+Flags are missing when one or more of these happen
+==================================================
+
+a: The hardware does not enumerate support for it.
+--------------------------------------------------
+For example, when a new kernel is running on old hardware or the feature is
+not enabled by boot firmware. Even if the hardware is new, there might be a
+problem enabling the feature at run time, the flag will not be displayed.
+
+b: The kernel does not know about the flag.
+-------------------------------------------
+For example, when an old kernel is running on new hardware.
+
+c: The kernel disabled support for it at compile-time.
+------------------------------------------------------
+For example, if 5-level-paging is not enabled when building (i.e.,
+CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_.
+Even though the feature will still be detected via CPUID, the kernel disables
+it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57).
+
+d: The feature is disabled at boot-time.
+----------------------------------------
+A feature can be disabled either using a command-line parameter or because
+it failed to be enabled. The command-line parameter clearcpuid= can be used
+to disable features using the feature number as defined in
+/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction
+Protection can be disabled using clearcpuid=514. The number 514 is calculated
+from #define X86_FEATURE_UMIP (16*32 + 2).
+
+In addition, there exists a variety of custom command-line parameters that
+disable specific features. The list of parameters includes, but is not limited
+to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using
+"no5lvl".
+
+e: The feature was known to be non-functional.
+----------------------------------------------
+The feature was known to be non-functional because a dependency was
+missing at runtime. For example, AVX flags will not show up if XSAVE feature
+is disabled since they depend on XSAVE feature. Another example would be broken
+CPUs and them missing microcode patches. Due to that, the kernel decides not to
+enable a feature.
+
+.. [#f1] 5-level paging uses linear address of 57 bits.
diff --git a/Documentation/arch/x86/earlyprintk.rst b/Documentation/arch/x86/earlyprintk.rst
new file mode 100644
index 000000000000..51ef11e8f725
--- /dev/null
+++ b/Documentation/arch/x86/earlyprintk.rst
@@ -0,0 +1,151 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Early Printk
+============
+
+Mini-HOWTO for using the earlyprintk=dbgp boot option with a
+USB2 Debug port key and a debug cable, on x86 systems.
+
+You need two computers, the 'USB debug key' special gadget and
+two USB cables, connected like this::
+
+  [host/target] <-------> [USB debug key] <-------> [client/console]
+
+Hardware requirements
+=====================
+
+  a) Host/target system needs to have USB debug port capability.
+
+     You can check this capability by looking at a 'Debug port' bit in
+     the lspci -vvv output::
+
+       # lspci -vvv
+       ...
+       00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
+               Subsystem: Lenovo ThinkPad T61
+               Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
+               Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+               Latency: 0
+               Interrupt: pin D routed to IRQ 19
+               Region 0: Memory at fe227000 (32-bit, non-prefetchable) [size=1K]
+               Capabilities: [50] Power Management version 2
+                       Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
+                       Status: D0 PME-Enable- DSel=0 DScale=0 PME+
+               Capabilities: [58] Debug port: BAR=1 offset=00a0
+                            ^^^^^^^^^^^ <==================== [ HERE ]
+               Kernel driver in use: ehci_hcd
+               Kernel modules: ehci-hcd
+       ...
+
+     .. note::
+       If your system does not list a debug port capability then you probably
+       won't be able to use the USB debug key.
+
+  b) You also need a NetChip USB debug cable/key:
+
+        http://www.plxtech.com/products/NET2000/NET20DC/default.asp
+
+     This is a small blue plastic connector with two USB connections;
+     it draws power from its USB connections.
+
+  c) You need a second client/console system with a high speed USB 2.0 port.
+
+  d) The NetChip device must be plugged directly into the physical
+     debug port on the "host/target" system. You cannot use a USB hub in
+     between the physical debug port and the "host/target" system.
+
+     The EHCI debug controller is bound to a specific physical USB
+     port and the NetChip device will only work as an early printk
+     device in this port.  The EHCI host controllers are electrically
+     wired such that the EHCI debug controller is hooked up to the
+     first physical port and there is no way to change this via software.
+     You can find the physical port through experimentation by trying
+     each physical port on the system and rebooting.  Or you can try
+     and use lsusb or look at the kernel info messages emitted by the
+     usb stack when you plug a usb device into various ports on the
+     "host/target" system.
+
+     Some hardware vendors do not expose the usb debug port with a
+     physical connector and if you find such a device send a complaint
+     to the hardware vendor, because there is no reason not to wire
+     this port into one of the physically accessible ports.
+
+  e) It is also important to note, that many versions of the NetChip
+     device require the "client/console" system to be plugged into the
+     right hand side of the device (with the product logo facing up and
+     readable left to right).  The reason being is that the 5 volt
+     power supply is taken from only one side of the device and it
+     must be the side that does not get rebooted.
+
+Software requirements
+=====================
+
+  a) On the host/target system:
+
+    You need to enable the following kernel config option::
+
+      CONFIG_EARLY_PRINTK_DBGP=y
+
+    And you need to add the boot command line: "earlyprintk=dbgp".
+
+    .. note::
+      If you are using Grub, append it to the 'kernel' line in
+      /etc/grub.conf.  If you are using Grub2 on a BIOS firmware system,
+      append it to the 'linux' line in /boot/grub2/grub.cfg. If you are
+      using Grub2 on an EFI firmware system, append it to the 'linux'
+      or 'linuxefi' line in /boot/grub2/grub.cfg or
+      /boot/efi/EFI/<distro>/grub.cfg.
+
+    On systems with more than one EHCI debug controller you must
+    specify the correct EHCI debug controller number.  The ordering
+    comes from the PCI bus enumeration of the EHCI controllers.  The
+    default with no number argument is "0" or the first EHCI debug
+    controller.  To use the second EHCI debug controller, you would
+    use the command line: "earlyprintk=dbgp1"
+
+    .. note::
+      normally earlyprintk console gets turned off once the
+      regular console is alive - use "earlyprintk=dbgp,keep" to keep
+      this channel open beyond early bootup. This can be useful for
+      debugging crashes under Xorg, etc.
+
+  b) On the client/console system:
+
+    You should enable the following kernel config option::
+
+      CONFIG_USB_SERIAL_DEBUG=y
+
+    On the next bootup with the modified kernel you should
+    get a /dev/ttyUSBx device(s).
+
+    Now this channel of kernel messages is ready to be used: start
+    your favorite terminal emulator (minicom, etc.) and set
+    it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to
+    see the raw output.
+
+  c) On Nvidia Southbridge based systems: the kernel will try to probe
+     and find out which port has a debug device connected.
+
+Testing
+=======
+
+You can test the output by using earlyprintk=dbgp,keep and provoking
+kernel messages on the host/target system. You can provoke a harmless
+kernel message by for example doing::
+
+     echo h > /proc/sysrq-trigger
+
+On the host/target system you should see this help line in "dmesg" output::
+
+     SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
+
+On the client/console system do::
+
+       cat /dev/ttyUSB0
+
+And you should see the help line above displayed shortly after you've
+provoked it on the host system.
+
+If it does not work then please ask about it on the linux-kernel@vger.kernel.org
+mailing list or contact the x86 maintainers.
diff --git a/Documentation/arch/x86/elf_auxvec.rst b/Documentation/arch/x86/elf_auxvec.rst
new file mode 100644
index 000000000000..18e4744717f9
--- /dev/null
+++ b/Documentation/arch/x86/elf_auxvec.rst
@@ -0,0 +1,53 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+x86-specific ELF Auxiliary Vectors
+==================================
+
+This document describes the semantics of the x86 auxiliary vectors.
+
+Introduction
+============
+
+ELF Auxiliary vectors enable the kernel to efficiently provide
+configuration-specific parameters to userspace. In this example, a program
+allocates an alternate stack based on the kernel-provided size::
+
+   #include <sys/auxv.h>
+   #include <elf.h>
+   #include <signal.h>
+   #include <stdlib.h>
+   #include <assert.h>
+   #include <err.h>
+
+   #ifndef AT_MINSIGSTKSZ
+   #define AT_MINSIGSTKSZ	51
+   #endif
+
+   ....
+   stack_t ss;
+
+   ss.ss_sp = malloc(ss.ss_size);
+   assert(ss.ss_sp);
+
+   ss.ss_size = getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ;
+   ss.ss_flags = 0;
+
+   if (sigaltstack(&ss, NULL))
+        err(1, "sigaltstack");
+
+
+The exposed auxiliary vectors
+=============================
+
+AT_SYSINFO is used for locating the vsyscall entry point.  It is not
+exported on 64-bit mode.
+
+AT_SYSINFO_EHDR is the start address of the page containing the vDSO.
+
+AT_MINSIGSTKSZ denotes the minimum stack size required by the kernel to
+deliver a signal to user-space.  AT_MINSIGSTKSZ comprehends the space
+consumed by the kernel to accommodate the user context for the current
+hardware configuration.  It does not comprehend subsequent user-space stack
+consumption, which must be added by the user.  (e.g. Above, user-space adds
+SIGSTKSZ to AT_MINSIGSTKSZ.)
diff --git a/Documentation/arch/x86/entry_64.rst b/Documentation/arch/x86/entry_64.rst
new file mode 100644
index 000000000000..0afdce3c06f4
--- /dev/null
+++ b/Documentation/arch/x86/entry_64.rst
@@ -0,0 +1,110 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Kernel Entries
+==============
+
+This file documents some of the kernel entries in
+arch/x86/entry/entry_64.S.  A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+https://lore.kernel.org/r/20110529191055.GC9835%40elte.hu
+
+The x86 architecture has quite a few different ways to jump into
+kernel code.  Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S
+for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally
+arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility
+syscall entry points and thus provides for 32-bit processes the
+ability to execute syscalls when running on 64-bit kernels.
+
+The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall
+   either way.
+
+ - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit
+   code
+
+ - interrupt: An array of entries.  Every IDT vector that doesn't
+   explicitly point somewhere else gets set to the corresponding
+   value in interrupts.  These point to a whole array of
+   magically-generated functions that make their way to common_interrupt()
+   with the interrupt number as a parameter.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+   like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here.  The different x86-64 entries
+have different calling conventions.  The syscall and sysenter
+instructions have their own peculiar calling conventions.  Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right.  (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky.  Swapgs
+toggles whether gs is the kernel gs or the user gs.  The swapgs
+instruction is rather fragile: it must nest perfectly and only in
+single depth, it should only be used if entering from user mode to
+kernel mode and then when returning to user-space, and precisely
+so. If we mess that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack::
+
+	xorl %ebx,%ebx
+	testl $3,CS+8(%rsp)
+	je error_kernelspace
+	SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies)::
+
+	movl $1,%ebx
+	movl $MSR_GS_BASE,%ecx
+	rdmsr
+	testl %edx,%edx
+	js 1f   /* negative -> in kernel */
+	SWAPGS
+	xorl %ebx,%ebx
+  1:	ret
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst
new file mode 100644
index 000000000000..efde1fef4fbd
--- /dev/null
+++ b/Documentation/arch/x86/exception-tables.rst
@@ -0,0 +1,357 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Kernel level exception handling
+===============================
+
+Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
+
+When a process runs in kernel mode, it often has to access user
+mode memory whose address has been passed by an untrusted program.
+To protect itself the kernel has to verify this address.
+
+In older versions of Linux this was done with the
+int verify_area(int type, const void * addr, unsigned long size)
+function (which has since been replaced by access_ok()).
+
+This function verified that the memory area starting at address
+'addr' and of size 'size' was accessible for the operation specified
+in type (read or write). To do this, verify_read had to look up the
+virtual memory area (vma) that contained the address addr. In the
+normal case (correctly working program), this test was successful.
+It only failed for a few buggy programs. In some kernel profiling
+tests, this normally unneeded verification used up a considerable
+amount of time.
+
+To overcome this situation, Linus decided to let the virtual memory
+hardware present in every Linux-capable CPU handle this test.
+
+How does this work?
+
+Whenever the kernel tries to access an address that is currently not
+accessible, the CPU generates a page fault exception and calls the
+page fault handler::
+
+  void exc_page_fault(struct pt_regs *regs, unsigned long error_code)
+
+in arch/x86/mm/fault.c. The parameters on the stack are set up by
+the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
+regs is a pointer to the saved registers on the stack, error_code
+contains a reason code for the exception.
+
+exc_page_fault() first obtains the inaccessible address from the CPU
+control register CR2. If the address is within the virtual address
+space of the process, the fault probably occurred, because the page
+was not swapped in, write protected or something similar. However,
+we are interested in the other case: the address is not valid, there
+is no vma that contains this address. In this case, the kernel jumps
+to the bad_area label.
+
+There it uses the address of the instruction that caused the exception
+(i.e. regs->eip) to find an address where the execution can continue
+(fixup). If this search is successful, the fault handler modifies the
+return address (again regs->eip) and returns. The execution will
+continue at the address in fixup.
+
+Where does fixup point to?
+
+Since we jump to the contents of fixup, fixup obviously points
+to executable code. This code is hidden inside the user access macros.
+I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
+as an example. The definition is somewhat hard to follow, so let's peek at
+the code generated by the preprocessor and the compiler. I selected
+the get_user() call in drivers/char/sysrq.c for a detailed examination.
+
+The original code in sysrq.c line 587::
+
+        get_user(c, buf);
+
+The preprocessor output (edited to become somewhat readable)::
+
+  (
+    {
+      long __gu_err = - 14 , __gu_val = 0;
+      const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
+      if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
+        (((sizeof(*(buf))) <= 0xC0000000UL) &&
+        ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
+        do {
+          __gu_err  = 0;
+          switch ((sizeof(*(buf)))) {
+            case 1:
+              __asm__ __volatile__(
+                "1:      mov" "b" " %2,%" "b" "1\n"
+                "2:\n"
+                ".section .fixup,\"ax\"\n"
+                "3:      movl %3,%0\n"
+                "        xor" "b" " %" "b" "1,%" "b" "1\n"
+                "        jmp 2b\n"
+                ".section __ex_table,\"a\"\n"
+                "        .align 4\n"
+                "        .long 1b,3b\n"
+                ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
+                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
+                break;
+            case 2:
+              __asm__ __volatile__(
+                "1:      mov" "w" " %2,%" "w" "1\n"
+                "2:\n"
+                ".section .fixup,\"ax\"\n"
+                "3:      movl %3,%0\n"
+                "        xor" "w" " %" "w" "1,%" "w" "1\n"
+                "        jmp 2b\n"
+                ".section __ex_table,\"a\"\n"
+                "        .align 4\n"
+                "        .long 1b,3b\n"
+                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
+                break;
+            case 4:
+              __asm__ __volatile__(
+                "1:      mov" "l" " %2,%" "" "1\n"
+                "2:\n"
+                ".section .fixup,\"ax\"\n"
+                "3:      movl %3,%0\n"
+                "        xor" "l" " %" "" "1,%" "" "1\n"
+                "        jmp 2b\n"
+                ".section __ex_table,\"a\"\n"
+                "        .align 4\n"        "        .long 1b,3b\n"
+                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+                              (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
+                break;
+            default:
+              (__gu_val) = __get_user_bad();
+          }
+        } while (0) ;
+      ((c)) = (__typeof__(*((buf))))__gu_val;
+      __gu_err;
+    }
+  );
+
+WOW! Black GCC/assembly magic. This is impossible to follow, so let's
+see what code gcc generates::
+
+ >         xorl %edx,%edx
+ >         movl current_set,%eax
+ >         cmpl $24,788(%eax)
+ >         je .L1424
+ >         cmpl $-1073741825,64(%esp)
+ >         ja .L1423
+ > .L1424:
+ >         movl %edx,%eax
+ >         movl 64(%esp),%ebx
+ > #APP
+ > 1:      movb (%ebx),%dl                /* this is the actual user access */
+ > 2:
+ > .section .fixup,"ax"
+ > 3:      movl $-14,%eax
+ >         xorb %dl,%dl
+ >         jmp 2b
+ > .section __ex_table,"a"
+ >         .align 4
+ >         .long 1b,3b
+ > .text
+ > #NO_APP
+ > .L1423:
+ >         movzbl %dl,%esi
+
+The optimizer does a good job and gives us something we can actually
+understand. Can we? The actual user access is quite obvious. Thanks
+to the unified address space we can just access the address in user
+memory. But what does the .section stuff do?????
+
+To understand this we have to look at the final kernel::
+
+ > objdump --section-headers vmlinux
+ >
+ > vmlinux:     file format elf32-i386
+ >
+ > Sections:
+ > Idx Name          Size      VMA       LMA       File off  Algn
+ >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
+ >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
+ >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
+ >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
+ >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
+ >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
+ >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
+ >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
+ >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
+ >                   CONTENTS, ALLOC, LOAD, DATA
+ >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
+ >                   ALLOC
+ >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
+ >                   CONTENTS, READONLY
+ >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
+ >                   CONTENTS, READONLY
+
+There are obviously 2 non standard ELF sections in the generated object
+file. But first we want to find out what happened to our code in the
+final kernel executable::
+
+ > objdump --disassemble --section=.text vmlinux
+ >
+ > c017e785 <do_con_write+c1> xorl   %edx,%edx
+ > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
+ > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
+ > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
+ > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
+ > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
+ > c017e79f <do_con_write+db> movl   %edx,%eax
+ > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
+ > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+ > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
+
+The whole user memory access is reduced to 10 x86 machine instructions.
+The instructions bracketed in the .section directives are no longer
+in the normal execution path. They are located in a different section
+of the executable file::
+
+ > objdump --disassemble --section=.fixup vmlinux
+ >
+ > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
+ > c0199ffa <.fixup+10ba> xorb   %dl,%dl
+ > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
+
+And finally::
+
+ > objdump --full-contents --section=__ex_table vmlinux
+ >
+ >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
+ >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
+ >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
+
+or in human readable byte order::
+
+ >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
+ >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
+                               ^^^^^^^^^^^^^^^^^
+                               this is the interesting part!
+ >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
+
+What happened? The assembly directives::
+
+  .section .fixup,"ax"
+  .section __ex_table,"a"
+
+told the assembler to move the following code to the specified
+sections in the ELF object file. So the instructions::
+
+  3:      movl $-14,%eax
+          xorb %dl,%dl
+          jmp 2b
+
+ended up in the .fixup section of the object file and the addresses::
+
+        .long 1b,3b
+
+ended up in the __ex_table section of the object file. 1b and 3b
+are local labels. The local label 1b (1b stands for next label 1
+backward) is the address of the instruction that might fault, i.e.
+in our case the address of the label 1 is c017e7a5:
+the original assembly code: > 1:      movb (%ebx),%dl
+and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+
+The local label 3 (backwards again) is the address of the code to handle
+the fault, in our case the actual value is c0199ff5:
+the original assembly code: > 3:      movl $-14,%eax
+and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
+
+If the fixup was able to handle the exception, control flow may be returned
+to the instruction after the one that triggered the fault, ie. local label 2b.
+
+The assembly code::
+
+ > .section __ex_table,"a"
+ >         .align 4
+ >         .long 1b,3b
+
+becomes the value pair::
+
+ >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
+                               ^this is ^this is
+                               1b       3b
+
+c017e7a5,c0199ff5 in the exception table of the kernel.
+
+So, what actually happens if a fault from kernel mode with no suitable
+vma occurs?
+
+#. access to invalid address::
+
+    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+#. MMU generates exception
+#. CPU calls exc_page_fault()
+#. exc_page_fault() calls do_user_addr_fault()
+#. do_user_addr_fault() calls kernelmode_fixup_or_oops()
+#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5);
+#. fixup_exception() calls search_exception_tables()
+#. search_exception_tables() looks up the address c017e7a5 in the
+   exception table (i.e. the contents of the ELF section __ex_table)
+   and returns the address of the associated fault handle code c0199ff5.
+#. fixup_exception() modifies its own return address to point to the fault
+   handle code and returns.
+#. execution continues in the fault handling code.
+#. a) EAX becomes -EFAULT (== -14)
+   b) DL  becomes zero (the value we "read" from user space)
+   c) execution continues at local label 2 (address of the
+      instruction immediately after the faulting user access).
+
+The steps 8a to 8c in a certain way emulate the faulting instruction.
+
+That's it, mostly. If you look at our example, you might ask why
+we set EAX to -EFAULT in the exception handler code. Well, the
+get_user() macro actually returns a value: 0, if the user access was
+successful, -EFAULT on failure. Our original code did not test this
+return value, however the inline assembly code in get_user() tries to
+return -EFAULT. GCC selected EAX to return this value.
+
+NOTE:
+Due to the way that the exception table is built and needs to be ordered,
+only use exceptions for code in the .text section.  Any other section
+will cause the exception table to not be sorted correctly, and the
+exceptions will fail.
+
+Things changed when 64-bit support was added to x86 Linux. Rather than
+double the size of the exception table by expanding the two entries
+from 32-bits to 64 bits, a clever trick was used to store addresses
+as relative offsets from the table itself. The assembly code changed
+from::
+
+    .long 1b,3b
+  to:
+          .long (from) - .
+          .long (to) - .
+
+and the C-code that uses these values converts back to absolute addresses
+like this::
+
+	ex_insn_addr(const struct exception_table_entry *x)
+	{
+		return (unsigned long)&x->insn + x->insn;
+	}
+
+In v4.6 the exception table entry was expanded with a new field "handler".
+This is also 32-bits wide and contains a third relative function
+pointer which points to one of:
+
+1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
+     This is legacy case that just jumps to the fixup code
+
+2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
+     This case provides the fault number of the trap that occurred at
+     entry->insn. It is used to distinguish page faults from machine
+     check.
+
+More functions can easily be added.
+
+CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post
+link of the kernel image, via a host utility scripts/sorttable. It will set the
+symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section
+at boot time. With the exception table sorted, at runtime when an exception
+occurs we can quickly lookup the __ex_table entry via binary search.
+
+This is not just a boot time optimization, some architectures require this
+table to be sorted in order to handle exceptions relatively early in the boot
+process. For example, i386 makes use of this form of exception handling before
+paging support is even enabled!
diff --git a/Documentation/arch/x86/features.rst b/Documentation/arch/x86/features.rst
new file mode 100644
index 000000000000..b663f15053ce
--- /dev/null
+++ b/Documentation/arch/x86/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features x86
diff --git a/Documentation/arch/x86/i386/IO-APIC.rst b/Documentation/arch/x86/i386/IO-APIC.rst
new file mode 100644
index 000000000000..ce4d8df15e7c
--- /dev/null
+++ b/Documentation/arch/x86/i386/IO-APIC.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+IO-APIC
+=======
+
+:Author: Ingo Molnar <mingo@kernel.org>
+
+Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
+which is an enhanced interrupt controller. It enables us to route
+hardware interrupts to multiple CPUs, or to CPU groups. Without an
+IO-APIC, interrupts from hardware will be delivered only to the
+CPU which boots the operating system (usually CPU#0).
+
+Linux supports all variants of compliant SMP boards, including ones with
+multiple IO-APICs. Multiple IO-APICs are used in high-end servers to
+distribute IRQ load further.
+
+There are (a few) known breakages in certain older boards, such bugs are
+usually worked around by the kernel. If your MP-compliant SMP board does
+not boot Linux, then consult the linux-smp mailing list archives first.
+
+If your box boots fine with enabled IO-APIC IRQs, then your
+/proc/interrupts will look like this one::
+
+  hell:~> cat /proc/interrupts
+             CPU0
+    0:    1360293    IO-APIC-edge  timer
+    1:          4    IO-APIC-edge  keyboard
+    2:          0          XT-PIC  cascade
+   13:          1          XT-PIC  fpu
+   14:       1448    IO-APIC-edge  ide0
+   16:      28232   IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
+   17:      51304   IO-APIC-level  eth0
+  NMI:          0
+  ERR:          0
+  hell:~>
+
+Some interrupts are still listed as 'XT PIC', but this is not a problem;
+none of those IRQ sources is performance-critical.
+
+
+In the unlikely case that your board does not create a working mp-table,
+you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
+is non-trivial though and cannot be automated. One sample /etc/lilo.conf
+entry::
+
+	append="pirq=15,11,10"
+
+The actual numbers depend on your system, on your PCI cards and on their
+PCI slot position. Usually PCI slots are 'daisy chained' before they are
+connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
+lines)::
+
+               ,-.        ,-.        ,-.        ,-.        ,-.
+     PIRQ4 ----| |-.    ,-| |-.    ,-| |-.    ,-| |--------| |
+               |S|  \  /  |S|  \  /  |S|  \  /  |S|        |S|
+     PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l|
+               |o|  \/    |o|  \/    |o|  \/    |o|        |o|
+     PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t|
+               |1| /\     |2| /\     |3| /\     |4|        |5|
+     PIRQ1 ----| |-  `----| |-  `----| |-  `----| |--------| |
+               `-'        `-'        `-'        `-'        `-'
+
+Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD::
+
+                               ,-.
+                         INTD--| |
+                               |S|
+                         INTC--|l|
+                               |o|
+                         INTB--|t|
+                               |x|
+                         INTA--| |
+                               `-'
+
+These INTA-D PCI IRQs are always 'local to the card', their real meaning
+depends on which slot they are in. If you look at the daisy chaining diagram,
+a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of
+the PCI chipset. Most cards issue INTA, this creates optimal distribution
+between the PIRQ lines. (distributing IRQ sources properly is not a
+necessity, PCI IRQs can be shared at will, but it's a good for performance
+to have non shared interrupts). Slot5 should be used for videocards, they
+do not use interrupts normally, thus they are not daisy chained either.
+
+so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
+Slot2, then you'll have to specify this pirq= line::
+
+	append="pirq=11,9"
+
+the following script tries to figure out such a default pirq= line from
+your PCI configuration::
+
+	echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
+
+note that this script won't work if you have skipped a few slots or if your
+board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
+connected in some strange way). E.g. if in the above case you have your SCSI
+card (IRQ11) in Slot3, and have Slot1 empty::
+
+	append="pirq=0,9,11"
+
+[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting)
+slots.]
+
+Generally, it's always possible to find out the correct pirq= settings, just
+permute all IRQ numbers properly ... it will take some time though. An
+'incorrect' pirq line will cause the booting process to hang, or a device
+won't function properly (e.g. if it's inserted as a module).
+
+If you have 2 PCI buses, then you can use up to 8 pirq values, although such
+boards tend to have a good configuration.
+
+Be prepared that it might happen that you need some strange pirq line::
+
+	append="pirq=0,0,0,0,0,0,9,11"
+
+Use smart trial-and-error techniques to find out the correct pirq line ...
+
+Good luck and mail to linux-smp@vger.kernel.org or
+linux-kernel@vger.kernel.org if you have any problems that are not covered
+by this document.
+
diff --git a/Documentation/arch/x86/i386/index.rst b/Documentation/arch/x86/i386/index.rst
new file mode 100644
index 000000000000..8747cf5bbd49
--- /dev/null
+++ b/Documentation/arch/x86/i386/index.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+i386 Support
+============
+
+.. toctree::
+   :maxdepth: 2
+
+   IO-APIC
diff --git a/Documentation/arch/x86/ifs.rst b/Documentation/arch/x86/ifs.rst
new file mode 100644
index 000000000000..97abb696a680
--- /dev/null
+++ b/Documentation/arch/x86/ifs.rst
@@ -0,0 +1,2 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. kernel-doc:: drivers/platform/x86/intel/ifs/ifs.h
diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst
new file mode 100644
index 000000000000..c73d133fd37c
--- /dev/null
+++ b/Documentation/arch/x86/index.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+x86-specific Documentation
+==========================
+
+.. toctree::
+   :maxdepth: 2
+   :numbered:
+
+   boot
+   booting-dt
+   cpuinfo
+   topology
+   exception-tables
+   kernel-stacks
+   entry_64
+   earlyprintk
+   orc-unwinder
+   zero-page
+   tlb
+   mtrr
+   pat
+   intel-hfi
+   iommu
+   intel_txt
+   amd-memory-encryption
+   amd_hsmp
+   tdx
+   pti
+   mds
+   microcode
+   resctrl
+   tsx_async_abort
+   buslock
+   usb-legacy-support
+   i386/index
+   x86_64/index
+   ifs
+   sva
+   sgx
+   features
+   elf_auxvec
+   xstate
diff --git a/Documentation/arch/x86/intel-hfi.rst b/Documentation/arch/x86/intel-hfi.rst
new file mode 100644
index 000000000000..49dea58ea4fb
--- /dev/null
+++ b/Documentation/arch/x86/intel-hfi.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Hardware-Feedback Interface for scheduling on Intel Hardware
+============================================================
+
+Overview
+--------
+
+Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and
+IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section
+14.6 [1]_.
+
+The HFI gives the operating system a performance and energy efficiency
+capability data for each CPU in the system. Linux can use the information from
+the HFI to influence task placement decisions.
+
+The Hardware Feedback Interface
+-------------------------------
+
+The Hardware Feedback Interface provides to the operating system information
+about the performance and energy efficiency of each CPU in the system. Each
+capability is given as a unit-less quantity in the range [0-255]. Higher values
+indicate higher capability. Energy efficiency and performance are reported in
+separate capabilities. Even though on some systems these two metrics may be
+related, they are specified as independent capabilities in the Intel SDM.
+
+These capabilities may change at runtime as a result of changes in the
+operating conditions of the system or the action of external factors. The rate
+at which these capabilities are updated is specific to each processor model. On
+some models, capabilities are set at boot time and never change. On others,
+capabilities may change every tens of milliseconds. For instance, a remote
+mechanism may be used to lower Thermal Design Power. Such change can be
+reflected in the HFI. Likewise, if the system needs to be throttled due to
+excessive heat, the HFI may reflect reduced performance on specific CPUs.
+
+The kernel or a userspace policy daemon can use these capabilities to modify
+task placement decisions. For instance, if either the performance or energy
+capabilities of a given logical processor becomes zero, it is an indication that
+the hardware recommends to the operating system to not schedule any tasks on
+that processor for performance or energy efficiency reasons, respectively.
+
+Implementation details for Linux
+--------------------------------
+
+The infrastructure to handle thermal event interrupts has two parts. In the
+Local Vector Table of a CPU's local APIC, there exists a register for the
+Thermal Monitor Register. This register controls how interrupts are delivered
+to a CPU when the thermal monitor generates and interrupt. Further details
+can be found in the Intel SDM Vol. 3 Section 10.5 [1]_.
+
+The thermal monitor may generate interrupts per CPU or per package. The HFI
+generates package-level interrupts. This monitor is configured and initialized
+via a set of machine-specific registers. Specifically, the HFI interrupt and
+status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT
+and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI
+table per package. Further details can be found in the Intel SDM Vol. 3
+Section 14.9 [1]_.
+
+The hardware issues an HFI interrupt after updating the HFI table and is ready
+for the operating system to consume it. CPUs receive such interrupt via the
+thermal entry in the Local APIC's Local Vector Table.
+
+When servicing such interrupt, the HFI driver parses the updated table and
+relays the update to userspace using the thermal notification framework. Given
+that there may be many HFI updates every second, the updates relayed to
+userspace are throttled at a rate of CONFIG_HZ jiffies.
+
+References
+----------
+
+.. [1] https://www.intel.com/sdm
diff --git a/Documentation/arch/x86/intel_txt.rst b/Documentation/arch/x86/intel_txt.rst
new file mode 100644
index 000000000000..d83c1a2122c9
--- /dev/null
+++ b/Documentation/arch/x86/intel_txt.rst
@@ -0,0 +1,227 @@
+=====================
+Intel(R) TXT Overview
+=====================
+
+Intel's technology for safer computing, Intel(R) Trusted Execution
+Technology (Intel(R) TXT), defines platform-level enhancements that
+provide the building blocks for creating trusted platforms.
+
+Intel TXT was formerly known by the code name LaGrande Technology (LT).
+
+Intel TXT in Brief:
+
+-  Provides dynamic root of trust for measurement (DRTM)
+-  Data protection in case of improper shutdown
+-  Measurement and verification of launched environment
+
+Intel TXT is part of the vPro(TM) brand and is also available some
+non-vPro systems.  It is currently available on desktop systems
+based on the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell
+Optiplex 755, HP dc7800, etc.) and mobile systems based on the GM45,
+PM45, and GS45 Express chipsets.
+
+For more information, see http://www.intel.com/technology/security/.
+This site also has a link to the Intel TXT MLE Developers Manual,
+which has been updated for the new released platforms.
+
+Intel TXT has been presented at various events over the past few
+years, some of which are:
+
+      - LinuxTAG 2008:
+          http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html
+
+      - TRUST2008:
+          http://www.trust-conference.eu/downloads/Keynote-Speakers/
+          3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf
+
+      - IDF, Shanghai:
+          http://www.prcidf.com.cn/index_en.html
+
+      - IDFs 2006, 2007
+	  (I'm not sure if/where they are online)
+
+Trusted Boot Project Overview
+=============================
+
+Trusted Boot (tboot) is an open source, pre-kernel/VMM module that
+uses Intel TXT to perform a measured and verified launch of an OS
+kernel/VMM.
+
+It is hosted on SourceForge at http://sourceforge.net/projects/tboot.
+The mercurial source repo is available at http://www.bughost.org/
+repos.hg/tboot.hg.
+
+Tboot currently supports launching Xen (open source VMM/hypervisor
+w/ TXT support since v3.2), and now Linux kernels.
+
+
+Value Proposition for Linux or "Why should you care?"
+=====================================================
+
+While there are many products and technologies that attempt to
+measure or protect the integrity of a running kernel, they all
+assume the kernel is "good" to begin with.  The Integrity
+Measurement Architecture (IMA) and Linux Integrity Module interface
+are examples of such solutions.
+
+To get trust in the initial kernel without using Intel TXT, a
+static root of trust must be used.  This bases trust in BIOS
+starting at system reset and requires measurement of all code
+executed between system reset through the completion of the kernel
+boot as well as data objects used by that code.  In the case of a
+Linux kernel, this means all of BIOS, any option ROMs, the
+bootloader and the boot config.  In practice, this is a lot of
+code/data, much of which is subject to change from boot to boot
+(e.g. changing NICs may change option ROMs).  Without reference
+hashes, these measurement changes are difficult to assess or
+confirm as benign.  This process also does not provide DMA
+protection, memory configuration/alias checks and locks, crash
+protection, or policy support.
+
+By using the hardware-based root of trust that Intel TXT provides,
+many of these issues can be mitigated.  Specifically: many
+pre-launch components can be removed from the trust chain, DMA
+protection is provided to all launched components, a large number
+of platform configuration checks are performed and values locked,
+protection is provided for any data in the event of an improper
+shutdown, and there is support for policy-based execution/verification.
+This provides a more stable measurement and a higher assurance of
+system configuration and initial state than would be otherwise
+possible.  Since the tboot project is open source, source code for
+almost all parts of the trust chain is available (excepting SMM and
+Intel-provided firmware).
+
+How Does it Work?
+=================
+
+-  Tboot is an executable that is launched by the bootloader as
+   the "kernel" (the binary the bootloader executes).
+-  It performs all of the work necessary to determine if the
+   platform supports Intel TXT and, if so, executes the GETSEC[SENTER]
+   processor instruction that initiates the dynamic root of trust.
+
+   -  If tboot determines that the system does not support Intel TXT
+      or is not configured correctly (e.g. the SINIT AC Module was
+      incorrect), it will directly launch the kernel with no changes
+      to any state.
+   -  Tboot will output various information about its progress to the
+      terminal, serial port, and/or an in-memory log; the output
+      locations can be configured with a command line switch.
+
+-  The GETSEC[SENTER] instruction will return control to tboot and
+   tboot then verifies certain aspects of the environment (e.g. TPM NV
+   lock, e820 table does not have invalid entries, etc.).
+-  It will wake the APs from the special sleep state the GETSEC[SENTER]
+   instruction had put them in and place them into a wait-for-SIPI
+   state.
+
+   -  Because the processors will not respond to an INIT or SIPI when
+      in the TXT environment, it is necessary to create a small VT-x
+      guest for the APs.  When they run in this guest, they will
+      simply wait for the INIT-SIPI-SIPI sequence, which will cause
+      VMEXITs, and then disable VT and jump to the SIPI vector.  This
+      approach seemed like a better choice than having to insert
+      special code into the kernel's MP wakeup sequence.
+
+-  Tboot then applies an (optional) user-defined launch policy to
+   verify the kernel and initrd.
+
+   -  This policy is rooted in TPM NV and is described in the tboot
+      project.  The tboot project also contains code for tools to
+      create and provision the policy.
+   -  Policies are completely under user control and if not present
+      then any kernel will be launched.
+   -  Policy action is flexible and can include halting on failures
+      or simply logging them and continuing.
+
+-  Tboot adjusts the e820 table provided by the bootloader to reserve
+   its own location in memory as well as to reserve certain other
+   TXT-related regions.
+-  As part of its launch, tboot DMA protects all of RAM (using the
+   VT-d PMRs).  Thus, the kernel must be booted with 'intel_iommu=on'
+   in order to remove this blanket protection and use VT-d's
+   page-level protection.
+-  Tboot will populate a shared page with some data about itself and
+   pass this to the Linux kernel as it transfers control.
+
+   -  The location of the shared page is passed via the boot_params
+      struct as a physical address.
+
+-  The kernel will look for the tboot shared page address and, if it
+   exists, map it.
+-  As one of the checks/protections provided by TXT, it makes a copy
+   of the VT-d DMARs in a DMA-protected region of memory and verifies
+   them for correctness.  The VT-d code will detect if the kernel was
+   launched with tboot and use this copy instead of the one in the
+   ACPI table.
+-  At this point, tboot and TXT are out of the picture until a
+   shutdown (S<n>)
+-  In order to put a system into any of the sleep states after a TXT
+   launch, TXT must first be exited.  This is to prevent attacks that
+   attempt to crash the system to gain control on reboot and steal
+   data left in memory.
+
+   -  The kernel will perform all of its sleep preparation and
+      populate the shared page with the ACPI data needed to put the
+      platform in the desired sleep state.
+   -  Then the kernel jumps into tboot via the vector specified in the
+      shared page.
+   -  Tboot will clean up the environment and disable TXT, then use the
+      kernel-provided ACPI information to actually place the platform
+      into the desired sleep state.
+   -  In the case of S3, tboot will also register itself as the resume
+      vector.  This is necessary because it must re-establish the
+      measured environment upon resume.  Once the TXT environment
+      has been restored, it will restore the TPM PCRs and then
+      transfer control back to the kernel's S3 resume vector.
+      In order to preserve system integrity across S3, the kernel
+      provides tboot with a set of memory ranges (RAM and RESERVED_KERN
+      in the e820 table, but not any memory that BIOS might alter over
+      the S3 transition) that tboot will calculate a MAC (message
+      authentication code) over and then seal with the TPM. On resume
+      and once the measured environment has been re-established, tboot
+      will re-calculate the MAC and verify it against the sealed value.
+      Tboot's policy determines what happens if the verification fails.
+      Note that the c/s 194 of tboot which has the new MAC code supports
+      this.
+
+That's pretty much it for TXT support.
+
+
+Configuring the System
+======================
+
+This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels.
+
+In BIOS, the user must enable:  TPM, TXT, VT-x, VT-d.  Not all BIOSes
+allow these to be individually enabled/disabled and the screens in
+which to find them are BIOS-specific.
+
+grub.conf needs to be modified as follows::
+
+        title Linux 2.6.29-tip w/ tboot
+          root (hd0,0)
+                kernel /tboot.gz logging=serial,vga,memory
+                module /vmlinuz-2.6.29-tip intel_iommu=on ro
+                       root=LABEL=/ rhgb console=ttyS0,115200 3
+                module /initrd-2.6.29-tip.img
+                module /Q35_SINIT_17.BIN
+
+The kernel option for enabling Intel TXT support is found under the
+Security top-level menu and is called "Enable Intel(R) Trusted
+Execution Technology (TXT)".  It is considered EXPERIMENTAL and
+depends on the generic x86 support (to allow maximum flexibility in
+kernel build options), since the tboot code will detect whether the
+platform actually supports Intel TXT and thus whether any of the
+kernel code is executed.
+
+The Q35_SINIT_17.BIN file is what Intel TXT refers to as an
+Authenticated Code Module.  It is specific to the chipset in the
+system and can also be found on the Trusted Boot site.  It is an
+(unencrypted) module signed by Intel that is used as part of the
+DRTM process to verify and configure the system.  It is signed
+because it operates at a higher privilege level in the system than
+any other macrocode and its correct operation is critical to the
+establishment of the DRTM.  The process for determining the correct
+SINIT ACM for a system is documented in the SINIT-guide.txt file
+that is on the tboot SourceForge site under the SINIT ACM downloads.
diff --git a/Documentation/arch/x86/iommu.rst b/Documentation/arch/x86/iommu.rst
new file mode 100644
index 000000000000..42c7a6faa39a
--- /dev/null
+++ b/Documentation/arch/x86/iommu.rst
@@ -0,0 +1,151 @@
+=================
+x86 IOMMU Support
+=================
+
+The architecture specs can be obtained from the below locations.
+
+- Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
+- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Basic stuff
+-----------
+
+ACPI enumerates and lists the different IOMMUs on the platform, and
+device scope relationships between devices and which IOMMU controls
+them.
+
+Some ACPI Keywords:
+
+- DMAR - Intel DMA Remapping table
+- DRHD - Intel DMA Remapping Hardware Unit Definition
+- RMRR - Intel Reserved Memory Region Reporting Structure
+- IVRS - AMD I/O Virtualization Reporting Structure
+- IVDB - AMD I/O Virtualization Definition Block
+- IVHD - AMD I/O Virtualization Hardware Definition
+
+What is Intel RMRR?
+^^^^^^^^^^^^^^^^^^^
+
+There are some devices the BIOS controls, for e.g USB devices to perform
+PS2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence BIOS uses RMRR to specify these regions along with
+devices that need to access these regions. OS is expected to setup
+unity mappings for these regions for these devices to access these regions.
+
+What is AMD IVRS?
+^^^^^^^^^^^^^^^^^
+
+The architecture defines an ACPI-compatible data structure called an I/O
+Virtualization Reporting Structure (IVRS) that is used to convey information
+related to I/O virtualization to system software.  The IVRS describes the
+configuration and capabilities of the IOMMUs contained in the platform as
+well as information about the devices that each IOMMU virtualizes.
+
+The IVRS provides information about the following:
+
+- IOMMUs present in the platform including their capabilities and proper configuration
+- System I/O topology relevant to each IOMMU
+- Peripheral devices that cannot be otherwise enumerated
+- Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software.
+
+How is an I/O Virtual Address (IOVA) generated?
+-----------------------------------------------
+
+Well behaved drivers call dma_map_*() calls before sending command to device
+that needs to perform DMA. Once DMA is completed and mapping is no longer
+required, driver performs dma_unmap_*() calls to unmap the region.
+
+Intel Specific Notes
+--------------------
+
+Graphics Problems?
+^^^^^^^^^^^^^^^^^^
+
+If you encounter issues with graphics devices, you can try adding
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+If this fixes anything, please ensure you file a bug reporting the problem.
+
+Some exceptions to IOVA
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+address from PCI MMIO ranges so they are not allocated for IOVA addresses.
+
+AMD Specific Notes
+------------------
+
+Graphics Problems?
+^^^^^^^^^^^^^^^^^^
+
+If you encounter issues with integrated graphics devices, you can try adding
+option iommu=pt to the kernel command line use a 1:1 mapping for the IOMMU.  If
+this fixes anything, please ensure you file a bug reporting the problem.
+
+Fault reporting
+---------------
+When errors are reported, the IOMMU signals via an interrupt. The fault
+reason and device that caused it is printed on the console.
+
+
+Kernel Log Samples
+------------------
+
+Intel Boot Messages
+^^^^^^^^^^^^^^^^^^^
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI:
+
+::
+
+	ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed:
+
+::
+
+	ACPI DMAR:Host address width 36
+	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+	ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+	ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+	ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+
+When DMAR is enabled for use, you will notice:
+
+::
+
+	PCI-DMA: Using DMAR IOMMU
+
+Intel Fault reporting
+^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+	DMAR:[fault reason 05] PTE Write access is not set
+	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+	DMAR:[fault reason 05] PTE Write access is not set
+
+AMD Boot Messages
+^^^^^^^^^^^^^^^^^
+
+Something like this gets printed indicating presence of the IOMMU:
+
+::
+
+	iommu: Default domain type: Translated
+	iommu: DMA domain TLB invalidation policy: lazy mode
+
+AMD Fault reporting
+^^^^^^^^^^^^^^^^^^^
+
+::
+
+	AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
+	AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
diff --git a/Documentation/arch/x86/kernel-stacks.rst b/Documentation/arch/x86/kernel-stacks.rst
new file mode 100644
index 000000000000..6b0bcf027ff1
--- /dev/null
+++ b/Documentation/arch/x86/kernel-stacks.rst
@@ -0,0 +1,152 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Kernel Stacks
+=============
+
+Kernel stacks on x86-64 bit
+===========================
+
+Most of the text from Keith Owens, hacked by AK
+
+x86_64 page size (PAGE_SIZE) is 4K.
+
+Like all other architectures, x86_64 has a kernel stack for every
+active thread.  These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
+These stacks contain useful data as long as a thread is alive or a
+zombie. While the thread is in user space the kernel stack is empty
+except for the thread_info structure at the bottom.
+
+In addition to the per thread stacks, there are specialized stacks
+associated with each CPU.  These stacks are only used while the kernel
+is in control on that CPU; when a CPU returns to user space the
+specialized stacks contain no useful data.  The main CPU stacks are:
+
+* Interrupt stack.  IRQ_STACK_SIZE
+
+  Used for external hardware interrupts.  If this is the first external
+  hardware interrupt (i.e. not a nested hardware interrupt) then the
+  kernel switches from the current task to the interrupt stack.  Like
+  the split thread and interrupt stacks on i386, this gives more room
+  for kernel interrupt processing without having to increase the size
+  of every per thread stack.
+
+  The interrupt stack is also used when processing a softirq.
+
+Switching to the kernel interrupt stack is done by software based on a
+per CPU interrupt nest counter. This is needed because x86-64 "IST"
+hardware stacks cannot nest without races.
+
+x86_64 also has a feature which is not available on i386, the ability
+to automatically switch to a new stack for designated events such as
+double fault or NMI, which makes it easier to handle these unusual
+events on x86_64.  This feature is called the Interrupt Stack Table
+(IST).  There can be up to 7 IST entries per CPU. The IST code is an
+index into the Task State Segment (TSS). The IST entries in the TSS
+point to dedicated stacks; each stack can be a different size.
+
+An IST is selected by a non-zero value in the IST field of an
+interrupt-gate descriptor.  When an interrupt occurs and the hardware
+loads such a descriptor, the hardware automatically sets the new stack
+pointer based on the IST value, then invokes the interrupt handler.  If
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
+
+Events with different IST codes (i.e. with different stacks) can be
+nested.  For example, a debug interrupt can safely be interrupted by an
+NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
+pointers on entry to and exit from all IST events, in theory allowing
+IST events with the same code to be nested.  However in most cases, the
+stack size allocated to an IST assumes no nesting for the same code.
+If that assumption is ever broken then the stacks will become corrupt.
+
+The currently assigned IST stacks are:
+
+* ESTACK_DF.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for interrupt 8 - Double Fault Exception (#DF).
+
+  Invoked when handling one exception causes another exception. Happens
+  when the kernel is very confused (e.g. kernel stack pointer corrupt).
+  Using a separate stack allows the kernel to recover from it well enough
+  in many cases to still output an oops.
+
+* ESTACK_NMI.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for non-maskable interrupts (NMI).
+
+  NMI can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for NMI events avoids making
+  assumptions about the previous state of the kernel stack.
+
+* ESTACK_DB.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for hardware debug interrupts (interrupt 1) and for software
+  debug interrupts (INT3).
+
+  When debugging a kernel, debug interrupts (both hardware and
+  software) can occur at any time.  Using IST for these interrupts
+  avoids making assumptions about the previous state of the kernel
+  stack.
+
+  To handle nested #DB correctly there exist two instances of DB stacks. On
+  #DB entry the IST stackpointer for #DB is switched to the second instance
+  so a nested #DB starts from a clean stack. The nested #DB switches
+  the IST stackpointer to a guard hole to catch triple nesting.
+
+* ESTACK_MCE.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for interrupt 18 - Machine Check Exception (#MC).
+
+  MCE can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for MCE events avoids making
+  assumptions about the previous state of the kernel stack.
+
+For more details see the Intel IA32 or AMD AMD64 architecture manuals.
+
+
+Printing backtraces on x86
+==========================
+
+The question about the '?' preceding function names in an x86 stacktrace
+keeps popping up, here's an indepth explanation. It helps if the reader
+stares at print_context_stack() and the whole machinery in and around
+arch/x86/kernel/dumpstack.c.
+
+Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:
+
+We always scan the full kernel stack for return addresses stored on
+the kernel stack(s) [1]_, from stack top to stack bottom, and print out
+anything that 'looks like' a kernel text address.
+
+If it fits into the frame pointer chain, we print it without a question
+mark, knowing that it's part of the real backtrace.
+
+If the address does not fit into our expected frame pointer chain we
+still print it, but we print a '?'. It can mean two things:
+
+ - either the address is not part of the call chain: it's just stale
+   values on the kernel stack, from earlier function calls. This is
+   the common case.
+
+ - or it is part of the call chain, but the frame pointer was not set
+   up properly within the function, so we don't recognize it.
+
+This way we will always print out the real call chain (plus a few more
+entries), regardless of whether the frame pointer was set up correctly
+or not - but in most cases we'll get the call chain right as well. The
+entries printed are strictly in stack order, so you can deduce more
+information from that as well.
+
+The most important property of this method is that we _never_ lose
+information: we always strive to print _all_ addresses on the stack(s)
+that look like kernel text addresses, so if debug information is wrong,
+we still print out the real call chain as well - just with more question
+marks than ideal.
+
+.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
+       the right order, and try to cross from one stack into another
+       reconstructing the call chain. This works most of the time.
diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst
new file mode 100644
index 000000000000..5d4330be200f
--- /dev/null
+++ b/Documentation/arch/x86/mds.rst
@@ -0,0 +1,193 @@
+Microarchitectural Data Sampling (MDS) mitigation
+=================================================
+
+.. _mds:
+
+Overview
+--------
+
+Microarchitectural Data Sampling (MDS) is a family of side channel attacks
+on internal buffers in Intel CPUs. The variants are:
+
+ - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+ - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+ - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+ - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)
+
+MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
+dependent load (store-to-load forwarding) as an optimization. The forward
+can also happen to a faulting or assisting load operation for a different
+memory address, which can be exploited under certain conditions. Store
+buffers are partitioned between Hyper-Threads so cross thread forwarding is
+not possible. But if a thread enters or exits a sleep state the store
+buffer is repartitioned which can expose data from one thread to the other.
+
+MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
+L1 miss situations and to hold data which is returned or sent in response
+to a memory or I/O operation. Fill buffers can forward data to a load
+operation and also write data to the cache. When the fill buffer is
+deallocated it can retain the stale data of the preceding operations which
+can then be forwarded to a faulting or assisting load operation, which can
+be exploited under certain conditions. Fill buffers are shared between
+Hyper-Threads so cross thread leakage is possible.
+
+MLPDS leaks Load Port Data. Load ports are used to perform load operations
+from memory or I/O. The received data is then forwarded to the register
+file or a subsequent operation. In some implementations the Load Port can
+contain stale data from a previous operation which can be forwarded to
+faulting or assisting loads under certain conditions, which again can be
+exploited eventually. Load ports are shared between Hyper-Threads so cross
+thread leakage is possible.
+
+MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
+memory that takes a fault or assist can leave data in a microarchitectural
+structure that may later be observed using one of the same methods used by
+MSBDS, MFBDS or MLPDS.
+
+Exposure assumptions
+--------------------
+
+It is assumed that attack code resides in user space or in a guest with one
+exception. The rationale behind this assumption is that the code construct
+needed for exploiting MDS requires:
+
+ - to control the load to trigger a fault or assist
+
+ - to have a disclosure gadget which exposes the speculatively accessed
+   data for consumption through a side channel.
+
+ - to control the pointer through which the disclosure gadget exposes the
+   data
+
+The existence of such a construct in the kernel cannot be excluded with
+100% certainty, but the complexity involved makes it extremly unlikely.
+
+There is one exception, which is untrusted BPF. The functionality of
+untrusted BPF is limited, but it needs to be thoroughly investigated
+whether it can be used to create such a construct.
+
+
+Mitigation strategy
+-------------------
+
+All variants have the same mitigation strategy at least for the single CPU
+thread case (SMT off): Force the CPU to clear the affected buffers.
+
+This is achieved by using the otherwise unused and obsolete VERW
+instruction in combination with a microcode update. The microcode clears
+the affected CPU buffers when the VERW instruction is executed.
+
+For virtualization there are two ways to achieve CPU buffer
+clearing. Either the modified VERW instruction or via the L1D Flush
+command. The latter is issued when L1TF mitigation is enabled so the extra
+VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
+be issued.
+
+If the VERW instruction with the supplied segment selector argument is
+executed on a CPU without the microcode update there is no side effect
+other than a small number of pointlessly wasted CPU cycles.
+
+This does not protect against cross Hyper-Thread attacks except for MSBDS
+which is only exploitable cross Hyper-thread when one of the Hyper-Threads
+enters a C-state.
+
+The kernel provides a function to invoke the buffer clearing:
+
+    mds_clear_cpu_buffers()
+
+The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
+(idle) transitions.
+
+As a special quirk to address virtualization scenarios where the host has
+the microcode updated, but the hypervisor does not (yet) expose the
+MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
+hope that it might actually clear the buffers. The state is reflected
+accordingly.
+
+According to current knowledge additional mitigations inside the kernel
+itself are not required because the necessary gadgets to expose the leaked
+data cannot be controlled in a way which allows exploitation from malicious
+user space or VM guests.
+
+Kernel internal mitigation modes
+--------------------------------
+
+ ======= ============================================================
+ off      Mitigation is disabled. Either the CPU is not affected or
+          mds=off is supplied on the kernel command line
+
+ full     Mitigation is enabled. CPU is affected and MD_CLEAR is
+          advertised in CPUID.
+
+ vmwerv	  Mitigation is enabled. CPU is affected and MD_CLEAR is not
+	  advertised in CPUID. That is mainly for virtualization
+	  scenarios where the host has the updated microcode but the
+	  hypervisor does not expose MD_CLEAR in CPUID. It's a best
+	  effort approach without guarantee.
+ ======= ============================================================
+
+If the CPU is affected and mds=off is not supplied on the kernel command
+line then the kernel selects the appropriate mitigation mode depending on
+the availability of the MD_CLEAR CPUID bit.
+
+Mitigation points
+-----------------
+
+1. Return to user space
+^^^^^^^^^^^^^^^^^^^^^^^
+
+   When transitioning from kernel to user space the CPU buffers are flushed
+   on affected CPUs when the mitigation is not disabled on the kernel
+   command line. The migitation is enabled through the static key
+   mds_user_clear.
+
+   The mitigation is invoked in prepare_exit_to_usermode() which covers
+   all but one of the kernel to user space transitions.  The exception
+   is when we return from a Non Maskable Interrupt (NMI), which is
+   handled directly in do_nmi().
+
+   (The reason that NMI is special is that prepare_exit_to_usermode() can
+    enable IRQs.  In NMI context, NMIs are blocked, and we don't want to
+    enable IRQs with NMIs blocked.)
+
+
+2. C-State transition
+^^^^^^^^^^^^^^^^^^^^^
+
+   When a CPU goes idle and enters a C-State the CPU buffers need to be
+   cleared on affected CPUs when SMT is active. This addresses the
+   repartitioning of the store buffer when one of the Hyper-Threads enters
+   a C-State.
+
+   When SMT is inactive, i.e. either the CPU does not support it or all
+   sibling threads are offline CPU buffer clearing is not required.
+
+   The idle clearing is enabled on CPUs which are only affected by MSBDS
+   and not by any other MDS variant. The other MDS variants cannot be
+   protected against cross Hyper-Thread attacks because the Fill Buffer and
+   the Load Ports are shared. So on CPUs affected by other variants, the
+   idle clearing would be a window dressing exercise and is therefore not
+   activated.
+
+   The invocation is controlled by the static key mds_idle_clear which is
+   switched depending on the chosen mitigation mode and the SMT state of
+   the system.
+
+   The buffer clear is only invoked before entering the C-State to prevent
+   that stale data from the idling CPU from spilling to the Hyper-Thread
+   sibling after the store buffer got repartitioned and all entries are
+   available to the non idle sibling.
+
+   When coming out of idle the store buffer is partitioned again so each
+   sibling has half of it available. The back from idle CPU could be then
+   speculatively exposed to contents of the sibling. The buffers are
+   flushed either on exit to user space or on VMENTER so malicious code
+   in user space or the guest cannot speculatively access them.
+
+   The mitigation is hooked into all variants of halt()/mwait(), but does
+   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
+   has been superseded by the intel_idle driver around 2010 and is
+   preferred on all affected CPUs which are expected to gain the MD_CLEAR
+   functionality in microcode. Aside of that the IO-Port mechanism is a
+   legacy interface which is only used on older systems which are either
+   not affected or do not receive microcode updates anymore.
diff --git a/Documentation/arch/x86/microcode.rst b/Documentation/arch/x86/microcode.rst
new file mode 100644
index 000000000000..b627c6f36bcf
--- /dev/null
+++ b/Documentation/arch/x86/microcode.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+The Linux Microcode Loader
+==========================
+
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+          - Borislav Petkov <bp@suse.de>
+	  - Ashok Raj <ashok.raj@intel.com>
+
+The kernel has a x86 microcode loading facility which is supposed to
+provide microcode loading methods in the OS. Potential use cases are
+updating the microcode on platforms beyond the OEM End-Of-Life support,
+and updating the microcode on long-running systems without rebooting.
+
+The loader supports three loading methods:
+
+Early load microcode
+====================
+
+The kernel can update microcode very early during boot. Loading
+microcode early can fix CPU issues before they are observed during
+kernel boot time.
+
+The microcode is stored in an initrd file. During boot, it is read from
+it and loaded into the CPU cores.
+
+The format of the combined initrd image is microcode in (uncompressed)
+cpio format followed by the (possibly compressed) initrd image. The
+loader parses the combined initrd image during boot.
+
+The microcode files in cpio name space are:
+
+on Intel:
+  kernel/x86/microcode/GenuineIntel.bin
+on AMD  :
+  kernel/x86/microcode/AuthenticAMD.bin
+
+During BSP (BootStrapping Processor) boot (pre-SMP), the kernel
+scans the microcode file in the initrd. If microcode matching the
+CPU is found, it will be applied in the BSP and later on in all APs
+(Application Processors).
+
+The loader also saves the matching microcode for the CPU in memory.
+Thus, the cached microcode patch is applied when CPUs resume from a
+sleep state.
+
+Here's a crude example how to prepare an initrd with microcode (this is
+normally done automatically by the distribution, when recreating the
+initrd, so you don't really have to do it yourself. It is documented
+here for future reference only).
+::
+
+  #!/bin/bash
+
+  if [ -z "$1" ]; then
+      echo "You need to supply an initrd file"
+      exit 1
+  fi
+
+  INITRD="$1"
+
+  DSTDIR=kernel/x86/microcode
+  TMPDIR=/tmp/initrd
+
+  rm -rf $TMPDIR
+
+  mkdir $TMPDIR
+  cd $TMPDIR
+  mkdir -p $DSTDIR
+
+  if [ -d /lib/firmware/amd-ucode ]; then
+          cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin
+  fi
+
+  if [ -d /lib/firmware/intel-ucode ]; then
+          cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin
+  fi
+
+  find . | cpio -o -H newc >../ucode.cpio
+  cd ..
+  mv $INITRD $INITRD.orig
+  cat ucode.cpio $INITRD.orig > $INITRD
+
+  rm -rf $TMPDIR
+
+
+The system needs to have the microcode packages installed into
+/lib/firmware or you need to fixup the paths above if yours are
+somewhere else and/or you've downloaded them directly from the processor
+vendor's site.
+
+Late loading
+============
+
+You simply install the microcode packages your distro supplies and
+run::
+
+  # echo 1 > /sys/devices/system/cpu/microcode/reload
+
+as root.
+
+The loading mechanism looks for microcode blobs in
+/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
+packages already put them there.
+
+Since kernel 5.19, late loading is not enabled by default.
+
+The /dev/cpu/microcode method has been removed in 5.19.
+
+Why is late loading dangerous?
+==============================
+
+Synchronizing all CPUs
+----------------------
+
+The microcode engine which receives the microcode update is shared
+between the two logical threads in a SMT system. Therefore, when
+the update is executed on one SMT thread of the core, the sibling
+"automatically" gets the update.
+
+Since the microcode can "simulate" MSRs too, while the microcode update
+is in progress, those simulated MSRs transiently cease to exist. This
+can result in unpredictable results if the SMT sibling thread happens to
+be in the middle of an access to such an MSR. The usual observation is
+that such MSR accesses cause #GPs to be raised to signal that former are
+not present.
+
+The disappearing MSRs are just one common issue which is being observed.
+Any other instruction that's being patched and gets concurrently
+executed by the other SMT sibling, can also result in similar,
+unpredictable behavior.
+
+To eliminate this case, a stop_machine()-based CPU synchronization was
+introduced as a way to guarantee that all logical CPUs will not execute
+any code but just wait in a spin loop, polling an atomic variable.
+
+While this took care of device or external interrupts, IPIs including
+LVT ones, such as CMCI etc, it cannot address other special interrupts
+that can't be shut off. Those are Machine Check (#MC), System Management
+(#SMI) and Non-Maskable interrupts (#NMI).
+
+Machine Checks
+--------------
+
+Machine Checks (#MC) are non-maskable. There are two kinds of MCEs.
+Fatal un-recoverable MCEs and recoverable MCEs. While un-recoverable
+errors are fatal, recoverable errors can also happen in kernel context
+are also treated as fatal by the kernel.
+
+On certain Intel machines, MCEs are also broadcast to all threads in a
+system. If one thread is in the middle of executing WRMSR, a MCE will be
+taken at the end of the flow. Either way, they will wait for the thread
+performing the wrmsr(0x79) to rendezvous in the MCE handler and shutdown
+eventually if any of the threads in the system fail to check in to the
+MCE rendezvous.
+
+To be paranoid and get predictable behavior, the OS can choose to set
+MCG_STATUS.MCIP. Since MCEs can be at most one in a system, if an
+MCE was signaled, the above condition will promote to a system reset
+automatically. OS can turn off MCIP at the end of the update for that
+core.
+
+System Management Interrupt
+---------------------------
+
+SMIs are also broadcast to all CPUs in the platform. Microcode update
+requests exclusive access to the core before writing to MSR 0x79. So if
+it does happen such that, one thread is in WRMSR flow, and the 2nd got
+an SMI, that thread will be stopped in the first instruction in the SMI
+handler.
+
+Since the secondary thread is stopped in the first instruction in SMI,
+there is very little chance that it would be in the middle of executing
+an instruction being patched. Plus OS has no way to stop SMIs from
+happening.
+
+Non-Maskable Interrupts
+-----------------------
+
+When thread0 of a core is doing the microcode update, if thread1 is
+pulled into NMI, that can cause unpredictable behavior due to the
+reasons above.
+
+OS can choose a variety of methods to avoid running into this situation.
+
+
+Is the microcode suitable for late loading?
+-------------------------------------------
+
+Late loading is done when the system is fully operational and running
+real workloads. Late loading behavior depends on what the base patch on
+the CPU is before upgrading to the new patch.
+
+This is true for Intel CPUs.
+
+Consider, for example, a CPU has patch level 1 and the update is to
+patch level 3.
+
+Between patch1 and patch3, patch2 might have deprecated a software-visible
+feature.
+
+This is unacceptable if software is even potentially using that feature.
+For instance, say MSR_X is no longer available after an update,
+accessing that MSR will cause a #GP fault.
+
+Basically there is no way to declare a new microcode update suitable
+for late-loading. This is another one of the problems that caused late
+loading to be not enabled by default.
+
+Builtin microcode
+=================
+
+The loader supports also loading of a builtin microcode supplied through
+the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is
+currently supported.
+
+Here's an example::
+
+  CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
+  CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
+
+This basically means, you have the following tree structure locally::
+
+  /lib/firmware/
+  |-- amd-ucode
+  ...
+  |   |-- microcode_amd_fam15h.bin
+  ...
+  |-- intel-ucode
+  ...
+  |   |-- 06-3a-09
+  ...
+
+so that the build system can find those files and integrate them into
+the final kernel image. The early loader finds them and applies them.
+
+Needless to say, this method is not the most flexible one because it
+requires rebuilding the kernel each time updated microcode from the CPU
+vendor is available.
diff --git a/Documentation/arch/x86/mtrr.rst b/Documentation/arch/x86/mtrr.rst
new file mode 100644
index 000000000000..f65ef034da7a
--- /dev/null
+++ b/Documentation/arch/x86/mtrr.rst
@@ -0,0 +1,354 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+MTRR (Memory Type Range Register) control
+=========================================
+
+:Authors: - Richard Gooch <rgooch@atnf.csiro.au> - 3 Jun 1999
+          - Luis R. Rodriguez <mcgrof@do-not-panic.com> - April 9, 2015
+
+
+Phasing out MTRR use
+====================
+
+MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by
+drivers on Linux is now completely phased out, device drivers should use
+arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on
+non-PAT systems while a no-op but equally effective on PAT enabled systems.
+
+Even if Linux does not use MTRRs directly, some x86 platform firmware may still
+set up MTRRs early before booting the OS. They do this as some platform
+firmware may still have implemented access to MTRRs which would be controlled
+and handled by the platform firmware directly. An example of platform use of
+MTRRs is through the use of SMI handlers, one case could be for fan control,
+the platform code would need uncachable access to some of its fan control
+registers. Such platform access does not need any Operating System MTRR code in
+place other than mtrr_type_lookup() to ensure any OS specific mapping requests
+are aligned with platform MTRR setup. If MTRRs are only set up by the platform
+firmware code though and the OS does not make any specific MTRR mapping
+requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
+
+For details refer to Documentation/arch/x86/pat.rst.
+
+.. tip::
+  On Intel P6 family processors (Pentium Pro, Pentium II and later)
+  the Memory Type Range Registers (MTRRs) may be used to control
+  processor access to memory ranges. This is most useful when you have
+  a video (VGA) card on a PCI or AGP bus. Enabling write-combining
+  allows bus write transfers to be combined into a larger transfer
+  before bursting over the PCI/AGP bus. This can increase performance
+  of image write operations 2.5 times or more.
+
+  The Cyrix 6x86, 6x86MX and M II processors have Address Range
+  Registers (ARRs) which provide a similar functionality to MTRRs. For
+  these, the ARRs are used to emulate the MTRRs.
+
+  The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
+  MTRRs. These are supported.  The AMD Athlon family provide 8 Intel
+  style MTRRs.
+
+  The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
+  are supported.
+
+  The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
+
+  The CONFIG_MTRR option creates a /proc/mtrr file which may be used
+  to manipulate your MTRRs. Typically the X server should use
+  this. This should have a reasonably generic interface so that
+  similar control registers on other processors can be easily
+  supported.
+
+There are two interfaces to /proc/mtrr: one is an ASCII interface
+which allows you to read and write. The other is an ioctl()
+interface. The ASCII interface is meant for administration. The
+ioctl() interface is meant for C programs (i.e. the X server). The
+interfaces are described below, with sample commands and C code.
+
+
+Reading MTRRs from the shell
+============================
+::
+
+  % cat /proc/mtrr
+  reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
+  reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
+
+Creating MTRRs from the C-shell::
+
+  # echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
+
+or if you use bash::
+
+  # echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
+
+And the result thereof::
+
+  % cat /proc/mtrr
+  reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
+  reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
+  reg02: base=0xf8000000 (3968MB), size=   4MB: write-combining, count=1
+
+This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
+find out your base address, you need to look at the output of your X
+server, which tells you where the linear framebuffer address is. A
+typical line that you may get is::
+
+  (--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
+
+Note that you should only use the value from the X server, as it may
+move the framebuffer base address, so the only value you can trust is
+that reported by the X server.
+
+To find out the size of your framebuffer (what, you don't actually
+know?), the following line will tell you::
+
+  (--) S3: videoram:  4096k
+
+That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
+A patch is being written for XFree86 which will make this automatic:
+in other words the X server will manipulate /proc/mtrr using the
+ioctl() interface, so users won't have to do anything. If you use a
+commercial X server, lobby your vendor to add support for MTRRs.
+
+
+Creating overlapping MTRRs
+==========================
+::
+
+  %echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
+  %echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
+
+And the results::
+
+  % cat /proc/mtrr
+  reg00: base=0x00000000 (   0MB), size=  64MB: write-back, count=1
+  reg01: base=0xfb000000 (4016MB), size=  16MB: write-combining, count=1
+  reg02: base=0xfb000000 (4016MB), size=   4kB: uncachable, count=1
+
+Some cards (especially Voodoo Graphics boards) need this 4 kB area
+excluded from the beginning of the region because it is used for
+registers.
+
+NOTE: You can only create type=uncachable region, if the first
+region that you created is type=write-combining.
+
+
+Removing MTRRs from the C-shel
+==============================
+::
+
+  % echo "disable=2" >! /proc/mtrr
+
+or using bash::
+
+  % echo "disable=2" >| /proc/mtrr
+
+
+Reading MTRRs from a C program using ioctl()'s
+==============================================
+::
+
+  /*  mtrr-show.c
+
+      Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
+
+      Copyright (C) 1997-1998  Richard Gooch
+
+      This program is free software; you can redistribute it and/or modify
+      it under the terms of the GNU General Public License as published by
+      the Free Software Foundation; either version 2 of the License, or
+      (at your option) any later version.
+
+      This program is distributed in the hope that it will be useful,
+      but WITHOUT ANY WARRANTY; without even the implied warranty of
+      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+      GNU General Public License for more details.
+
+      You should have received a copy of the GNU General Public License
+      along with this program; if not, write to the Free Software
+      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+      Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
+      The postal address is:
+        Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
+  */
+
+  /*
+      This program will use an ioctl() on /proc/mtrr to show the current MTRR
+      settings. This is an alternative to reading /proc/mtrr.
+
+
+      Written by      Richard Gooch   17-DEC-1997
+
+      Last updated by Richard Gooch   2-MAY-1998
+
+
+  */
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <string.h>
+  #include <sys/types.h>
+  #include <sys/stat.h>
+  #include <fcntl.h>
+  #include <sys/ioctl.h>
+  #include <errno.h>
+  #include <asm/mtrr.h>
+
+  #define TRUE 1
+  #define FALSE 0
+  #define ERRSTRING strerror (errno)
+
+  static char *mtrr_strings[MTRR_NUM_TYPES] =
+  {
+      "uncachable",               /* 0 */
+      "write-combining",          /* 1 */
+      "?",                        /* 2 */
+      "?",                        /* 3 */
+      "write-through",            /* 4 */
+      "write-protect",            /* 5 */
+      "write-back",               /* 6 */
+  };
+
+  int main ()
+  {
+      int fd;
+      struct mtrr_gentry gentry;
+
+      if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
+      {
+    if (errno == ENOENT)
+    {
+        fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
+        stderr);
+        exit (1);
+    }
+    fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
+    exit (2);
+      }
+      for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
+    ++gentry.regnum)
+      {
+    if (gentry.size < 1)
+    {
+        fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
+        continue;
+    }
+    fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
+      gentry.regnum, gentry.base, gentry.size,
+      mtrr_strings[gentry.type]);
+      }
+      if (errno == EINVAL) exit (0);
+      fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
+      exit (3);
+  }   /*  End Function main  */
+
+
+Creating MTRRs from a C programme using ioctl()'s
+=================================================
+::
+
+  /*  mtrr-add.c
+
+      Source file for mtrr-add (example programme to add an MTRRs using ioctl())
+
+      Copyright (C) 1997-1998  Richard Gooch
+
+      This program is free software; you can redistribute it and/or modify
+      it under the terms of the GNU General Public License as published by
+      the Free Software Foundation; either version 2 of the License, or
+      (at your option) any later version.
+
+      This program is distributed in the hope that it will be useful,
+      but WITHOUT ANY WARRANTY; without even the implied warranty of
+      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+      GNU General Public License for more details.
+
+      You should have received a copy of the GNU General Public License
+      along with this program; if not, write to the Free Software
+      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+      Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
+      The postal address is:
+        Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
+  */
+
+  /*
+      This programme will use an ioctl() on /proc/mtrr to add an entry. The first
+      available mtrr is used. This is an alternative to writing /proc/mtrr.
+
+
+      Written by      Richard Gooch   17-DEC-1997
+
+      Last updated by Richard Gooch   2-MAY-1998
+
+
+  */
+  #include <stdio.h>
+  #include <string.h>
+  #include <stdlib.h>
+  #include <unistd.h>
+  #include <sys/types.h>
+  #include <sys/stat.h>
+  #include <fcntl.h>
+  #include <sys/ioctl.h>
+  #include <errno.h>
+  #include <asm/mtrr.h>
+
+  #define TRUE 1
+  #define FALSE 0
+  #define ERRSTRING strerror (errno)
+
+  static char *mtrr_strings[MTRR_NUM_TYPES] =
+  {
+      "uncachable",               /* 0 */
+      "write-combining",          /* 1 */
+      "?",                        /* 2 */
+      "?",                        /* 3 */
+      "write-through",            /* 4 */
+      "write-protect",            /* 5 */
+      "write-back",               /* 6 */
+  };
+
+  int main (int argc, char **argv)
+  {
+      int fd;
+      struct mtrr_sentry sentry;
+
+      if (argc != 4)
+      {
+    fprintf (stderr, "Usage:\tmtrr-add base size type\n");
+    exit (1);
+      }
+      sentry.base = strtoul (argv[1], NULL, 0);
+      sentry.size = strtoul (argv[2], NULL, 0);
+      for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
+      {
+    if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
+      }
+      if (sentry.type >= MTRR_NUM_TYPES)
+      {
+    fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
+    exit (2);
+      }
+      if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
+      {
+    if (errno == ENOENT)
+    {
+        fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
+        stderr);
+        exit (3);
+    }
+    fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
+    exit (4);
+      }
+      if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
+      {
+    fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
+    exit (5);
+      }
+      fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
+      sleep (5);
+      close (fd);
+      fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
+      stderr);
+  }   /*  End Function main  */
diff --git a/Documentation/arch/x86/orc-unwinder.rst b/Documentation/arch/x86/orc-unwinder.rst
new file mode 100644
index 000000000000..cdb257015bd9
--- /dev/null
+++ b/Documentation/arch/x86/orc-unwinder.rst
@@ -0,0 +1,182 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+ORC unwinder
+============
+
+Overview
+========
+
+The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+=====================
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1]_ have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+============
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers.  This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep.  In a basic
+performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF.  ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations.  But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames.  But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+===========================
+
+The ORC data is generated by objtool.  With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch.  So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data.  However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files.  But asm annotations were tried
+in the past and were found to be unmaintainable.  They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code.  And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations.  It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to workaround issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow.  If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete.  (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation.  Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis.  But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+===============================
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/objtool.txt.  After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact.  The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime.  The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+=========
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies.  Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3]_  Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+.. [1] https://lore.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+.. [2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+.. [3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/Documentation/arch/x86/pat.rst b/Documentation/arch/x86/pat.rst
new file mode 100644
index 000000000000..5d901771016d
--- /dev/null
+++ b/Documentation/arch/x86/pat.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PAT (Page Attribute Table)
+==========================
+
+x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
+page level granularity. PAT is complementary to the MTRR settings which allows
+for setting of memory types over physical address ranges. However, PAT is
+more flexible than MTRR due to its capability to set attributes at page level
+and also due to the fact that there are no hardware limitations on number of
+such attribute settings allowed. Added flexibility comes with guidelines for
+not having memory type aliasing for the same physical memory with multiple
+virtual addresses.
+
+PAT allows for different types of memory attributes. The most commonly used
+ones that will be supported at this time are:
+
+===  ==============
+WB   Write-back
+UC   Uncached
+WC   Write-combined
+WT   Write-through
+UC-  Uncached Minus
+===  ==============
+
+
+PAT APIs
+========
+
+There are many different APIs in the kernel that allows setting of memory
+attributes at the page level. In order to avoid aliasing, these interfaces
+should be used thoughtfully. Below is a table of interfaces available,
+their intended usage and their memory attribute relationships. Internally,
+these APIs use a reserve_memtype()/free_memtype() interface on the physical
+address range to avoid any aliasing.
+
++------------------------+----------+--------------+------------------+
+| API                    |    RAM   |  ACPI,...    |  Reserved/Holes  |
++------------------------+----------+--------------+------------------+
+| ioremap                |    --    |    UC-       |       UC-        |
++------------------------+----------+--------------+------------------+
+| ioremap_cache          |    --    |    WB        |       WB         |
++------------------------+----------+--------------+------------------+
+| ioremap_uc             |    --    |    UC        |       UC         |
++------------------------+----------+--------------+------------------+
+| ioremap_wc             |    --    |    --        |       WC         |
++------------------------+----------+--------------+------------------+
+| ioremap_wt             |    --    |    --        |       WT         |
++------------------------+----------+--------------+------------------+
+| set_memory_uc,         |    UC-   |    --        |       --         |
+| set_memory_wb          |          |              |                  |
++------------------------+----------+--------------+------------------+
+| set_memory_wc,         |    WC    |    --        |       --         |
+| set_memory_wb          |          |              |                  |
++------------------------+----------+--------------+------------------+
+| set_memory_wt,         |    WT    |    --        |       --         |
+| set_memory_wb          |          |              |                  |
++------------------------+----------+--------------+------------------+
+| pci sysfs resource     |    --    |    --        |       UC-        |
++------------------------+----------+--------------+------------------+
+| pci sysfs resource_wc  |    --    |    --        |       WC         |
+| is IORESOURCE_PREFETCH |          |              |                  |
++------------------------+----------+--------------+------------------+
+| pci proc               |    --    |    --        |       UC-        |
+| !PCIIOC_WRITE_COMBINE  |          |              |                  |
++------------------------+----------+--------------+------------------+
+| pci proc               |    --    |    --        |       WC         |
+| PCIIOC_WRITE_COMBINE   |          |              |                  |
++------------------------+----------+--------------+------------------+
+| /dev/mem               |    --    |   WB/WC/UC-  |    WB/WC/UC-     |
+| read-write             |          |              |                  |
++------------------------+----------+--------------+------------------+
+| /dev/mem               |    --    |    UC-       |       UC-        |
+| mmap SYNC flag         |          |              |                  |
++------------------------+----------+--------------+------------------+
+| /dev/mem               |    --    |   WB/WC/UC-  |  WB/WC/UC-       |
+| mmap !SYNC flag        |          |              |                  |
+| and                    |          |(from existing|  (from existing  |
+| any alias to this area |          |alias)        |  alias)          |
++------------------------+----------+--------------+------------------+
+| /dev/mem               |    --    |    WB        |       WB         |
+| mmap !SYNC flag        |          |              |                  |
+| no alias to this area  |          |              |                  |
+| and                    |          |              |                  |
+| MTRR says WB           |          |              |                  |
++------------------------+----------+--------------+------------------+
+| /dev/mem               |    --    |    --        |       UC-        |
+| mmap !SYNC flag        |          |              |                  |
+| no alias to this area  |          |              |                  |
+| and                    |          |              |                  |
+| MTRR says !WB          |          |              |                  |
++------------------------+----------+--------------+------------------+
+
+
+Advanced APIs for drivers
+=========================
+
+A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
+vmf_insert_pfn.
+
+Drivers wanting to export some pages to userspace do it by using mmap
+interface and a combination of:
+
+  1) pgprot_noncached()
+  2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
+
+With PAT support, a new API pgprot_writecombine is being added. So, drivers can
+continue to use the above sequence, with either pgprot_noncached() or
+pgprot_writecombine() in step 1, followed by step 2.
+
+In addition, step 2 internally tracks the region as UC or WC in memtype
+list in order to ensure no conflicting mapping.
+
+Note that this set of APIs only works with IO (non RAM) regions. If driver
+wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
+as step 0 above and also track the usage of those pages and use set_memory_wb()
+before the page is freed to free pool.
+
+MTRR effects on PAT / non-PAT systems
+=====================================
+
+The following table provides the effects of using write-combining MTRRs when
+using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally
+mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will
+be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add()
+is made, should already have been ioremapped with WC attributes or PAT entries,
+this can be done by using ioremap_wc() / set_memory_wc().  Devices which
+combine areas of IO memory desired to remain uncacheable with areas where
+write-combining is desirable should consider use of ioremap_uc() followed by
+set_memory_wc() to white-list effective write-combined areas.  Such use is
+nevertheless discouraged as the effective memory type is considered
+implementation defined, yet this strategy can be used as last resort on devices
+with size-constrained regions where otherwise MTRR write-combining would
+otherwise not be effective.
+::
+
+  ====  =======  ===  =========================  =====================
+  MTRR  Non-PAT  PAT  Linux ioremap value        Effective memory type
+  ====  =======  ===  =========================  =====================
+        PAT                                        Non-PAT |  PAT
+        |PCD                                               |
+        ||PWT                                              |
+        |||                                                |
+  WC    000      WB   _PAGE_CACHE_MODE_WB             WC   |   WC
+  WC    001      WC   _PAGE_CACHE_MODE_WC             WC*  |   WC
+  WC    010      UC-  _PAGE_CACHE_MODE_UC_MINUS       WC*  |   UC
+  WC    011      UC   _PAGE_CACHE_MODE_UC             UC   |   UC
+  ====  =======  ===  =========================  =====================
+
+  (*) denotes implementation defined and is discouraged
+
+.. note:: -- in the above table mean "Not suggested usage for the API". Some
+  of the --'s are strictly enforced by the kernel. Some others are not really
+  enforced today, but may be enforced in future.
+
+For ioremap and pci access through /sys or /proc - The actual type returned
+can be more restrictive, in case of any existing aliasing for that address.
+For example: If there is an existing uncached mapping, a new ioremap_wc can
+return uncached mapping in place of write-combine requested.
+
+set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver
+will first make a region uc, wc or wt and switch it back to wb after use.
+
+Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
+interfaces. Users writing to /proc/mtrr are suggested to use above interfaces.
+
+Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
+types.
+
+Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges.
+
+
+PAT debugging
+=============
+
+With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by::
+
+  # mount -t debugfs debugfs /sys/kernel/debug
+  # cat /sys/kernel/debug/x86/pat_memtype_list
+  PAT memtype list:
+  uncached-minus @ 0x7fadf000-0x7fae0000
+  uncached-minus @ 0x7fb19000-0x7fb1a000
+  uncached-minus @ 0x7fb1a000-0x7fb1b000
+  uncached-minus @ 0x7fb1b000-0x7fb1c000
+  uncached-minus @ 0x7fb1c000-0x7fb1d000
+  uncached-minus @ 0x7fb1d000-0x7fb1e000
+  uncached-minus @ 0x7fb1e000-0x7fb25000
+  uncached-minus @ 0x7fb25000-0x7fb26000
+  uncached-minus @ 0x7fb26000-0x7fb27000
+  uncached-minus @ 0x7fb27000-0x7fb28000
+  uncached-minus @ 0x7fb28000-0x7fb2e000
+  uncached-minus @ 0x7fb2e000-0x7fb2f000
+  uncached-minus @ 0x7fb2f000-0x7fb30000
+  uncached-minus @ 0x7fb31000-0x7fb32000
+  uncached-minus @ 0x80000000-0x90000000
+
+This list shows physical address ranges and various PAT settings used to
+access those physical address ranges.
+
+Another, more verbose way of getting PAT related debug messages is with
+"debugpat" boot parameter. With this parameter, various debug messages are
+printed to dmesg log.
+
+PAT Initialization
+==================
+
+The following table describes how PAT is initialized under various
+configurations. The PAT MSR must be updated by Linux in order to support WC
+and WT attributes. Otherwise, the PAT MSR has the value programmed in it
+by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests.
+
+ ==== ===== ==========================  =========  =======
+ MTRR PAT   Call Sequence               PAT State  PAT MSR
+ ==== ===== ==========================  =========  =======
+ E    E     MTRR -> PAT init            Enabled    OS
+ E    D     MTRR -> PAT init            Disabled    -
+ D    E     MTRR -> PAT disable         Disabled   BIOS
+ D    D     MTRR -> PAT disable         Disabled    -
+ -    np/E  PAT  -> PAT disable         Disabled   BIOS
+ -    np/D  PAT  -> PAT disable         Disabled    -
+ E    !P/E  MTRR -> PAT init            Disabled   BIOS
+ D    !P/E  MTRR -> PAT disable         Disabled   BIOS
+ !M   !P/E  MTRR stub -> PAT disable    Disabled   BIOS
+ ==== ===== ==========================  =========  =======
+
+  Legend
+
+ ========= =======================================
+ E         Feature enabled in CPU
+ D	   Feature disabled/unsupported in CPU
+ np	   "nopat" boot option specified
+ !P	   CONFIG_X86_PAT option unset
+ !M	   CONFIG_MTRR option unset
+ Enabled   PAT state set to enabled
+ Disabled  PAT state set to disabled
+ OS        PAT initializes PAT MSR with OS setting
+ BIOS      PAT keeps PAT MSR with BIOS setting
+ ========= =======================================
+
diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst
new file mode 100644
index 000000000000..4b858a9bad8d
--- /dev/null
+++ b/Documentation/arch/x86/pti.rst
@@ -0,0 +1,195 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Page Table Isolation (PTI)
+==========================
+
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER [1]_) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach [2]_.
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications.  When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy.  When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT).  There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled.  It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI.  This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level.  This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel.  This data is entirely contained in the 'struct
+cpu_entry_area' structure which is placed in the fixmap which gives
+each CPU's copy of the area a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal.  The only difference is when the kernel
+makes entries in the top (PGD) level.  In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables.  This leaves a single, shared set of
+userspace page tables to manage.  One PTE to lock, one set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes an additional 4k per process).
+  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+     aligned so that it can be mapped by setting a single PMD
+     entry.  This consumes nearly 2MB of RAM once the kernel
+     is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and are required at every entry and exit.
+  b. A "trampoline" must be used for SYSCALL entry.  This
+     trampoline depends on a smaller set of resources than the
+     non-PTI SYSCALL entry code, so requires mapping fewer
+     things into the userspace page tables.  The downside is
+     that stacks must be switched at entry time.
+  c. Global pages are disabled for all kernel structures not
+     mapped into both kernel and userspace page tables.  This
+     feature of the MMU allows different processes to share TLB
+     entries mapping the kernel.  Losing the feature means more
+     TLB misses after a context switch.  The actual loss of
+     performance is very small, however, never exceeding 1%.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching page
+     tables by setting a special bit in CR3 when the page tables
+     are changed.  This makes switching the page tables (at context
+     switch, or kernel entry/exit) cheaper.  But, on systems with
+     PCID support, the context switch code must flush both the user
+     and kernel entries out of the TLB.  The user PCID TLB flush is
+     deferred until the exit to userspace, minimizing the cost.
+     See intel.com/sdm for the gory PCID/INVPCID details.
+  e. The userspace page tables must be populated for each new
+     process.  Even without PTI, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process.  But, with PTI, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one for the entry/exit structures.  At fork(), we need to
+     copy both.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the userspace PGD any time a set_pgd() is done
+     on a PGD used to map userspace.  This ensures that the kernel
+     and userspace copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+  h. INVPCID is a TLB-flushing instruction which allows flushing
+     of TLB entries for non-current PCIDs.  Some systems support
+     PCIDs, but do not support INVPCID.  On these systems, addresses
+     can only be flushed from the TLB for the current PCID.  When
+     flushing a kernel address, we need to flush all PCIDs, so a
+     single kernel address flush will require a TLB-flushing CR3
+     write upon the next use of every PCID.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Allow PTI to be enabled/disabled at runtime in addition to the
+   boot-time switching.
+
+Testing
+========
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+   (excluding MPX and protection_keys) in a loop on multiple CPUs for
+   several minutes.  These tests frequently uncover corner cases in the
+   kernel entry code.  In general, old kernels might cause these tests
+   themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+   frequent performance monitoring non-maskable interrupts (see "NMI"
+   in /proc/interrupts).  This exercises the NMI entry/exit code which
+   is known to trigger bugs in code paths that did not expect to be
+   interrupted, including nested NMIs.  Using "-c" boosts the rate of
+   NMIs, and using two -c with separate counters encourages nested NMIs
+   and less deterministic behavior.
+   ::
+
+	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+   This has been a lightly-tested code path and needs extra scrutiny.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-pti-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
+.. [1] https://gruss.cc/files/kaiser.pdf
+.. [2] https://meltdownattack.com/meltdown.pdf
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
new file mode 100644
index 000000000000..387ccbcb558f
--- /dev/null
+++ b/Documentation/arch/x86/resctrl.rst
@@ -0,0 +1,1447 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+User Interface for Resource Control feature
+===========================================
+
+:Copyright: |copy| 2016 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+          - Tony Luck <tony.luck@intel.com>
+          - Vikas Shivappa <vikas.shivappa@intel.com>
+
+
+Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT).
+AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
+
+This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
+flag bits:
+
+===============================================	================================
+RDT (Resource Director Technology) Allocation	"rdt_a"
+CAT (Cache Allocation Technology)		"cat_l3", "cat_l2"
+CDP (Code and Data Prioritization)		"cdp_l3", "cdp_l2"
+CQM (Cache QoS Monitoring)			"cqm_llc", "cqm_occup_llc"
+MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
+MBA (Memory Bandwidth Allocation)		"mba"
+SMBA (Slow Memory Bandwidth Allocation)         ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+===============================================	================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
+
+To use the feature mount the file system::
+
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
+
+mount options are:
+
+"cdp":
+	Enable code/data prioritization in L3 cache allocations.
+"cdpl2":
+	Enable code/data prioritization in L2 cache allocations.
+"mba_MBps":
+	Enable the MBA Software Controller(mba_sc) to specify MBA
+	bandwidth in MBps
+
+L2 and L3 CDP are controlled separately.
+
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control.  Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
+
+
+The mount succeeds if either of allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
+
+Info directory
+==============
+
+The 'info' directory contains information about the enabled
+resources. Each resource has its own subdirectory. The subdirectory
+names reflect the resource names.
+
+Each subdirectory contains the following files with respect to
+allocation:
+
+Cache resource(L3/L2)  subdirectory contains the following files
+related to allocation:
+
+"num_closids":
+		The number of CLOSIDs which are valid for this
+		resource. The kernel uses the smallest number of
+		CLOSIDs of all enabled resources as limit.
+"cbm_mask":
+		The bitmask which is valid for this resource.
+		This mask is equivalent to 100%.
+"min_cbm_bits":
+		The minimum number of consecutive bits which
+		must be set when writing a mask.
+
+"shareable_bits":
+		Bitmask of shareable resource with other executing
+		entities (e.g. I/O). User can use this when
+		setting up exclusive cache partitions. Note that
+		some platforms support devices that have their
+		own settings for cache use which can over-ride
+		these bits.
+"bit_usage":
+		Annotated capacity bitmasks showing how all
+		instances of the resource are used. The legend is:
+
+			"0":
+			      Corresponding region is unused. When the system's
+			      resources have been allocated and a "0" is found
+			      in "bit_usage" it is a sign that resources are
+			      wasted.
+
+			"H":
+			      Corresponding region is used by hardware only
+			      but available for software use. If a resource
+			      has bits set in "shareable_bits" but not all
+			      of these bits appear in the resource groups'
+			      schematas then the bits appearing in
+			      "shareable_bits" but no resource group will
+			      be marked as "H".
+			"X":
+			      Corresponding region is available for sharing and
+			      used by hardware and software. These are the
+			      bits that appear in "shareable_bits" as
+			      well as a resource group's allocation.
+			"S":
+			      Corresponding region is used by software
+			      and available for sharing.
+			"E":
+			      Corresponding region is used exclusively by
+			      one resource group. No sharing allowed.
+			"P":
+			      Corresponding region is pseudo-locked. No
+			      sharing allowed.
+
+Memory bandwidth(MB) subdirectory contains the following files
+with respect to allocation:
+
+"min_bandwidth":
+		The minimum memory bandwidth percentage which
+		user can request.
+
+"bandwidth_gran":
+		The granularity in which the memory bandwidth
+		percentage is allocated. The allocated
+		b/w percentage is rounded off to the next
+		control step available on the hardware. The
+		available bandwidth control steps are:
+		min_bandwidth + N * bandwidth_gran.
+
+"delay_linear":
+		Indicates if the delay scale is linear or
+		non-linear. This field is purely informational
+		only.
+
+"thread_throttle_mode":
+		Indicator on Intel systems of how tasks running on threads
+		of a physical core are throttled in cases where they
+		request different memory bandwidth percentages:
+
+		"max":
+			the smallest percentage is applied
+			to all threads
+		"per-thread":
+			bandwidth percentages are directly applied to
+			the threads running on the core
+
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids":
+		The number of RMIDs available. This is the
+		upper bound for how many "CTRL_MON" + "MON"
+		groups can be created.
+
+"mon_features":
+		Lists the monitoring events if
+		monitoring is enabled for the resource.
+		Example::
+
+			# cat /sys/fs/resctrl/info/L3_MON/mon_features
+			llc_occupancy
+			mbm_total_bytes
+			mbm_local_bytes
+
+		If the system supports Bandwidth Monitoring Event
+		Configuration (BMEC), then the bandwidth events will
+		be configurable. The output will be::
+
+			# cat /sys/fs/resctrl/info/L3_MON/mon_features
+			llc_occupancy
+			mbm_total_bytes
+			mbm_total_bytes_config
+			mbm_local_bytes
+			mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+	Read/write files containing the configuration for the mbm_total_bytes
+	and mbm_local_bytes events, respectively, when the Bandwidth
+	Monitoring Event Configuration (BMEC) feature is supported.
+	The event configuration settings are domain specific and affect
+	all the CPUs in the domain. When either event configuration is
+	changed, the bandwidth counters for all RMIDs of both events
+	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+	domain. The next read for every RMID will report "Unavailable"
+	and subsequent reads will report the valid value.
+
+	Following are the types of events supported:
+
+	====    ========================================================
+	Bits    Description
+	====    ========================================================
+	6       Dirty Victims from the QOS domain to all types of memory
+	5       Reads to slow memory in the non-local NUMA domain
+	4       Reads to slow memory in the local NUMA domain
+	3       Non-temporal writes to non-local NUMA domain
+	2       Non-temporal writes to local NUMA domain
+	1       Reads to memory in the non-local NUMA domain
+	0       Reads to memory in the local NUMA domain
+	====    ========================================================
+
+	By default, the mbm_total_bytes configuration is set to 0x7f to count
+	all the event types and the mbm_local_bytes configuration is set to
+	0x15 to count all the local memory events.
+
+	Examples:
+
+	* To view the current configuration::
+	  ::
+
+	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+	    0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+	    0=0x15;1=0x15;3=0x15;4=0x15
+
+	* To change the mbm_total_bytes to count only reads on domain 0,
+	  the bits 0, 1, 4 and 5 needs to be set, which is 110011b in binary
+	  (in hexadecimal 0x33):
+	  ::
+
+	    # echo  "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+	    0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+	* To change the mbm_local_bytes to count all the slow memory reads on
+	  domain 0 and 1, the bits 4 and 5 needs to be set, which is 110000b
+	  in binary (in hexadecimal 0x30):
+	  ::
+
+	    # echo  "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+	    0=0x30;1=0x30;3=0x15;4=0x15
+
+"max_threshold_occupancy":
+		Read/write file provides the largest value (in
+		bytes) at which a previously used LLC_occupancy
+		counter can be considered for re-use.
+
+Finally, in the top level of the "info" directory there is a file
+named "last_cmd_status". This is reset with every "command" issued
+via the file system (making new directories or writing to any of the
+control files). If the command was successful, it will read as "ok".
+If the command failed, it will provide more information that can be
+conveyed in the error returns from file operations. E.g.
+::
+
+	# echo L3:0=f7 > schemata
+	bash: echo: write error: Invalid argument
+	# cat info/last_cmd_status
+	mask f7 has non-consecutive 1-bits
+
+Resource alloc and monitor groups
+=================================
+
+Resource groups are represented as directories in the resctrl file
+system.  The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+All groups contain the following files:
+
+"tasks":
+	Reading this file shows the list of all tasks that belong to
+	this group. Writing a task id to the file will add a task to the
+	group. If the group is a CTRL_MON group the task is removed from
+	whichever previous CTRL_MON group owned the task and also from
+	any MON group that owned the task. If the group is a MON group,
+	then the task must already belong to the CTRL_MON parent of this
+	group. The task is removed from any previous MON group.
+
+
+"cpus":
+	Reading this file shows a bitmask of the logical CPUs owned by
+	this group. Writing a mask to this file will add and remove
+	CPUs to/from this group. As with the tasks file a hierarchy is
+	maintained where MON groups may only include CPUs owned by the
+	parent CTRL_MON group.
+	When the resource group is in pseudo-locked mode this file will
+	only be readable, reflecting the CPUs associated with the
+	pseudo-locked region.
+
+
+"cpus_list":
+	Just like "cpus", only using ranges of CPUs instead of bitmasks.
+
+
+When control is enabled all CTRL_MON groups will also contain:
+
+"schemata":
+	A list of all the resources available to this group.
+	Each resource has its own line and format - see below for details.
+
+"size":
+	Mirrors the display of the "schemata" file to display the size in
+	bytes of each allocation instead of the bits representing the
+	allocation.
+
+"mode":
+	The "mode" of the resource group dictates the sharing of its
+	allocations. A "shareable" resource group allows sharing of its
+	allocations while an "exclusive" resource group does not. A
+	cache pseudo-locked region is created by first writing
+	"pseudo-locksetup" to the "mode" file before writing the cache
+	pseudo-locked region's schemata to the resource group's "schemata"
+	file. On successful pseudo-locked region creation the mode will
+	automatically change to "pseudo-locked".
+
+When monitoring is enabled all MON groups will also contain:
+
+"mon_data":
+	This contains a set of files organized by L3 domain and by
+	RDT event. E.g. on a system with two L3 domains there will
+	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
+	directories have one file per event (e.g. "llc_occupancy",
+	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+	files provide a read out of the current value of the event for
+	all tasks in the group. In CTRL_MON groups these files provide
+	the sum for all tasks in the CTRL_MON group and all tasks in
+	MON groups. Please see example section for more details on usage.
+
+Resource allocation rules
+-------------------------
+
+When a task is running the following rules define which resources are
+available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+   for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+   CPU that is assigned to some specific group, then the schemata for the
+   CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+   then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+   on a CPU that is assigned to some specific group, then the RDT events
+   for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+   "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+===============================================
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
+to identify a control group and a monitoring group respectively. Each of
+the resource groups are mapped to these IDs based on the kind of group. The
+number of CLOSid and RMID are limited by the hardware and hence the creation of
+a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
+and creation of "MON" group may fail if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID once freed may not be immediately available for use as
+the RMID is still tagged the cache lines of the previous user of RMID.
+Hence such RMIDs are placed on limbo list and checked back if the cache
+occupancy has gone down. If there is a time when system has a lot of
+limbo RMIDs but which are not ready to be used, user may see an -EBUSY
+during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical cpus sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps).  To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
+of the capacity of the cache. You could partition the cache into four
+equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+
+Memory bandwidth Allocation and monitoring
+==========================================
+
+For Memory bandwidth resource, by default the user controls the resource
+by indicating the percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each cpu model is predefined
+and can be looked up through "info/MB/min_bandwidth". The bandwidth
+granularity that is allocated is also dependent on the cpu model and can
+be looked up at "info/MB/bandwidth_gran". The available bandwidth
+control steps are: min_bw + N * bw_gran. Intermediate values are rounded
+to the next control step available on the hardware.
+
+The bandwidth throttling is a core specific mechanism on some of Intel
+SKUs. Using a high bandwidth and a low bandwidth setting on two threads
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory bandwidth allocation(MBA) may be a core
+specific mechanism where as memory bandwidth monitoring(MBM) is done at
+the package level may lead to confusion when users try to apply control
+via the MBA and then monitor the bandwidth to see if the controls are
+effective. Below are such scenarios:
+
+1. User may *not* see increase in actual bandwidth when percentage
+   values are increased:
+
+This can occur when aggregate L2 external bandwidth is more than L3
+external bandwidth. Consider an SKL SKU with 24 cores on a package and
+where L2 external  is 10GBps (hence aggregate L2 external bandwidth is
+240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
+threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
+bandwidth of 100GBps although the percentage value specified is only 50%
+<< 100%. Hence increasing the bandwidth percentage will not yield any
+more bandwidth. This is because although the L2 external bandwidth still
+has capacity, the L3 external bandwidth is fully used. Also note that
+this would be dependent on number of cores the benchmark is run on.
+
+2. Same bandwidth percentage may mean different actual bandwidth
+   depending on # of threads:
+
+For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
+thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
+they have same percentage bandwidth of 10%. This is simply because as
+threads start using more cores in an rdtgroup, the actual bandwidth may
+increase or vary although user specified bandwidth percentage is same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MBps as well.  The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller(mba_sc)" which reads the actual bandwidth using MBM counters
+and adjust the memory bandwidth percentages to ensure::
+
+	"actual bandwidth < user specified bandwidth".
+
+By default, the schemata would take the bandwidth percentage values
+where as user can switch to the "MBA software controller" mode using
+a mount option 'mba_MBps'. The schemata format is specified in the below
+sections.
+
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
+With CDP disabled the L3 schemata format is::
+
+	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this::
+
+	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L2 schemata file details
+------------------------
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
+
+	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+or
+
+	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+
+Memory bandwidth Allocation (default mode)
+------------------------------------------
+
+Memory b/w domain is L3 cache.
+::
+
+	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Memory bandwidth Allocation specified in MBps
+---------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of slow memory
+devices presence. If there are no such devices on the system, then
+configuring SMBA will have no impact on the performance of the system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Reading/writing the schemata file
+---------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change.  E.g.
+::
+
+  # cat schemata
+  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
+  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+  # echo "L3DATA:2=3c0;" > schemata
+  # cat schemata
+  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
+  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s.
+When writing to the file, you need to specify what cache id you wish to
+configure the bandwidth limit.
+
+For example, to allocate 2GB/s limit on the first cache id:
+
+::
+
+  # cat schemata
+    MB:0=2048;1=2048;2=2048;3=2048
+    L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+  # echo "MB:1=16" > schemata
+  # cat schemata
+    MB:0=2048;1=  16;2=2048;3=2048
+    L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as without SMBA in
+above section.
+
+For example, to allocate 8GB/s limit on the first cache id:
+
+::
+
+  # cat schemata
+    SMBA:0=2048;1=2048;2=2048;3=2048
+      MB:0=2048;1=2048;2=2048;3=2048
+      L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+  # echo "SMBA:1=64" > schemata
+  # cat schemata
+    SMBA:0=2048;1=  64;2=2048;3=2048
+      MB:0=2048;1=2048;2=2048;3=2048
+      L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Cache Pseudo-Locking
+====================
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+The creation of a cache pseudo-locked region is triggered by a request
+from the user to do so that is accompanied by a schemata of the region
+to be pseudo-locked. The cache pseudo-locked region is created as follows:
+
+- Create a CAT allocation CLOSNEW with a CBM matching the schemata
+  from the user of the cache region that will contain the pseudo-locked
+  memory. This region must not overlap with any current CAT allocation/CLOS
+  on the system and no future overlap with this cache region is allowed
+  while the pseudo-locked region exists.
+- Create a contiguous region of memory of the same size as the cache
+  region.
+- Flush the cache, disable hardware prefetchers, disable preemption.
+- Make CLOSNEW the active CLOS and touch the allocated memory to load
+  it into the cache.
+- Set the previous CLOS as active.
+- At this point the closid CLOSNEW can be released - the cache
+  pseudo-locked region is protected as long as its CBM does not appear in
+  any CAT allocation. Even though the cache pseudo-locked region will from
+  this point on not appear in any CBM of any CLOS an application running with
+  any CLOS will be able to access the memory in the pseudo-locked region since
+  the region continues to serve cache hits.
+- The contiguous region of memory loaded into the cache is exposed to
+  user-space as a character device.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+“locked” data from cache. Power management C-states may shrink or
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. A sanity check
+within the code will not allow an application to map pseudo-locked memory
+unless it runs with affinity to cores associated with the cache on which the
+pseudo-locked region resides. The sanity check is only done during the
+initial mmap() handling, there is no enforcement afterwards and the
+application self needs to ensure it remains affine to the correct cores.
+
+Pseudo-locking is accomplished in two stages:
+
+1) During the first stage the system administrator allocates a portion
+   of cache that should be dedicated to pseudo-locking. At this time an
+   equivalent portion of memory is allocated, loaded into allocated
+   cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+   pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+A pseudo-locked region is created using the resctrl interface as follows:
+
+1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
+2) Change the new resource group's mode to "pseudo-locksetup" by writing
+   "pseudo-locksetup" to the "mode" file.
+3) Write the schemata of the pseudo-locked region to the "schemata" file. All
+   bits within the schemata should be "unused" according to the "bit_usage"
+   file.
+
+On successful pseudo-locked region creation the "mode" file will contain
+"pseudo-locked" and a new character device with the same name as the resource
+group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
+by user space in order to obtain access to the pseudo-locked memory region.
+
+An example of cache pseudo-locked region creation and usage can be found below.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled by default (if
+CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+   from these measurements are best visualized using a hist trigger (see
+   example below). In this test the pseudo-locked region is traversed at
+   a stride of 32 bytes while hardware prefetchers and preemption
+   are disabled. This also provides a substitute visualization of cache
+   hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+   available. Depending on the levels of cache on the system the pseudo_lock_l2
+   and pseudo_lock_l3 tracepoints are available.
+
+When a pseudo-locked region is created a new debugfs directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
+write-only file, pseudo_lock_measure, is present in this directory. The
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+
+1:
+     writing "1" to the pseudo_lock_measure file will trigger the latency
+     measurement captured in the pseudo_lock_mem_latency tracepoint. See
+     example below.
+2:
+     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l2 tracepoint. See example below.
+3:
+     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
+
+Example of latency debugging interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region and
+visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
+is set::
+
+  # :> /sys/kernel/tracing/trace
+  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+
+  # event histogram
+  #
+  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+  #
+
+  { latency:        456 } hitcount:          1
+  { latency:         50 } hitcount:         83
+  { latency:         36 } hitcount:         96
+  { latency:         44 } hitcount:        174
+  { latency:         48 } hitcount:        195
+  { latency:         46 } hitcount:        262
+  { latency:         42 } hitcount:        693
+  { latency:         40 } hitcount:       3204
+  { latency:         38 } hitcount:       3484
+
+  Totals:
+      Hits: 8192
+      Entries: 9
+    Dropped: 0
+
+Example of cache hits/misses debugging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+::
+
+  # :> /sys/kernel/tracing/trace
+  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+  # cat /sys/kernel/tracing/trace
+
+  # tracer: nop
+  #
+  #                              _-----=> irqs-off
+  #                             / _----=> need-resched
+  #                            | / _---=> hardirq/softirq
+  #                            || / _--=> preempt-depth
+  #                            ||| /     delay
+  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+  #              | |       |   ||||       |         |
+  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
+
+
+Examples for RDT allocation usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1) Example 1
+
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks, minimum b/w of 10% with a memory bandwidth
+granularity of 10%.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+  # mkdir p0 p1
+  # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
+  # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Similarly, tasks that are under the control of group "p0" may use a
+maximum memory b/w of 50% on socket0 and 50% on socket 1.
+Tasks in group "p1" may also use 50% memory b/w on both sockets.
+Note that unlike cache masks, memory b/w cannot specify whether these
+allocations can overlap or not. The allocations specifies the maximum
+b/w that the group may be able to use and the system admin can configure
+the b/w accordingly.
+
+If resctrl is using the software controller (mba_sc) then user can enter the
+max b/w in MB rather than the percentage values.
+::
+
+  # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+  # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
+of 1024MB where as on socket 1 they would use 500MB.
+
+2) Example 2
+
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
+ordinary tasks::
+
+  # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+::
+
+  # mkdir p0
+  # echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+::
+
+  # echo 1234 > p0/tasks
+  # taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache)::
+
+  # mkdir p1
+  # echo "L3:0=7c00;1=fffff" > p1/schemata
+  # echo 5678 > p1/tasks
+  # taskset -cp 2 5678
+
+For the same 2 socket system with memory b/w resource and CAT L3 the
+schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is
+10):
+
+For our first real time task this would request 20% memory b/w on socket 0.
+::
+
+  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+For our second real time task this would request an other 20% memory b/w
+on socket 0.
+::
+
+  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+3) Example 3
+
+A single socket system which has real-time tasks running on core 4-7 and
+non real-time workload assigned to core 0-3. The real-time tasks share text
+and data, so a per task association is not required and due to interaction
+with the kernel it's desired that the kernel on these cores shares L3 with
+the tasks.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
+cannot be used by ordinary tasks::
+
+  # echo "L3:0=3ff\nMB:0=50" > schemata
+
+Next we make a resource group for our real time cores and give it access
+to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
+socket 0.
+::
+
+  # mkdir p0
+  # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
+
+Finally we move core 4-7 over to the new group and make sure that the
+kernel and the tasks running there get 50% of the cache. They should
+also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
+siblings and only the real time threads are scheduled on the cores 4-7.
+::
+
+  # echo F0 > p0/cpus
+
+4) Example 4
+
+The resource groups in previous examples were all in the default "shareable"
+mode allowing sharing of their cache allocations. If one resource group
+configures a cache allocation then nothing prevents another resource group
+to overlap with that allocation.
+
+In this example a new exclusive resource group will be created on a L2 CAT
+system with two L2 cache instances that can be configured with an 8-bit
+capacity bitmask. The new exclusive resource group will be configured to use
+25% of each cache instance.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl/
+  # cd /sys/fs/resctrl
+
+First, we observe that the default group is configured to allocate to all L2
+cache::
+
+  # cat schemata
+  L2:0=ff;1=ff
+
+We could attempt to create the new resource group at this point, but it will
+fail because of the overlap with the schemata of the default group::
+
+  # mkdir p0
+  # echo 'L2:0=0x3;1=0x3' > p0/schemata
+  # cat p0/mode
+  shareable
+  # echo exclusive > p0/mode
+  -sh: echo: write error: Invalid argument
+  # cat info/last_cmd_status
+  schemata overlaps
+
+To ensure that there is no overlap with another resource group the default
+resource group's schemata has to change, making it possible for the new
+resource group to become exclusive.
+::
+
+  # echo 'L2:0=0xfc;1=0xfc' > schemata
+  # echo exclusive > p0/mode
+  # grep . p0/*
+  p0/cpus:0
+  p0/mode:exclusive
+  p0/schemata:L2:0=03;1=03
+  p0/size:L2:0=262144;1=262144
+
+A new resource group will on creation not overlap with an exclusive resource
+group::
+
+  # mkdir p1
+  # grep . p1/*
+  p1/cpus:0
+  p1/mode:shareable
+  p1/schemata:L2:0=fc;1=fc
+  p1/size:L2:0=786432;1=786432
+
+The bit_usage will reflect how the cache is used::
+
+  # cat info/L2/bit_usage
+  0=SSSSSSEE;1=SSSSSSEE
+
+A resource group cannot be forced to overlap with an exclusive resource group::
+
+  # echo 'L2:0=0x1;1=0x1' > p1/schemata
+  -sh: echo: write error: Invalid argument
+  # cat info/last_cmd_status
+  overlaps with exclusive group
+
+Example of Cache Pseudo-Locking
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
+region is exposed at /dev/pseudo_lock/newlock that can be provided to
+application for argument to mmap().
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl/
+  # cd /sys/fs/resctrl
+
+Ensure that there are bits available that can be pseudo-locked, since only
+unused bits can be pseudo-locked the bits to be pseudo-locked needs to be
+removed from the default resource group's schemata::
+
+  # cat info/L2/bit_usage
+  0=SSSSSSSS;1=SSSSSSSS
+  # echo 'L2:1=0xfc' > schemata
+  # cat info/L2/bit_usage
+  0=SSSSSSSS;1=SSSSSS00
+
+Create a new resource group that will be associated with the pseudo-locked
+region, indicate that it will be used for a pseudo-locked region, and
+configure the requested pseudo-locked region capacity bitmask::
+
+  # mkdir newlock
+  # echo pseudo-locksetup > newlock/mode
+  # echo 'L2:1=0x3' > newlock/schemata
+
+On success the resource group's mode will change to pseudo-locked, the
+bit_usage will reflect the pseudo-locked region, and the character device
+exposing the pseudo-locked region will exist::
+
+  # cat newlock/mode
+  pseudo-locked
+  # cat info/L2/bit_usage
+  0=SSSSSSSS;1=SSSSSSPP
+  # ls -l /dev/pseudo_lock/newlock
+  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
+
+::
+
+  /*
+  * Example code to access one page of pseudo-locked cache region
+  * from user space.
+  */
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <sched.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <unistd.h>
+  #include <sys/mman.h>
+
+  /*
+  * It is required that the application runs with affinity to only
+  * cores associated with the pseudo-locked region. Here the cpu
+  * is hardcoded for convenience of example.
+  */
+  static int cpuid = 2;
+
+  int main(int argc, char *argv[])
+  {
+    cpu_set_t cpuset;
+    long page_size;
+    void *mapping;
+    int dev_fd;
+    int ret;
+
+    page_size = sysconf(_SC_PAGESIZE);
+
+    CPU_ZERO(&cpuset);
+    CPU_SET(cpuid, &cpuset);
+    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+    if (ret < 0) {
+      perror("sched_setaffinity");
+      exit(EXIT_FAILURE);
+    }
+
+    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+    if (dev_fd < 0) {
+      perror("open");
+      exit(EXIT_FAILURE);
+    }
+
+    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+            dev_fd, 0);
+    if (mapping == MAP_FAILED) {
+      perror("mmap");
+      close(dev_fd);
+      exit(EXIT_FAILURE);
+    }
+
+    /* Application interacts with pseudo-locked memory @mapping */
+
+    ret = munmap(mapping, page_size);
+    if (ret < 0) {
+      perror("munmap");
+      close(dev_fd);
+      exit(EXIT_FAILURE);
+    }
+
+    close(dev_fd);
+    exit(EXIT_SUCCESS);
+  }
+
+Locking between applications
+----------------------------
+
+Certain operations on the resctrl filesystem, composed of read/writes
+to/from multiple files, must be atomic.
+
+As an example, the allocation of an exclusive reservation of L3 cache
+involves:
+
+  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
+  2. Find a contiguous set of bits in the global CBM bitmask that is clear
+     in any of the directory cbmmasks
+  3. Create a new directory
+  4. Set the bits found in step 2 to the new directory "schemata" file
+
+If two applications attempt to allocate space concurrently then they can
+end up allocating the same bits so the reservations are shared instead of
+exclusive.
+
+To coordinate atomic operations on the resctrlfs and to avoid the problem
+above, the following locking procedure is recommended:
+
+Locking is based on flock, which is available in libc and also as a shell
+script command
+
+Write lock:
+
+ A) Take flock(LOCK_EX) on /sys/fs/resctrl
+ B) Read/write the directory structure.
+ C) funlock
+
+Read lock:
+
+ A) Take flock(LOCK_SH) on /sys/fs/resctrl
+ B) If success read the directory structure.
+ C) funlock
+
+Example with bash::
+
+  # Atomically read directory structure
+  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
+
+  # Read directory contents and create new subdirectory
+
+  $ cat create-dir.sh
+  find /sys/fs/resctrl/ > output.txt
+  mask = function-of(output.txt)
+  mkdir /sys/fs/resctrl/newres/
+  echo mask > /sys/fs/resctrl/newres/schemata
+
+  $ flock /sys/fs/resctrl/ ./create-dir.sh
+
+Example with C::
+
+  /*
+  * Example code do take advisory locks
+  * before accessing resctrl filesystem
+  */
+  #include <sys/file.h>
+  #include <stdlib.h>
+
+  void resctrl_take_shared_lock(int fd)
+  {
+    int ret;
+
+    /* take shared lock on resctrl filesystem */
+    ret = flock(fd, LOCK_SH);
+    if (ret) {
+      perror("flock");
+      exit(-1);
+    }
+  }
+
+  void resctrl_take_exclusive_lock(int fd)
+  {
+    int ret;
+
+    /* release lock on resctrl filesystem */
+    ret = flock(fd, LOCK_EX);
+    if (ret) {
+      perror("flock");
+      exit(-1);
+    }
+  }
+
+  void resctrl_release_lock(int fd)
+  {
+    int ret;
+
+    /* take shared lock on resctrl filesystem */
+    ret = flock(fd, LOCK_UN);
+    if (ret) {
+      perror("flock");
+      exit(-1);
+    }
+  }
+
+  void main(void)
+  {
+    int fd, ret;
+
+    fd = open("/sys/fs/resctrl", O_DIRECTORY);
+    if (fd == -1) {
+      perror("open");
+      exit(-1);
+    }
+    resctrl_take_shared_lock(fd);
+    /* code to read directory contents */
+    resctrl_release_lock(fd);
+
+    resctrl_take_exclusive_lock(fd);
+    /* code to read and write directory contents */
+    resctrl_release_lock(fd);
+  }
+
+Examples for RDT Monitoring along with allocation usage
+=======================================================
+Reading monitored data
+----------------------
+Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
+show the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+------------------------------------------------------------------------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+  # mkdir p0 p1
+  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+  # echo 5678 > p1/tasks
+  # echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+::
+
+  # cd /sys/fs/resctrl/p1/mon_groups
+  # mkdir m11 m12
+  # echo 5678 > m11/tasks
+  # echo 5679 > m12/tasks
+
+fetch data (data shown in bytes)
+::
+
+  # cat m11/mon_data/mon_L3_00/llc_occupancy
+  16234000
+  # cat m11/mon_data/mon_L3_01/llc_occupancy
+  14789000
+  # cat m12/mon_data/mon_L3_00/llc_occupancy
+  16789000
+
+The parent ctrl_mon group shows the aggregated data.
+::
+
+  # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+  31234000
+
+Example 2 (Monitor a task from its creation)
+--------------------------------------------
+On a two socket machine (one L3 cache per socket)::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+  # mkdir p0 p1
+
+An RMID is allocated to the group once its created and hence the <cmd>
+below is monitored from its creation.
+::
+
+  # echo $$ > /sys/fs/resctrl/p1/tasks
+  # <cmd>
+
+Fetch the data::
+
+  # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+  31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------------------------------------------------------------------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+the resctrl will still mount but cannot create CTRL_MON directories.
+But user can create different MON groups within the root group thereby
+able to monitor all tasks including kernel threads.
+
+This can also be used to profile jobs cache size footprint before being
+able to allocate them to different allocation groups.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+  # mkdir mon_groups/m01
+  # mkdir mon_groups/m02
+
+  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+below its apparent that the tasks are mostly doing work on
+domain(socket) 0.
+::
+
+  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
+  31234000
+  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
+  34555
+  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
+  31234000
+  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
+  32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+::
+
+  # mount -t resctrl resctrl /sys/fs/resctrl
+  # cd /sys/fs/resctrl
+  # mkdir p1
+
+Move the cpus 4-7 over to p1::
+
+  # echo f0 > p1/cpus
+
+View the llc occupancy snapshot::
+
+  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+  11234000
+
+Intel RDT Errata
+================
+
+Intel MBM Counters May Report System Memory Bandwidth Incorrectly
+-----------------------------------------------------------------
+
+Errata SKX99 for Skylake server and BDF102 for Broadwell server.
+
+Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
+according to the assigned Resource Monitor ID (RMID) for that logical
+core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
+metrics, may report incorrect system bandwidth for certain RMID values.
+
+Implication: Due to the errata, system memory bandwidth may not match
+what is reported.
+
+Workaround: MBM total and local readings are corrected according to the
+following correction factor table:
+
++---------------+---------------+---------------+-----------------+
+|core count	|rmid count	|rmid threshold	|correction factor|
++---------------+---------------+---------------+-----------------+
+|1		|8		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|2		|16		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|3		|24		|15		|0.969650	  |
++---------------+---------------+---------------+-----------------+
+|4		|32		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|6		|48		|31		|0.969650	  |
++---------------+---------------+---------------+-----------------+
+|7		|56		|47		|1.142857	  |
++---------------+---------------+---------------+-----------------+
+|8		|64		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|9		|72		|63		|1.185115	  |
++---------------+---------------+---------------+-----------------+
+|10		|80		|63		|1.066553	  |
++---------------+---------------+---------------+-----------------+
+|11		|88		|79		|1.454545	  |
++---------------+---------------+---------------+-----------------+
+|12		|96		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|13		|104		|95		|1.230769	  |
++---------------+---------------+---------------+-----------------+
+|14		|112		|95		|1.142857	  |
++---------------+---------------+---------------+-----------------+
+|15		|120		|95		|1.066667	  |
++---------------+---------------+---------------+-----------------+
+|16		|128		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|17		|136		|127		|1.254863	  |
++---------------+---------------+---------------+-----------------+
+|18		|144		|127		|1.185255	  |
++---------------+---------------+---------------+-----------------+
+|19		|152		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|20		|160		|127		|1.066667	  |
++---------------+---------------+---------------+-----------------+
+|21		|168		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|22		|176		|159		|1.454334	  |
++---------------+---------------+---------------+-----------------+
+|23		|184		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|24		|192		|127		|0.969744	  |
++---------------+---------------+---------------+-----------------+
+|25		|200		|191		|1.280246	  |
++---------------+---------------+---------------+-----------------+
+|26		|208		|191		|1.230921	  |
++---------------+---------------+---------------+-----------------+
+|27		|216		|0		|1.000000	  |
++---------------+---------------+---------------+-----------------+
+|28		|224		|191		|1.143118	  |
++---------------+---------------+---------------+-----------------+
+
+If rmid > rmid threshold, MBM total and local values should be multiplied
+by the correction factor.
+
+See:
+
+1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
+http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
+
+2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
+http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
+
+3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
+https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
+
+for further information.
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
new file mode 100644
index 000000000000..2bcbffacbed5
--- /dev/null
+++ b/Documentation/arch/x86/sgx.rst
@@ -0,0 +1,302 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Software Guard eXtensions (SGX)
+===============================
+
+Overview
+========
+
+Software Guard eXtensions (SGX) hardware enables for user space applications
+to set aside private memory regions of code and data:
+
+* Privileged (ring-0) ENCLS functions orchestrate the construction of the
+  regions.
+* Unprivileged (ring-3) ENCLU functions allow an application to enter and
+  execute inside the regions.
+
+These memory regions are called enclaves. An enclave can be only entered at a
+fixed set of entry points. Each entry point can hold a single hardware thread
+at a time.  While the enclave is loaded from a regular binary file by using
+ENCLS functions, only the threads inside the enclave can access its memory. The
+region is denied from outside access by the CPU, and encrypted before it leaves
+from LLC.
+
+The support can be determined by
+
+	``grep sgx /proc/cpuinfo``
+
+SGX must both be supported in the processor and enabled by the BIOS.  If SGX
+appears to be unsupported on a system which has hardware support, ensure
+support is enabled in the BIOS.  If a BIOS presents a choice between "Enabled"
+and "Software Enabled" modes for SGX, choose "Enabled".
+
+Enclave Page Cache
+==================
+
+SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
+with an enclave. It is contained in a BIOS-reserved region of physical memory.
+Unlike pages used for regular memory, pages can only be accessed from outside of
+the enclave during enclave construction with special, limited SGX instructions.
+
+Only a CPU executing inside an enclave can directly access enclave memory.
+However, a CPU executing inside an enclave may access normal memory outside the
+enclave.
+
+The kernel manages enclave memory similar to how it treats device memory.
+
+Enclave Page Types
+------------------
+
+**SGX Enclave Control Structure (SECS)**
+   Enclave's address range, attributes and other global data are defined
+   by this structure.
+
+**Regular (REG)**
+   Regular EPC pages contain the code and data of an enclave.
+
+**Thread Control Structure (TCS)**
+   Thread Control Structure pages define the entry points to an enclave and
+   track the execution state of an enclave thread.
+
+**Version Array (VA)**
+   Version Array pages contain 512 slots, each of which can contain a version
+   number for a page evicted from the EPC.
+
+Enclave Page Cache Map
+----------------------
+
+The processor tracks EPC pages in a hardware metadata structure called the
+*Enclave Page Cache Map (EPCM)*.  The EPCM contains an entry for each EPC page
+which describes the owning enclave, access rights and page type among the other
+things.
+
+EPCM permissions are separate from the normal page tables.  This prevents the
+kernel from, for instance, allowing writes to data which an enclave wishes to
+remain read-only.  EPCM permissions may only impose additional restrictions on
+top of normal x86 page permissions.
+
+For all intents and purposes, the SGX architecture allows the processor to
+invalidate all EPCM entries at will.  This requires that software be prepared to
+handle an EPCM fault at any time.  In practice, this can happen on events like
+power transitions when the ephemeral key that encrypts enclave memory is lost.
+
+Application interface
+=====================
+
+Enclave build functions
+-----------------------
+
+In addition to the traditional compiler and linker build process, SGX has a
+separate enclave “build” process.  Enclaves must be built before they can be
+executed (entered). The first step in building an enclave is opening the
+**/dev/sgx_enclave** device.  Since enclave memory is protected from direct
+access, special privileged instructions are then used to copy data into enclave
+pages and establish enclave page permissions.
+
+.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
+   :functions: sgx_ioc_enclave_create
+               sgx_ioc_enclave_add_pages
+               sgx_ioc_enclave_init
+               sgx_ioc_enclave_provision
+
+Enclave runtime management
+--------------------------
+
+Systems supporting SGX2 additionally support changes to initialized
+enclaves: modifying enclave page permissions and type, and dynamically
+adding and removing of enclave pages. When an enclave accesses an address
+within its address range that does not have a backing page then a new
+regular page will be dynamically added to the enclave. The enclave is
+still required to run EACCEPT on the new page before it can be used.
+
+.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
+   :functions: sgx_ioc_enclave_restrict_permissions
+               sgx_ioc_enclave_modify_types
+               sgx_ioc_enclave_remove_pages
+
+Enclave vDSO
+------------
+
+Entering an enclave can only be done through SGX-specific EENTER and ERESUME
+functions, and is a non-trivial process.  Because of the complexity of
+transitioning to and from an enclave, enclaves typically utilize a library to
+handle the actual transitions.  This is roughly analogous to how glibc
+implementations are used by most applications to wrap system calls.
+
+Another crucial characteristic of enclaves is that they can generate exceptions
+as part of their normal operation that need to be handled in the enclave or are
+unique to SGX.
+
+Instead of the traditional signal mechanism to handle these exceptions, SGX
+can leverage special exception fixup provided by the vDSO.  The kernel-provided
+vDSO function wraps low-level transitions to/from the enclave like EENTER and
+ERESUME.  The vDSO function intercepts exceptions that would otherwise generate
+a signal and return the fault information directly to its caller.  This avoids
+the need to juggle signal handlers.
+
+.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
+   :functions: vdso_sgx_enter_enclave_t
+
+ksgxd
+=====
+
+SGX support includes a kernel thread called *ksgxd*.
+
+EPC sanitization
+----------------
+
+ksgxd is started when SGX initializes.  Enclave memory is typically ready
+for use when the processor powers on or resets.  However, if SGX has been in
+use since the reset, enclave pages may be in an inconsistent state.  This might
+occur after a crash and kexec() cycle, for instance.  At boot, ksgxd
+reinitializes all enclave pages so that they can be allocated and re-used.
+
+The sanitization is done by going through EPC address space and applying the
+EREMOVE function to each physical page. Some enclave pages like SECS pages have
+hardware dependencies on other pages which prevents EREMOVE from functioning.
+Executing two EREMOVE passes removes the dependencies.
+
+Page reclaimer
+--------------
+
+Similar to the core kswapd, ksgxd, is responsible for managing the
+overcommitment of enclave memory.  If the system runs out of enclave memory,
+*ksgxd* “swaps” enclave memory to normal memory.
+
+Launch Control
+==============
+
+SGX provides a launch control mechanism. After all enclave pages have been
+copied, kernel executes EINIT function, which initializes the enclave. Only after
+this the CPU can execute inside the enclave.
+
+EINIT function takes an RSA-3072 signature of the enclave measurement.  The function
+checks that the measurement is correct and signature is signed with the key
+hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the
+SHA256 of a public key.
+
+Those MSRs can be configured by the BIOS to be either readable or writable.
+Linux supports only writable configuration in order to give full control to the
+kernel on launch control policy. Before calling EINIT function, the driver sets
+the MSRs to match the enclave's signing key.
+
+Encryption engines
+==================
+
+In order to conceal the enclave data while it is out of the CPU package, the
+memory controller has an encryption engine to transparently encrypt and decrypt
+enclave memory.
+
+In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
+encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in
+SRAM to maintain integrity of the encrypted data. This provides integrity and
+anti-replay protection but does not scale to large memory sizes because the time
+required to update the Merkle tree grows logarithmically in relation to the
+memory size.
+
+CPUs starting from Icelake use Total Memory Encryption (TME) in the place of
+MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
+means integrity and replay-attacks are not mitigated.  B, it includes
+additional changes to prevent cipher text from being returned and SW memory
+aliases from being created.
+
+DMA to enclave memory is blocked by range registers on both MEE and TME systems
+(SDM section 41.10).
+
+Usage Models
+============
+
+Shared Library
+--------------
+
+Sensitive data and the code that acts on it is partitioned from the application
+into a separate library. The library is then linked as a DSO which can be loaded
+into an enclave. The application can then make individual function calls into
+the enclave through special SGX instructions. A run-time within the enclave is
+configured to marshal function parameters into and out of the enclave and to
+call the correct library function.
+
+Application Container
+---------------------
+
+An application may be loaded into a container enclave which is specially
+configured with a library OS and run-time which permits the application to run.
+The enclave run-time and library OS work together to execute the application
+when a thread enters the enclave.
+
+Impact of Potential Kernel SGX Bugs
+===================================
+
+EPC leaks
+---------
+
+When EPC page leaks happen, a WARNING like this is shown in dmesg:
+
+"EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."
+
+This is effectively a kernel use-after-free of an EPC page, and due
+to the way SGX works, the bug is detected at freeing. Rather than
+adding the page back to the pool of available EPC pages, the kernel
+intentionally leaks the page to avoid additional errors in the future.
+
+When this happens, the kernel will likely soon leak more EPC pages, and
+SGX will likely become unusable because the memory available to SGX is
+limited. However, while this may be fatal to SGX, the rest of the kernel
+is unlikely to be impacted and should continue to work.
+
+As a result, when this happpens, user should stop running any new
+SGX workloads, (or just any new workloads), and migrate all valuable
+workloads. Although a machine reboot can recover all EPC memory, the bug
+should be reported to Linux developers.
+
+
+Virtual EPC
+===========
+
+The implementation has also a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by taking out
+total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.
+
+Architectural behavior is to restore all EPC pages to an uninitialized
+state also after a guest reboot.  Because this state can be reached only
+through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
+provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
+on all pages in the virtual EPC.
+
+``EREMOVE`` can fail for three reasons.  Userspace must pay attention
+to expected failures and handle them as follows:
+
+1. Page removal will always fail when any thread is running in the
+   enclave to which the page belongs.  In this case the ioctl will
+   return ``EBUSY`` independent of whether it has successfully removed
+   some pages; userspace can avoid these failures by preventing execution
+   of any vcpu which maps the virtual EPC.
+
+2. Page removal will cause a general protection fault if two calls to
+   ``EREMOVE`` happen concurrently for pages that refer to the same
+   "SECS" metadata pages.  This can happen if there are concurrent
+   invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
+   file descriptor in the guest is closed at the same time as
+   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
+   This can be avoided in userspace by serializing calls to the ioctl()
+   and to close(), but in general it should not be a problem.
+
+3. Finally, page removal will fail for SECS metadata pages which still
+   have child pages.  Child pages can be removed by executing
+   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
+   mapped into the guest.  This means that the ioctl() must be called
+   twice: an initial set of calls to remove child pages and a subsequent
+   set of calls to remove SECS pages.  The second set of calls is only
+   required for those mappings that returned a nonzero value from the
+   first call.  It indicates a bug in the kernel or the userspace client
+   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
+   a return code other than 0.
diff --git a/Documentation/arch/x86/sva.rst b/Documentation/arch/x86/sva.rst
new file mode 100644
index 000000000000..33cb05005982
--- /dev/null
+++ b/Documentation/arch/x86/sva.rst
@@ -0,0 +1,286 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Shared Virtual Addressing (SVA) with ENQCMD
+===========================================
+
+Background
+==========
+
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses avoiding the need for software to translate virtual
+addresses to physical addresses. SVA is what PCIe calls Shared Virtual
+Memory (SVM).
+
+In addition to the convenience of using application virtual addresses
+by the device, it also doesn't require pinning pages for DMA.
+PCIe Address Translation Services (ATS) along with Page Request Interface
+(PRI) allow devices to function much the same way as the CPU handling
+application page-faults. For more information please refer to the PCIe
+specification Chapter 10: ATS Specification.
+
+Use of SVA requires IOMMU support in the platform. IOMMU is also
+required to support the PCIe features ATS and PRI. ATS allows devices
+to cache translations for virtual addresses. The IOMMU driver uses the
+mmu_notifier() support to keep the device TLB cache and the CPU cache in
+sync. When an ATS lookup fails for a virtual address, the device should
+use the PRI in order to request the virtual address to be paged into the
+CPU page tables. The device must use ATS again in order the fetch the
+translation before use.
+
+Shared Hardware Workqueues
+==========================
+
+Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
+the use of Shared Work Queues (SWQ) by both applications and Virtual
+Machines (VM's). This allows better hardware utilization vs. hard
+partitioning resources that could result in under utilization. In order to
+allow the hardware to distinguish the context for which work is being
+executed in the hardware by SWQ interface, SIOV uses Process Address Space
+ID (PASID), which is a 20-bit number defined by the PCIe SIG.
+
+PASID value is encoded in all transactions from the device. This allows the
+IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
+Resource Identifier (RID) which is the Bus/Device/Function.
+
+
+ENQCMD
+======
+
+ENQCMD is a new instruction on Intel platforms that atomically submits a
+work descriptor to a device. The descriptor includes the operation to be
+performed, virtual addresses of all parameters, virtual address of a completion
+record, and the PASID (process address space ID) of the current process.
+
+ENQCMD works with non-posted semantics and carries a status back if the
+command was accepted by hardware. This allows the submitter to know if the
+submission needs to be retried or other device specific mechanisms to
+implement fairness or ensure forward progress should be provided.
+
+ENQCMD is the glue that ensures applications can directly submit commands
+to the hardware and also permits hardware to be aware of application context
+to perform I/O operations via use of PASID.
+
+Process Address Space Tagging
+=============================
+
+A new thread-scoped MSR (IA32_PASID) provides the connection between
+user processes and the rest of the hardware. When an application first
+accesses an SVA-capable device, this MSR is initialized with a newly
+allocated PASID. The driver for the device calls an IOMMU-specific API
+that sets up the routing for DMA and page-requests.
+
+For example, the Intel Data Streaming Accelerator (DSA) uses
+iommu_sva_bind_device(), which will do the following:
+
+- Allocate the PASID, and program the process page-table (%cr3 register) in the
+  PASID context entries.
+- Register for mmu_notifier() to track any page-table invalidations to keep
+  the device TLB in sync. For example, when a page-table entry is invalidated,
+  the IOMMU propagates the invalidation to the device TLB. This will force any
+  future access by the device to this virtual address to participate in
+  ATS. If the IOMMU responds with proper response that a page is not
+  present, the device would request the page to be paged in via the PCIe PRI
+  protocol before performing I/O.
+
+This MSR is managed with the XSAVE feature set as "supervisor state" to
+ensure the MSR is updated during context switch.
+
+PASID Management
+================
+
+The kernel must allocate a PASID on behalf of each process which will use
+ENQCMD and program it into the new MSR to communicate the process identity to
+platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
+from this process.  When a user submits a work descriptor to a device using the
+ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
+value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
+with the same PASID. The platform IOMMU uses the PASID in the transaction to
+perform address translation. The IOMMU APIs setup the corresponding PASID
+entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
+x86).
+
+The MSR must be configured on each logical CPU before any application
+thread can interact with a device. Threads that belong to the same
+process share the same page tables, thus the same MSR value.
+
+PASID Life Cycle Management
+===========================
+
+PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created.
+
+Only processes that access SVA-capable devices need to have a PASID
+allocated. This allocation happens when a process opens/binds an SVA-capable
+device but finds no PASID for this process. Subsequent binds of the same, or
+other devices will share the same PASID.
+
+Although the PASID is allocated to the process by opening a device,
+it is not active in any of the threads of that process. It's loaded to the
+IA32_PASID MSR lazily when a thread tries to submit a work descriptor
+to a device using the ENQCMD.
+
+That first access will trigger a #GP fault because the IA32_PASID MSR
+has not been initialized with the PASID value assigned to the process
+when the device was opened. The Linux #GP handler notes that a PASID has
+been allocated for the process, and so initializes the IA32_PASID MSR
+and returns so that the ENQCMD instruction is re-executed.
+
+On fork(2) or exec(2) the PASID is removed from the process as it no
+longer has the same address space that it had when the device was opened.
+
+On clone(2) the new task shares the same address space, so will be
+able to use the PASID allocated to the process. The IA32_PASID is not
+preemptively initialized as the PASID value might not be allocated yet or
+the kernel does not know whether this thread is going to access the device
+and the cleared IA32_PASID MSR reduces context switch overhead by xstate
+init optimization. Since #GP faults have to be handled on any threads that
+were created before the PASID was assigned to the mm of the process, newly
+created threads might as well be treated in a consistent way.
+
+Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in
+all threads in unbind, free the PASID lazily only on mm exit.
+
+If a process does a close(2) of the device file descriptor and munmap(2)
+of the device MMIO portal, then the driver will unbind the device. The
+PASID is still marked VALID in the PASID_MSR for any threads in the
+process that accessed the device. But this is harmless as without the
+MMIO portal they cannot submit new work to the device.
+
+Relationships
+=============
+
+ * Each process has many threads, but only one PASID.
+ * Devices have a limited number (~10's to 1000's) of hardware workqueues.
+   The device driver manages allocating hardware workqueues.
+ * A single mmap() maps a single hardware workqueue as a "portal" and
+   each portal maps down to a single workqueue.
+ * For each device with which a process interacts, there must be
+   one or more mmap()'d portals.
+ * Many threads within a process can share a single portal to access
+   a single device.
+ * Multiple processes can separately mmap() the same portal, in
+   which case they still share one device hardware workqueue.
+ * The single process-wide PASID is used by all threads to interact
+   with all devices.  There is not, for instance, a PASID for each
+   thread or each thread<->device pair.
+
+FAQ
+===
+
+* What is SVA/SVM?
+
+Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
+work in the same address space, i.e., to share it. Some call it Shared
+Virtual Memory (SVM), but Linux community wanted to avoid confusing it with
+POSIX Shared Memory and Secure Virtual Machines which were terms already in
+circulation.
+
+* What is a PASID?
+
+A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
+(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
+PASID is included in all transactions between the platform and the device.
+
+* How are shared workqueues different?
+
+Traditionally, in order for userspace applications to interact with hardware,
+there is a separate hardware instance required per process. For example,
+consider doorbells as a mechanism of informing hardware about work to process.
+Each doorbell is required to be spaced 4k (or page-size) apart for process
+isolation. This requires hardware to provision that space and reserve it in
+MMIO. This doesn't scale as the number of threads becomes quite large. The
+hardware also manages the queue depth for Shared Work Queues (SWQ), and
+consumers don't need to track queue depth. If there is no space to accept
+a command, the device will return an error indicating retry.
+
+A user should check Deferrable Memory Write (DMWr) capability on the device
+and only submits ENQCMD when the device supports it. In the new DMWr PCIe
+terminology, devices need to support DMWr completer capability. In addition,
+it requires all switch ports to support DMWr routing and must be enabled by
+the PCIe subsystem, much like how PCIe atomic operations are managed for
+instance.
+
+SWQ allows hardware to provision just a single address in the device. When
+used with ENQCMD to submit work, the device can distinguish the process
+submitting the work since it will include the PASID assigned to that
+process. This helps the device scale to a large number of processes.
+
+* Is this the same as a user space device driver?
+
+Communicating with the device via the shared workqueue is much simpler
+than a full blown user space driver. The kernel driver does all the
+initialization of the hardware. User space only needs to worry about
+submitting work and processing completions.
+
+* Is this the same as SR-IOV?
+
+Single Root I/O Virtualization (SR-IOV) focuses on providing independent
+hardware interfaces for virtualizing hardware. Hence, it's required to be
+almost fully functional interface to software supporting the traditional
+BARs, space for interrupts via MSI-X, its own register layout.
+Virtual Functions (VFs) are assisted by the Physical Function (PF)
+driver.
+
+Scalable I/O Virtualization builds on the PASID concept to create device
+instances for virtualization. SIOV requires host software to assist in
+creating virtual devices; each virtual device is represented by a PASID
+along with the bus/device/function of the device.  This allows device
+hardware to optimize device resource creation and can grow dynamically on
+demand. SR-IOV creation and management is very static in nature. Consult
+references below for more details.
+
+* Why not just create a virtual function for each app?
+
+Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
+duplicated hardware for PCI config space and interrupts such as MSI-X.
+Resources such as interrupts have to be hard partitioned between VFs at
+creation time, and cannot scale dynamically on demand. The VFs are not
+completely independent from the Physical Function (PF). Most VFs require
+some communication and assistance from the PF driver. SIOV, in contrast,
+creates a software-defined device where all the configuration and control
+aspects are mediated via the slow path. The work submission and completion
+happen without any mediation.
+
+* Does this support virtualization?
+
+ENQCMD can be used from within a guest VM. In these cases, the VMM helps
+with setting up a translation table to translate from Guest PASID to Host
+PASID. Please consult the ENQCMD instruction set reference for more
+details.
+
+* Does memory need to be pinned?
+
+When devices support SVA along with platform hardware such as IOMMU
+supporting such devices, there is no need to pin memory for DMA purposes.
+Devices that support SVA also support other PCIe features that remove the
+pinning requirement for memory.
+
+Device TLB support - Device requests the IOMMU to lookup an address before
+use via Address Translation Service (ATS) requests.  If the mapping exists
+but there is no page allocated by the OS, IOMMU hardware returns that no
+mapping exists.
+
+Device requests the virtual address to be mapped via Page Request
+Interface (PRI). Once the OS has successfully completed the mapping, it
+returns the response back to the device. The device requests again for
+a translation and continues.
+
+IOMMU works with the OS in managing consistency of page-tables with the
+device. When removing pages, it interacts with the device to remove any
+device TLB entry that might have been cached before removing the mappings from
+the OS.
+
+References
+==========
+
+VT-D:
+https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d
+
+SIOV:
+https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
+
+ENQCMD in ISE:
+https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+DSA spec:
+https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
new file mode 100644
index 000000000000..dc8d9fd2c3f7
--- /dev/null
+++ b/Documentation/arch/x86/tdx.rst
@@ -0,0 +1,261 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
+the host and physical attacks by isolating the guest register state and by
+encrypting the guest memory. In TDX, a special module running in a special
+mode sits between the host and the guest and manages the guest/host
+separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor must be moved into the guest. This is
+implemented using a Virtualization Exception (#VE) that is handled by the
+guest kernel. A #VE is handled entirely inside the guest kernel, but some
+require the hypervisor to be consulted.
+
+TDX includes new hypercall-like mechanisms for communicating from the
+guest to the hypervisor or the TDX module.
+
+New TDX Exceptions
+==================
+
+TDX guests behave differently from bare-metal and traditional VMX guests.
+In TDX guests, otherwise normal instructions or memory accesses can cause
+#VE or #GP exceptions.
+
+Instructions marked with an '*' conditionally cause exceptions.  The
+details for these instructions are discussed below.
+
+Instruction-based #VE
+---------------------
+
+- Port I/O (INS, OUTS, IN, OUT)
+- HLT
+- MONITOR, MWAIT
+- WBINVD, INVD
+- VMCALL
+- RDMSR*,WRMSR*
+- CPUID*
+
+Instruction-based #GP
+---------------------
+
+- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+- ENCLS, ENCLU
+- GETSEC
+- RSM
+- ENQCMD
+- RDMSR*,WRMSR*
+
+RDMSR/WRMSR Behavior
+--------------------
+
+MSR access behavior falls into three categories:
+
+- #GP generated
+- #VE generated
+- "Just works"
+
+In general, the #GP MSRs should not be used in guests.  Their use likely
+indicates a bug in the guest.  The guest may try to handle the #GP with a
+hypercall but it is unlikely to succeed.
+
+The #VE MSRs are typically able to be handled by the hypervisor.  Guests
+can make a hypercall to the hypervisor to handle the #VE.
+
+The "just works" MSRs do not need any special guest handling.  They might
+be implemented by directly passing through the MSR to the hardware or by
+trapping and handling in the TDX module.  Other than possibly being slow,
+these MSRs appear to function just as they would on bare metal.
+
+CPUID Behavior
+--------------
+
+For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
+return values (in guest EAX/EBX/ECX/EDX) are configurable by the
+hypervisor. For such cases, the Intel TDX module architecture defines two
+virtualization types:
+
+- Bit fields for which the hypervisor controls the value seen by the guest
+  TD.
+
+- Bit fields for which the hypervisor configures the value such that the
+  guest TD either sees their native value or a value of 0.  For these bit
+  fields, the hypervisor can mask off the native values, but it can not
+  turn *on* values.
+
+A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
+not know how to handle. The guest kernel may ask the hypervisor for the
+value with a hypercall.
+
+#VE on Memory Accesses
+======================
+
+There are essentially two classes of TDX memory: private and shared.
+Private memory receives full TDX protections.  Its content is protected
+against access from the hypervisor.  Shared memory is expected to be
+shared between guest and hypervisor and does not receive full TDX
+protections.
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared.  It selects the behavior with a bit in its page table
+entries.  This helps ensure that a guest does not place sensitive
+information in shared memory, exposing it to the untrusted hypervisor.
+
+#VE on Shared Memory
+--------------------
+
+Access to shared mappings can cause a #VE.  The hypervisor ultimately
+controls whether a shared memory access causes a #VE, so the guest must be
+careful to only reference shared pages it can safely handle a #VE.  For
+instance, the guest should be careful not to access shared memory in the
+#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).
+
+Shared mapping content is entirely controlled by the hypervisor. The guest
+should only use shared mappings for communicating with the hypervisor.
+Shared mappings must never be used for sensitive memory content like kernel
+stacks.  A good rule of thumb is that hypervisor-shared memory should be
+treated the same as memory mapped to userspace.  Both the hypervisor and
+userspace are completely untrusted.
+
+MMIO for virtual devices is implemented as shared memory.  The guest must
+be careful not to access device MMIO regions unless it is also prepared to
+handle a #VE.
+
+#VE on Private Pages
+--------------------
+
+An access to private mappings can also cause a #VE.  Since all kernel
+memory is also private memory, the kernel might theoretically need to
+handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
+TDX guests ensure that all guest memory has been "accepted" before memory
+is used by the kernel.
+
+A modest amount of memory (typically 512M) is pre-accepted by the firmware
+before the kernel runs to ensure that the kernel can start up without
+being subjected to a #VE.
+
+The hypervisor is permitted to unilaterally move accepted pages to a
+"blocked" state. However, if it does this, page access will not generate a
+#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
+to handle the exception.
+
+Linux #VE handler
+=================
+
+Just like page faults or #GP's, #VE exceptions can be either handled or be
+fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
+An unhandled kernel #VE results in an oops.
+
+Handling nested exceptions on x86 is typically nasty business.  A #VE
+could be interrupted by an NMI which triggers another #VE and hilarity
+ensues.  The TDX #VE architecture anticipated this scenario and includes a
+feature to make it slightly less nasty.
+
+During #VE handling, the TDX module ensures that all interrupts (including
+NMIs) are blocked.  The block remains in place until the guest makes a
+TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
+or a new #VE can be delivered.
+
+However, the guest kernel must still be careful to avoid potential
+#VE-triggering actions (discussed above) while this block is in place.
+While the block is in place, any #VE is elevated to a double fault (#DF)
+which is not recoverable.
+
+MMIO handling
+=============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the hypervisor
+emulates the access.  That is not possible in TDX guests because VMEXIT
+will expose the register state to the host. TDX guests don't trust the host
+and can't have their state exposed to the host.
+
+In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
+guest #VE handler then emulates the MMIO instruction inside the guest and
+converts it into a controlled TDCALL to the host, rather than exposing
+guest state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can
+theoretically be accessed with any instruction that accesses memory.
+However, the kernel instruction decoding method is limited. It is only
+designed to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in an
+oops.
+
+Shared Memory Conversions
+=========================
+
+All TDX guest memory starts out as private at boot.  This memory can not
+be accessed by the hypervisor.  However, some kernel users like device
+drivers might have a need to share data with the hypervisor.  To do this,
+memory must be converted between shared and private.  This can be
+accomplished using some existing memory encryption helpers:
+
+ * set_memory_decrypted() converts a range of pages to shared.
+ * set_memory_encrypted() converts memory back to private.
+
+Device drivers are the primary user of shared memory, but there's no need
+to touch every driver. DMA buffers and ioremap() do the conversions
+automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocation, the DMA buffer gets converted on the
+allocation. Check force_dma_unencrypted() for details.
+
+Attestation
+===========
+
+Attestation is used to verify the TDX guest trustworthiness to other
+entities before provisioning secrets to the guest. For example, a key
+server may want to use attestation to verify that the guest is the
+desired one before releasing the encryption keys to mount the encrypted
+rootfs or a secondary drive.
+
+The TDX module records the state of the TDX guest in various stages of
+the guest boot process using the build time measurement register (MRTD)
+and runtime measurement registers (RTMR). Measurements related to the
+guest initial configuration and firmware image are recorded in the MRTD
+register. Measurements related to initial state, kernel image, firmware
+image, command line options, initrd, ACPI tables, etc are recorded in
+RTMR registers. For more details, as an example, please refer to TDX
+Virtual Firmware design specification, section titled "TD Measurement".
+At TDX guest runtime, the attestation process is used to attest to these
+measurements.
+
+The attestation process consists of two steps: TDREPORT generation and
+Quote generation.
+
+TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT (TDREPORT_STRUCT)
+from the TDX module. TDREPORT is a fixed-size data structure generated by
+the TDX module which contains guest-specific information (such as build
+and boot measurements), platform security version, and the MAC to protect
+the integrity of the TDREPORT. A user-provided 64-Byte REPORTDATA is used
+as input and included in the TDREPORT. Typically it can be some nonce
+provided by attestation service so the TDREPORT can be verified uniquely.
+More details about the TDREPORT can be found in Intel TDX Module
+specification, section titled "TDG.MR.REPORT Leaf".
+
+After getting the TDREPORT, the second step of the attestation process
+is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
+by design can only be verified on the local platform as the MAC key is
+bound to the platform. To support remote verification of the TDREPORT,
+TDX leverages Intel SGX Quoting Enclave to verify the TDREPORT locally
+and convert it to a remotely verifiable Quote. Method of sending TDREPORT
+to QE is implementation specific. Attestation software can choose
+whatever communication channel available (i.e. vsock or TCP/IP) to
+send the TDREPORT to QE and receive the Quote.
+
+References
+==========
+
+TDX reference material is collected here:
+
+https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
diff --git a/Documentation/arch/x86/tlb.rst b/Documentation/arch/x86/tlb.rst
new file mode 100644
index 000000000000..82ec58ae63a8
--- /dev/null
+++ b/Documentation/arch/x86/tlb.rst
@@ -0,0 +1,83 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+The TLB
+=======
+
+When the kernel unmaps or modified the attributes of a range of
+memory, it has two choices:
+
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to do depends on a few things:
+
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the individual flush will have ended up being wasted
+    work.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive an individual flush looks.  Data and
+    instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles.  If you believe that individual invalidations being
+called too often, you can lower the tunable::
+
+	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of the individual flushes.
+Setting it to 1 is a very conservative setting and it should
+never need to be 0 under normal circumstances.
+
+Despite the fact that a single individual flush on x86 is
+guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
+flushes.  THP is treated exactly the same as normal memory.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints. to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this::
+
+  perf stat -e
+    cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+    cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+    cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+    cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+    cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+    cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
+.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
+   says: "One execution of INVLPG is sufficient even for a page
+   with size greater than 4 KBytes."
diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst
new file mode 100644
index 000000000000..7f58010ea86a
--- /dev/null
+++ b/Documentation/arch/x86/topology.rst
@@ -0,0 +1,234 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+x86 Topology
+============
+
+This documents and clarifies the main aspects of x86 topology modelling and
+representation in the kernel. Update/change when doing changes to the
+respective code.
+
+The architecture-agnostic topology definitions are in
+Documentation/admin-guide/cputopology.rst. This file holds x86-specific
+differences/specialities which must not necessarily apply to the generic
+definitions. Thus, the way to read up on Linux topology on x86 is to start
+with the generic one and look at this one in parallel for the x86 specifics.
+
+Needless to say, code should use the generic functions - this file is *only*
+here to *document* the inner workings of x86 topology.
+
+Started by Thomas Gleixner <tglx@linutronix.de> and Borislav Petkov <bp@alien8.de>.
+
+The main aim of the topology facilities is to present adequate interfaces to
+code which needs to know/query/use the structure of the running system wrt
+threads, cores, packages, etc.
+
+The kernel does not care about the concept of physical sockets because a
+socket has no relevance to software. It's an electromechanical component. In
+the past a socket always contained a single package (see below), but with the
+advent of Multi Chip Modules (MCM) a socket can hold more than one package. So
+there might be still references to sockets in the code, but they are of
+historical nature and should be cleaned up.
+
+The topology of a system is described in the units of:
+
+    - packages
+    - cores
+    - threads
+
+Package
+=======
+Packages contain a number of cores plus shared resources, e.g. DRAM
+controller, shared caches etc.
+
+Modern systems may also use the term 'Die' for package.
+
+AMD nomenclature for package is 'Node'.
+
+Package-related topology information in the kernel:
+
+  - cpuinfo_x86.x86_max_cores:
+
+    The number of cores in a package. This information is retrieved via CPUID.
+
+  - cpuinfo_x86.x86_max_dies:
+
+    The number of dies in a package. This information is retrieved via CPUID.
+
+  - cpuinfo_x86.cpu_die_id:
+
+    The physical ID of the die. This information is retrieved via CPUID.
+
+  - cpuinfo_x86.phys_proc_id:
+
+    The physical ID of the package. This information is retrieved via CPUID
+    and deduced from the APIC IDs of the cores in the package.
+
+    Modern systems use this value for the socket. There may be multiple
+    packages within a socket. This value may differ from cpu_die_id.
+
+  - cpuinfo_x86.logical_proc_id:
+
+    The logical ID of the package. As we do not trust BIOSes to enumerate the
+    packages in a consistent way, we introduced the concept of logical package
+    ID so we can sanely calculate the number of maximum possible packages in
+    the system and have the packages enumerated linearly.
+
+  - topology_max_packages():
+
+    The maximum possible number of packages in the system. Helpful for per
+    package facilities to preallocate per package information.
+
+  - cpu_llc_id:
+
+    A per-CPU variable containing:
+
+      - On Intel, the first APIC ID of the list of CPUs sharing the Last Level
+        Cache
+
+      - On AMD, the Node ID or Core Complex ID containing the Last Level
+        Cache. In general, it is a number identifying an LLC uniquely on the
+        system.
+
+Cores
+=====
+A core consists of 1 or more threads. It does not matter whether the threads
+are SMT- or CMT-type threads.
+
+AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
+"core".
+
+Core-related topology information in the kernel:
+
+  - smp_num_siblings:
+
+    The number of threads in a core. The number of threads in a package can be
+    calculated by::
+
+	threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
+
+
+Threads
+=======
+A thread is a single scheduling unit. It's the equivalent to a logical Linux
+CPU.
+
+AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always
+uses "thread".
+
+Thread-related topology information in the kernel:
+
+  - topology_core_cpumask():
+
+    The cpumask contains all online threads in the package to which a thread
+    belongs.
+
+    The number of online threads is also printed in /proc/cpuinfo "siblings."
+
+  - topology_sibling_cpumask():
+
+    The cpumask contains all online threads in the core to which a thread
+    belongs.
+
+  - topology_logical_package_id():
+
+    The logical package ID to which a thread belongs.
+
+  - topology_physical_package_id():
+
+    The physical package ID to which a thread belongs.
+
+  - topology_core_id();
+
+    The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
+    "core_id."
+
+
+
+System topology examples
+========================
+
+.. note::
+  The alternative Linux CPU enumeration depends on how the BIOS enumerates the
+  threads. Many BIOSes enumerate all threads 0 first and then all threads 1.
+  That has the "advantage" that the logical Linux CPU numbers of threads 0 stay
+  the same whether threads are enabled or not. That's merely an implementation
+  detail and has no practical impact.
+
+1) Single Package, Single Core::
+
+   [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+
+2) Single Package, Dual Core
+
+   a) One thread per core::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+		    -> [core 1] -> [thread 0] -> Linux CPU 1
+
+   b) Two threads per core::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+				-> [thread 1] -> Linux CPU 1
+		    -> [core 1] -> [thread 0] -> Linux CPU 2
+				-> [thread 1] -> Linux CPU 3
+
+      Alternative enumeration::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+				-> [thread 1] -> Linux CPU 2
+		    -> [core 1] -> [thread 0] -> Linux CPU 1
+				-> [thread 1] -> Linux CPU 3
+
+      AMD nomenclature for CMT systems::
+
+	[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
+				     -> [Compute Unit Core 1] -> Linux CPU 1
+		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
+				     -> [Compute Unit Core 1] -> Linux CPU 3
+
+4) Dual Package, Dual Core
+
+   a) One thread per core::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+		    -> [core 1] -> [thread 0] -> Linux CPU 1
+
+	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
+		    -> [core 1] -> [thread 0] -> Linux CPU 3
+
+   b) Two threads per core::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+				-> [thread 1] -> Linux CPU 1
+		    -> [core 1] -> [thread 0] -> Linux CPU 2
+				-> [thread 1] -> Linux CPU 3
+
+	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 4
+				-> [thread 1] -> Linux CPU 5
+		    -> [core 1] -> [thread 0] -> Linux CPU 6
+				-> [thread 1] -> Linux CPU 7
+
+      Alternative enumeration::
+
+	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+				-> [thread 1] -> Linux CPU 4
+		    -> [core 1] -> [thread 0] -> Linux CPU 1
+				-> [thread 1] -> Linux CPU 5
+
+	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
+				-> [thread 1] -> Linux CPU 6
+		    -> [core 1] -> [thread 0] -> Linux CPU 3
+				-> [thread 1] -> Linux CPU 7
+
+      AMD nomenclature for CMT systems::
+
+	[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
+				     -> [Compute Unit Core 1] -> Linux CPU 1
+		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
+				     -> [Compute Unit Core 1] -> Linux CPU 3
+
+	[node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4
+				     -> [Compute Unit Core 1] -> Linux CPU 5
+		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6
+				     -> [Compute Unit Core 1] -> Linux CPU 7
diff --git a/Documentation/arch/x86/tsx_async_abort.rst b/Documentation/arch/x86/tsx_async_abort.rst
new file mode 100644
index 000000000000..583ddc185ba2
--- /dev/null
+++ b/Documentation/arch/x86/tsx_async_abort.rst
@@ -0,0 +1,117 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+TSX Async Abort (TAA) mitigation
+================================
+
+.. _tsx_async_abort:
+
+Overview
+--------
+
+TSX Async Abort (TAA) is a side channel attack on internal buffers in some
+Intel processors similar to Microachitectural Data Sampling (MDS).  In this
+case certain loads may speculatively pass invalid data to dependent operations
+when an asynchronous abort condition is pending in a Transactional
+Synchronization Extensions (TSX) transaction.  This includes loads with no
+fault or assist condition. Such loads may speculatively expose stale data from
+the same uarch data structures as in MDS, with same scope of exposure i.e.
+same-thread and cross-thread. This issue affects all current processors that
+support TSX.
+
+Mitigation strategy
+-------------------
+
+a) TSX disable - one of the mitigations is to disable TSX. A new MSR
+IA32_TSX_CTRL will be available in future and current processors after
+microcode update which can be used to disable TSX. In addition, it
+controls the enumeration of the TSX feature bits (RTM and HLE) in CPUID.
+
+b) Clear CPU buffers - similar to MDS, clearing the CPU buffers mitigates this
+vulnerability. More details on this approach can be found in
+:ref:`Documentation/admin-guide/hw-vuln/mds.rst <mds>`.
+
+Kernel internal mitigation modes
+--------------------------------
+
+ =============    ============================================================
+ off              Mitigation is disabled. Either the CPU is not affected or
+                  tsx_async_abort=off is supplied on the kernel command line.
+
+ tsx disabled     Mitigation is enabled. TSX feature is disabled by default at
+                  bootup on processors that support TSX control.
+
+ verw             Mitigation is enabled. CPU is affected and MD_CLEAR is
+                  advertised in CPUID.
+
+ ucode needed     Mitigation is enabled. CPU is affected and MD_CLEAR is not
+                  advertised in CPUID. That is mainly for virtualization
+                  scenarios where the host has the updated microcode but the
+                  hypervisor does not expose MD_CLEAR in CPUID. It's a best
+                  effort approach without guarantee.
+ =============    ============================================================
+
+If the CPU is affected and the "tsx_async_abort" kernel command line parameter is
+not provided then the kernel selects an appropriate mitigation depending on the
+status of RTM and MD_CLEAR CPUID bits.
+
+Below tables indicate the impact of tsx=on|off|auto cmdline options on state of
+TAA mitigation, VERW behavior and TSX feature for various combinations of
+MSR_IA32_ARCH_CAPABILITIES bits.
+
+1. "tsx=off"
+
+=========  =========  ============  ============  ==============  ===================  ======================
+MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=off
+----------------------------------  -------------------------------------------------------------------------
+TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
+                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
+=========  =========  ============  ============  ==============  ===================  ======================
+    0          0           0         HW default         Yes           Same as MDS           Same as MDS
+    0          0           1        Invalid case   Invalid case       Invalid case          Invalid case
+    0          1           0         HW default         No         Need ucode update     Need ucode update
+    0          1           1          Disabled          Yes           TSX disabled          TSX disabled
+    1          X           1          Disabled           X             None needed           None needed
+=========  =========  ============  ============  ==============  ===================  ======================
+
+2. "tsx=on"
+
+=========  =========  ============  ============  ==============  ===================  ======================
+MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=on
+----------------------------------  -------------------------------------------------------------------------
+TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
+                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
+=========  =========  ============  ============  ==============  ===================  ======================
+    0          0           0         HW default        Yes            Same as MDS          Same as MDS
+    0          0           1        Invalid case   Invalid case       Invalid case         Invalid case
+    0          1           0         HW default        No          Need ucode update     Need ucode update
+    0          1           1          Enabled          Yes               None              Same as MDS
+    1          X           1          Enabled          X              None needed          None needed
+=========  =========  ============  ============  ==============  ===================  ======================
+
+3. "tsx=auto"
+
+=========  =========  ============  ============  ==============  ===================  ======================
+MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=auto
+----------------------------------  -------------------------------------------------------------------------
+TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
+                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
+=========  =========  ============  ============  ==============  ===================  ======================
+    0          0           0         HW default    Yes                Same as MDS           Same as MDS
+    0          0           1        Invalid case  Invalid case        Invalid case          Invalid case
+    0          1           0         HW default    No              Need ucode update     Need ucode update
+    0          1           1          Disabled      Yes               TSX disabled          TSX disabled
+    1          X           1          Enabled       X                 None needed           None needed
+=========  =========  ============  ============  ==============  ===================  ======================
+
+In the tables, TSX_CTRL_MSR is a new bit in MSR_IA32_ARCH_CAPABILITIES that
+indicates whether MSR_IA32_TSX_CTRL is supported.
+
+There are two control bits in IA32_TSX_CTRL MSR:
+
+      Bit 0: When set it disables the Restricted Transactional Memory (RTM)
+             sub-feature of TSX (will force all transactions to abort on the
+             XBEGIN instruction).
+
+      Bit 1: When set it disables the enumeration of the RTM and HLE feature
+             (i.e. it will make CPUID(EAX=7).EBX{bit4} and
+             CPUID(EAX=7).EBX{bit11} read as 0).
diff --git a/Documentation/arch/x86/usb-legacy-support.rst b/Documentation/arch/x86/usb-legacy-support.rst
new file mode 100644
index 000000000000..e01c08b7c981
--- /dev/null
+++ b/Documentation/arch/x86/usb-legacy-support.rst
@@ -0,0 +1,50 @@
+
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+USB Legacy support
+==================
+
+:Author: Vojtech Pavlik <vojtech@suse.cz>, January 2004
+
+
+Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
+feature that allows one to use the USB mouse and keyboard as if they were
+their classic PS/2 counterparts.  This means one can use an USB keyboard to
+type in LILO for example.
+
+It has several drawbacks, though:
+
+1) On some machines, the emulated PS/2 mouse takes over even when no USB
+   mouse is present and a real PS/2 mouse is present.  In that case the extra
+   features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
+   not be available.
+
+2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
+   system crashes, because the SMM BIOS is not expecting to be in PAE mode.
+   The Intel E7505 is a typical machine where this happens.
+
+3) If AMD64 64-bit mode is enabled, again system crashes often happen,
+   because the SMM BIOS isn't expecting the CPU to be in 64-bit mode.  The
+   BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
+   yet.
+
+Solutions:
+
+Problem 1)
+  can be solved by loading the USB drivers prior to loading the
+  PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
+  the kernel unconditionally, this means the USB drivers need to be
+  compiled-in, too.
+
+Problem 2)
+  can currently only be solved by either disabling HIGHMEM64G
+  in the kernel config or USB Legacy support in the BIOS. A BIOS update
+  could help, but so far no such update exists.
+
+Problem 3)
+  is usually fixed by a BIOS update. Check the board
+  manufacturers web site. If an update is not available, disable USB
+  Legacy support in the BIOS. If this alone doesn't help, try also adding
+  idle=poll on the kernel command line. The BIOS may be entering the SMM
+  on the HLT instruction as well.
diff --git a/Documentation/arch/x86/x86_64/5level-paging.rst b/Documentation/arch/x86/x86_64/5level-paging.rst
new file mode 100644
index 000000000000..71f882f4a173
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/5level-paging.rst
@@ -0,0 +1,67 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+5-level paging
+==============
+
+Overview
+========
+Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
+space and 64 TiB of physical address space. We are already bumping into
+this limit: some vendors offer servers with 64 TiB of memory today.
+
+To overcome the limitation upcoming hardware will introduce support for
+5-level paging. It is a straight-forward extension of the current page
+table structure adding one more layer of translation.
+
+It bumps the limits to 128 PiB of virtual address space and 4 PiB of
+physical address space. This "ought to be enough for anybody" ©.
+
+QEMU 2.9 and later support 5-level paging.
+
+Virtual memory layout for 5-level paging is described in
+Documentation/arch/x86/x86_64/mm.rst
+
+
+Enabling 5-level paging
+=======================
+CONFIG_X86_5LEVEL=y enables the feature.
+
+Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware.
+In this case additional page table level -- p4d -- will be folded at
+runtime.
+
+User-space and large virtual address space
+==========================================
+On x86, 5-level paging enables 56-bit userspace virtual address space.
+Not all user space is ready to handle wide addresses. It's known that
+at least some JIT compilers use higher bits in pointers to encode their
+information. It collides with valid pointers with 5-level paging and
+leads to crashes.
+
+To mitigate this, we are not going to allocate virtual address space
+above 47-bit by default.
+
+But userspace can ask for allocation from full address space by
+specifying hint address (with or without MAP_FIXED) above 47-bits.
+
+If hint address set above 47-bit, but MAP_FIXED is not specified, we try
+to look for unmapped area by specified address. If it's already
+occupied, we look for unmapped area in *full* address space, rather than
+from 47-bit window.
+
+A high hint address would only affect the allocation in question, but not
+any future mmap()s.
+
+Specifying high hint address on older kernel or on machine without 5-level
+paging support is safe. The hint will be ignored and kernel will fall back
+to allocation from 47-bit address space.
+
+This approach helps to easily make application's memory allocator aware
+about large address space without manually tracking allocated virtual
+address space.
+
+One important case we need to handle here is interaction with MPX.
+MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
+need to make sure that MPX cannot be enabled we already have VMA above
+the boundary and forbid creating such VMAs once MPX is enabled.
diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
new file mode 100644
index 000000000000..137432d34109
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/boot-options.rst
@@ -0,0 +1,319 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+AMD64 Specific Boot Options
+===========================
+
+There are many others (usually documented in driver documentation), but
+only the AMD64 specific ones are listed here.
+
+Machine check
+=============
+Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
+
+   mce=off
+		Disable machine check
+   mce=no_cmci
+		Disable CMCI(Corrected Machine Check Interrupt) that
+		Intel processor supports.  Usually this disablement is
+		not recommended, but it might be handy if your hardware
+		is misbehaving.
+		Note that you'll get more problems without CMCI than with
+		due to the shared banks, i.e. you might get duplicated
+		error logs.
+   mce=dont_log_ce
+		Don't make logs for corrected errors.  All events reported
+		as corrected are silently cleared by OS.
+		This option will be useful if you have no interest in any
+		of corrected errors.
+   mce=ignore_ce
+		Disable features for corrected errors, e.g. polling timer
+		and CMCI.  All events reported as corrected are not cleared
+		by OS and remained in its error banks.
+		Usually this disablement is not recommended, however if
+		there is an agent checking/clearing corrected errors
+		(e.g. BIOS or hardware monitoring applications), conflicting
+		with OS's error handling, and you cannot deactivate the agent,
+		then this option will be a help.
+   mce=no_lmce
+		Do not opt-in to Local MCE delivery. Use legacy method
+		to broadcast MCEs.
+   mce=bootlog
+		Enable logging of machine checks left over from booting.
+		Disabled by default on AMD Fam10h and older because some BIOS
+		leave bogus ones.
+		If your BIOS doesn't do that it's a good idea to enable though
+		to make sure you log even machine check events that result
+		in a reboot. On Intel systems it is enabled by default.
+   mce=nobootlog
+		Disable boot machine check logging.
+   mce=monarchtimeout (number)
+		monarchtimeout:
+		Sets the time in us to wait for other CPUs on machine checks. 0
+		to disable.
+   mce=bios_cmci_threshold
+		Don't overwrite the bios-set CMCI threshold. This boot option
+		prevents Linux from overwriting the CMCI threshold set by the
+		bios. Without this option, Linux always sets the CMCI
+		threshold to 1. Enabling this may make memory predictive failure
+		analysis less effective if the bios sets thresholds for memory
+		errors since we will not see details for all errors.
+   mce=recovery
+		Force-enable recoverable machine check code paths
+
+   nomce (for compatibility with i386)
+		same as mce=off
+
+   Everything else is in sysfs now.
+
+APICs
+=====
+
+   apic
+	Use IO-APIC. Default
+
+   noapic
+	Don't use the IO-APIC.
+
+   disableapic
+	Don't use the local APIC
+
+   nolapic
+     Don't use the local APIC (alias for i386 compatibility)
+
+   pirq=...
+	See Documentation/arch/x86/i386/IO-APIC.rst
+
+   noapictimer
+	Don't set up the APIC timer
+
+   no_timer_check
+	Don't check the IO-APIC timer. This can work around
+	problems with incorrect timer initialization on some boards.
+
+   apicpmtimer
+	Do APIC timer calibration using the pmtimer. Implies
+	apicmaintimer. Useful when your PIT timer is totally broken.
+
+Timing
+======
+
+  notsc
+    Deprecated, use tsc=unstable instead.
+
+  nohpet
+    Don't use the HPET timer.
+
+Idle loop
+=========
+
+  idle=poll
+    Don't do power saving in the idle loop using HLT, but poll for rescheduling
+    event. This will make the CPUs eat a lot more power, but may be useful
+    to get slightly better performance in multiprocessor benchmarks. It also
+    makes some profiling using performance counters more accurate.
+    Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
+    CPUs) this option has no performance advantage over the normal idle loop.
+    It may also interact badly with hyperthreading.
+
+Rebooting
+=========
+
+   reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old]
+      bios
+        Use the CPU reboot vector for warm reset
+      warm
+        Don't set the cold reboot flag
+      cold
+        Set the cold reboot flag
+      triple
+        Force a triple fault (init)
+      kbd
+        Use the keyboard controller. cold reset (default)
+      acpi
+        Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
+        the ACPI reset does not work, the reboot path attempts the reset
+        using the keyboard controller.
+      efi
+        Use efi reset_system runtime service. If EFI is not configured or
+        the EFI reset does not work, the reboot path attempts the reset using
+        the keyboard controller.
+      pci
+        Use a write to the PCI config space register 0xcf9 to trigger reboot.
+
+   Using warm reset will be much faster especially on big memory
+   systems because the BIOS will not go through the memory check.
+   Disadvantage is that not all hardware will be completely reinitialized
+   on reboot so there may be boot problems on some systems.
+
+   reboot=force
+     Don't stop other CPUs on reboot. This can make reboot more reliable
+     in some cases.
+
+   reboot=default
+     There are some built-in platform specific "quirks" - you may see:
+     "reboot: <name> series board detected. Selecting <type> for reboots."
+     In the case where you think the quirk is in error (e.g. you have
+     newer BIOS, or newer board) using this option will ignore the built-in
+     quirk table, and use the generic default reboot actions.
+
+NUMA
+====
+
+  numa=off
+    Only set up a single NUMA node spanning all memory.
+
+  numa=noacpi
+    Don't parse the SRAT table for NUMA setup
+
+  numa=nohmat
+    Don't parse the HMAT table for NUMA setup, or soft-reserved memory
+    partitioning.
+
+  numa=fake=<size>[MG]
+    If given as a memory unit, fills all system RAM with nodes of
+    size interleaved over physical nodes.
+
+  numa=fake=<N>
+    If given as an integer, fills all system RAM with N fake nodes
+    interleaved over physical nodes.
+
+  numa=fake=<N>U
+    If given as an integer followed by 'U', it will divide each
+    physical node into N emulated nodes.
+
+ACPI
+====
+
+  acpi=off
+    Don't enable ACPI
+  acpi=ht
+    Use ACPI boot table parsing, but don't enable ACPI interpreter
+  acpi=force
+    Force ACPI on (currently not needed)
+  acpi=strict
+    Disable out of spec ACPI workarounds.
+  acpi_sci={edge,level,high,low}
+    Set up ACPI SCI interrupt.
+  acpi=noirq
+    Don't route interrupts
+  acpi=nocmcff
+    Disable firmware first mode for corrected errors. This
+    disables parsing the HEST CMC error source to check if
+    firmware has set the FF flag. This may result in
+    duplicate corrected error reports.
+
+PCI
+===
+
+  pci=off
+    Don't use PCI
+  pci=conf1
+    Use conf1 access.
+  pci=conf2
+    Use conf2 access.
+  pci=rom
+    Assign ROMs.
+  pci=assign-busses
+    Assign busses
+  pci=irqmask=MASK
+    Set PCI interrupt mask to MASK
+  pci=lastbus=NUMBER
+    Scan up to NUMBER busses, no matter what the mptable says.
+  pci=noacpi
+    Don't use ACPI to set up PCI interrupt routing.
+
+IOMMU (input/output memory management unit)
+===========================================
+Multiple x86-64 PCI-DMA mapping implementations exist, for example:
+
+   1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all
+      (e.g. because you have < 3 GB memory).
+      Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+   2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
+      Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+   3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+      e.g. if there is no hardware IOMMU in the system and it is need because
+      you have >3GB memory or told the kernel to us it (iommu=soft))
+      Kernel boot message: "PCI-DMA: Using software bounce buffering
+      for IO (SWIOTLB)"
+
+::
+
+  iommu=[<size>][,noagp][,off][,force][,noforce]
+  [,memaper[=<order>]][,merge][,fullflush][,nomerge]
+  [,noaperture]
+
+General iommu options:
+
+    off
+      Don't initialize and use any kind of IOMMU.
+    noforce
+      Don't force hardware IOMMU usage when it is not needed. (default).
+    force
+      Force the use of the hardware IOMMU even when it is
+      not actually needed (e.g. because < 3 GB memory).
+    soft
+      Use software bounce buffering (SWIOTLB) (default for
+      Intel machines). This can be used to prevent the usage
+      of an available hardware IOMMU.
+
+iommu options only relevant to the AMD GART hardware IOMMU:
+
+    <size>
+      Set the size of the remapping area in bytes.
+    allowed
+      Overwrite iommu off workarounds for specific chipsets.
+    fullflush
+      Flush IOMMU on each allocation (default).
+    nofullflush
+      Don't use IOMMU fullflush.
+    memaper[=<order>]
+      Allocate an own aperture over RAM with size 32MB<<order.
+      (default: order=1, i.e. 64MB)
+    merge
+      Do scatter-gather (SG) merging. Implies "force" (experimental).
+    nomerge
+      Don't do scatter-gather (SG) merging.
+    noaperture
+      Ask the IOMMU not to touch the aperture for AGP.
+    noagp
+      Don't initialize the AGP driver and use full aperture.
+    panic
+      Always panic when IOMMU overflows.
+
+iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
+implementation:
+
+    swiotlb=<slots>[,force,noforce]
+      <slots>
+        Prereserve that many 2K slots for the software IO bounce buffering.
+      force
+        Force all IO through the software TLB.
+      noforce
+        Do not initialize the software TLB.
+
+
+Miscellaneous
+=============
+
+  nogbpages
+    Do not use GB pages for kernel direct mappings.
+  gbpages
+    Use GB pages for kernel direct mappings.
+
+
+AMD SEV (Secure Encrypted Virtualization)
+=========================================
+Options relating to AMD SEV, specified via the following format:
+
+::
+
+   sev=option1[,option2]
+
+The available options are:
+
+   debug
+     Enable debug messages.
diff --git a/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst
new file mode 100644
index 000000000000..8d1c91f0c880
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================
+Firmware support for CPU hotplug under Linux/x86-64
+===================================================
+
+Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
+know in advance of boot time the maximum number of CPUs that could be plugged
+into the system. ACPI 3.0 currently has no official way to supply
+this information from the firmware to the operating system.
+
+In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
+ACPI 3.0 specification).  ACPI already has the concept of disabled LAPIC
+objects by setting the Enabled bit in the LAPIC object to zero.
+
+For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
+CPU is already available in the MADT. If the CPU is not available yet
+it should have its LAPIC Enabled bit set to 0. Linux will use the number
+of disabled LAPICs to compute the maximum number of future CPUs.
+
+In the worst case the user can overwrite this choice using a command line
+option (additional_cpus=...), but it is recommended to supply the correct
+number (or a reasonable approximation of it, with erring towards more not less)
+in the MADT to avoid manual configuration.
diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
new file mode 100644
index 000000000000..ba74617d4999
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Fake NUMA For CPUSets
+=====================
+
+:Author: David Rientjes <rientjes@cs.washington.edu>
+
+Using numa=fake and CPUSets for Resource Management
+
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management.  Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks.  This is a way of limiting the
+amount of system memory that are available to a certain class of tasks.
+
+For more information on the features of cpusets, see
+Documentation/admin-guide/cgroup-v1/cpusets.rst.
+There are a number of different configurations you can use for your needs.  For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst.
+
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,".  This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets.  As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
+
+	Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+	Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+	Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+	Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+	...
+	On node 0 totalpages: 130975
+	On node 1 totalpages: 131072
+	On node 2 totalpages: 131072
+	On node 3 totalpages: 131072
+
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets::
+
+	[root@xroads /]# mkdir exampleset
+	[root@xroads /]# mount -t cpuset none exampleset
+	[root@xroads /]# mkdir exampleset/ddset
+	[root@xroads /]# cd exampleset/ddset
+	[root@xroads /exampleset/ddset]# echo 0-1 > cpus
+	[root@xroads /exampleset/ddset]# echo 0-1 > mems
+
+Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems::
+
+	[root@xroads /exampleset/ddset]# echo $$ > tasks
+	[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
+	[1] 13425
+
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+
+	========	============	==========
+	Name		Unrestricted	Restricted
+	========	============	==========
+	MemTotal	3091900 kB	3091900 kB
+	MemFree		42113 kB	1513236 kB
+	========	============	==========
+
+This allows for coarse memory management for the tasks you assign to particular
+cpusets.  Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst
new file mode 100644
index 000000000000..50960e09e1f6
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fsgs.rst
@@ -0,0 +1,199 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Using FS and GS segments in user space applications
+===================================================
+
+The x86 architecture supports segmentation. Instructions which access
+memory can use segment register based addressing mode. The following
+notation is used to address a byte within a segment:
+
+  Segment-register:Byte-address
+
+The segment base address is added to the Byte-address to compute the
+resulting virtual address which is accessed. This allows to access multiple
+instances of data with the identical Byte-address, i.e. the same code. The
+selection of a particular instance is purely based on the base-address in
+the segment register.
+
+In 32-bit mode the CPU provides 6 segments, which also support segment
+limits. The limits can be used to enforce address space protections.
+
+In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is
+always 0 to provide a full 64bit address space. The FS and GS segments are
+still functional in 64-bit mode.
+
+Common FS and GS usage
+------------------------------
+
+The FS segment is commonly used to address Thread Local Storage (TLS). FS
+is usually managed by runtime code or a threading library. Variables
+declared with the '__thread' storage class specifier are instantiated per
+thread and the compiler emits the FS: address prefix for accesses to these
+variables. Each thread has its own FS base address so common code can be
+used without complex address offset calculations to access the per thread
+instances. Applications should not use FS for other purposes when they use
+runtimes or threading libraries which manage the per thread FS.
+
+The GS segment has no common use and can be used freely by
+applications. GCC and Clang support GS based addressing via address space
+identifiers.
+
+Reading and writing the FS/GS base address
+------------------------------------------
+
+There exist two mechanisms to read and write the FS/GS base address:
+
+ - the arch_prctl() system call
+
+ - the FSGSBASE instruction family
+
+Accessing FS/GS base with arch_prctl()
+--------------------------------------
+
+ The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all
+ kernel versions.
+
+ Reading the base:
+
+   arch_prctl(ARCH_GET_FS, &fsbase);
+   arch_prctl(ARCH_GET_GS, &gsbase);
+
+ Writing the base:
+
+   arch_prctl(ARCH_SET_FS, fsbase);
+   arch_prctl(ARCH_SET_GS, gsbase);
+
+ The ARCH_SET_GS prctl may be disabled depending on kernel configuration
+ and security settings.
+
+Accessing FS/GS base with the FSGSBASE instructions
+---------------------------------------------------
+
+ With the Ivy Bridge CPU generation Intel introduced a new set of
+ instructions to access the FS and GS base registers directly from user
+ space. These instructions are also supported on AMD Family 17H CPUs. The
+ following instructions are available:
+
+  =============== ===========================
+  RDFSBASE %reg   Read the FS base register
+  RDGSBASE %reg   Read the GS base register
+  WRFSBASE %reg   Write the FS base register
+  WRGSBASE %reg   Write the GS base register
+  =============== ===========================
+
+ The instructions avoid the overhead of the arch_prctl() syscall and allow
+ more flexible usage of the FS/GS addressing modes in user space
+ applications. This does not prevent conflicts between threading libraries
+ and runtimes which utilize FS and applications which want to use it for
+ their own purpose.
+
+FSGSBASE instructions enablement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If
+ available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs.
+
+ The availability of the instructions does not enable them
+ automatically. The kernel has to enable them explicitly in CR4. The
+ reason for this is that older kernels make assumptions about the values in
+ the GS register and enforce them when GS base is set via
+ arch_prctl(). Allowing user space to write arbitrary values to GS base
+ would violate these assumptions and cause malfunction.
+
+ On kernels which do not enable FSGSBASE the execution of the FSGSBASE
+ instructions will fault with a #UD exception.
+
+ The kernel provides reliable information about the enabled state in the
+ ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the
+ kernel has FSGSBASE instructions enabled and applications can use them.
+ The following code example shows how this detection works::
+
+   #include <sys/auxv.h>
+   #include <elf.h>
+
+   /* Will be eventually in asm/hwcap.h */
+   #ifndef HWCAP2_FSGSBASE
+   #define HWCAP2_FSGSBASE        (1 << 1)
+   #endif
+
+   ....
+
+   unsigned val = getauxval(AT_HWCAP2);
+
+   if (val & HWCAP2_FSGSBASE)
+        printf("FSGSBASE enabled\n");
+
+FSGSBASE instructions compiler support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
+instructions. Clang 5 supports them as well.
+
+  =================== ===========================
+  _readfsbase_u64()   Read the FS base register
+  _readfsbase_u64()   Read the GS base register
+  _writefsbase_u64()  Write the FS base register
+  _writegsbase_u64()  Write the GS base register
+  =================== ===========================
+
+To utilize these instrinsics <immintrin.h> must be included in the source
+code and the compiler option -mfsgsbase has to be added.
+
+Compiler support for FS/GS based addressing
+-------------------------------------------
+
+GCC version 6 and newer provide support for FS/GS based addressing via
+Named Address Spaces. GCC implements the following address space
+identifiers for x86:
+
+  ========= ====================================
+  __seg_fs  Variable is addressed relative to FS
+  __seg_gs  Variable is addressed relative to GS
+  ========= ====================================
+
+The preprocessor symbols __SEG_FS and __SEG_GS are defined when these
+address spaces are supported. Code which implements fallback modes should
+check whether these symbols are defined. Usage example::
+
+  #ifdef __SEG_GS
+
+  long data0 = 0;
+  long data1 = 1;
+
+  long __seg_gs *ptr;
+
+  /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */
+  ....
+
+  /* Set GS base to point to data0 */
+  _writegsbase_u64(&data0);
+
+  /* Access offset 0 of GS */
+  ptr = 0;
+  printf("data0 = %ld\n", *ptr);
+
+  /* Set GS base to point to data1 */
+  _writegsbase_u64(&data1);
+  /* ptr still addresses offset 0! */
+  printf("data1 = %ld\n", *ptr);
+
+
+Clang does not provide the GCC address space identifiers, but it provides
+address spaces via an attribute based mechanism in Clang 2.6 and newer
+versions:
+
+ ==================================== =====================================
+  __attribute__((address_space(256))  Variable is addressed relative to GS
+  __attribute__((address_space(257))  Variable is addressed relative to FS
+ ==================================== =====================================
+
+FS/GS based addressing with inline assembly
+-------------------------------------------
+
+In case the compiler does not support address spaces, inline assembly can
+be used for FS/GS based addressing mode::
+
+	mov %fs:offset, %reg
+	mov %gs:offset, %reg
+
+	mov %reg, %fs:offset
+	mov %reg, %gs:offset
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
new file mode 100644
index 000000000000..a56070fc8e77
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+x86_64 Support
+==============
+
+.. toctree::
+   :maxdepth: 2
+
+   boot-options
+   uefi
+   mm
+   5level-paging
+   fake-numa-for-cpusets
+   cpu-hotplug-spec
+   machinecheck
+   fsgs
diff --git a/Documentation/arch/x86/x86_64/machinecheck.rst b/Documentation/arch/x86/x86_64/machinecheck.rst
new file mode 100644
index 000000000000..cea12ee97200
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/machinecheck.rst
@@ -0,0 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================================
+Configurable sysfs parameters for the x86-64 machine check code
+===============================================================
+
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevent is CPU specific.
+
+mcelog knows how to decode them.
+
+When you see the "Machine check errors logged" message in the system
+log then mcelog should run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number).
+
+The directory contains some configurable entries. See
+Documentation/ABI/testing/sysfs-mce for more details.
+
+TBD document entries for AMD threshold interrupt configuration
+
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+
+For more details about the architecture
+see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
new file mode 100644
index 000000000000..35e5e18c83d0
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Memory Management
+=================
+
+Complete virtual memory map with 4-level page tables
+====================================================
+
+.. note::
+
+ - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
+   from the top of the 64-bit address space. It's easier to understand the layout
+   when seen both in absolute addresses and in distance-from-top notation.
+
+   For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
+   64-bit address space (ffffffffffffffff).
+
+   Note that as we get closer to the top of the address space, the notation changes
+   from TB to GB and then MB/KB.
+
+ - "16M TB" might look weird at first sight, but it's an easier way to visualize size
+   notation than "16 EB", which few will recognize at first sight as 16 exabytes.
+   It also shows it nicely how incredibly large 64-bit address space is.
+
+::
+
+  ========================================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  ========================================================================================================================
+                    |            |                  |         |
+   0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|___________________________________________________________
+                    |            |                  |         |
+   0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -128 TB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|___________________________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|___________________________________________________________
+                    |            |                  |         |
+   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
+   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+   ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
+   ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
+   ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
+   ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
+   ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
+   ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
+   ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
+  __________________|____________|__________________|_________|____________________________________________________________
+                                                              |
+                                                              | Identical layout to the 56-bit one from here on:
+  ____________________________________________________________|____________________________________________________________
+                    |            |                  |         |
+   fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                    |            |                  |         | vaddr_end for KASLR
+   fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+   fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+   ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
+   ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
+   ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
+   ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |-2048    MB |                  |         |
+   ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
+   ffffffffff000000 |  -16    MB |                  |         |
+      FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+   ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
+   ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
+  __________________|____________|__________________|_________|___________________________________________________________
+
+
+Complete virtual memory map with 5-level page tables
+====================================================
+
+.. note::
+
+ - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
+   from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
+   offset and many of the regions expand to support the much larger physical
+   memory supported.
+
+::
+
+  ========================================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  ========================================================================================================================
+                    |            |                  |         |
+   0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|___________________________________________________________
+                    |            |                  |         |
+   0100000000000000 |  +64    PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -64 PB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|___________________________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|___________________________________________________________
+                    |            |                  |         |
+   ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
+   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+   ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
+   ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
+   ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
+   ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
+   ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
+   ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
+   ffdf000000000000 |   -8.25 PB | fffffbffffffffff |   ~8 PB | KASAN shadow memory
+  __________________|____________|__________________|_________|____________________________________________________________
+                                                              |
+                                                              | Identical layout to the 47-bit one from here on:
+  ____________________________________________________________|____________________________________________________________
+                    |            |                  |         |
+   fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                    |            |                  |         | vaddr_end for KASLR
+   fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+   fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+   ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
+   ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
+   ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
+   ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |-2048    MB |                  |         |
+   ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
+   ffffffffff000000 |  -16    MB |                  |         |
+      FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+   ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
+   ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
+  __________________|____________|__________________|_________|___________________________________________________________
+
+Architecture defines a 64-bit virtual address. Implementations can support
+less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
+through to the most-significant implemented bit are sign extended.
+This causes hole between user space and kernel addresses if you interpret them
+as unsigned.
+
+The direct mapping covers all memory in the system up to the highest
+memory address (this means in some cases it can also include PCI memory
+holes).
+
+We map EFI runtime services in the 'efi_pgd' PGD in a 64GB large virtual
+memory window (this size is arbitrary, it can be raised later if needed).
+The mappings are not part of any other kernel PGD and are only available
+during EFI runtime calls.
+
+Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
+physical memory, vmalloc/ioremap space and virtual memory map are randomized.
+Their order is preserved but their base will be offset early at boot time.
+
+Be very careful vs. KASLR when changing anything here. The KASLR address
+range must not overlap with anything except the KASAN shadow area, which is
+correct as KASAN disables KASLR.
+
+For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
+hole: ffffffffffff4111
diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst
new file mode 100644
index 000000000000..fbc30c9a071d
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/uefi.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+General note on [U]EFI x86_64 support
+=====================================
+
+The nomenclature EFI and UEFI are used interchangeably in this document.
+
+Although the tools below are _not_ needed for building the kernel,
+the needed bootloader support and associated tools for x86_64 platforms
+with EFI firmware and specifications are listed below.
+
+1. UEFI specification:  http://www.uefi.org
+
+2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
+   support. Elilo with x86_64 support can be used.
+
+3. x86_64 platform with EFI/UEFI firmware.
+
+Mechanics
+---------
+
+- Build the kernel with the following configuration::
+
+	CONFIG_FB_EFI=y
+	CONFIG_FRAMEBUFFER_CONSOLE=y
+
+  If EFI runtime services are expected, the following configuration should
+  be selected::
+
+	CONFIG_EFI=y
+	CONFIG_EFIVAR_FS=y or m		# optional
+
+- Create a VFAT partition on the disk
+- Copy the following to the VFAT partition:
+
+	elilo bootloader with x86_64 support, elilo configuration file,
+	kernel image built in first step and corresponding
+	initrd. Instructions on building elilo and its dependencies
+	can be found in the elilo sourceforge project.
+
+- Boot to EFI shell and invoke elilo choosing the kernel image built
+  in first step.
+- If some or all EFI runtime services don't work, you can try following
+  kernel command line parameters to turn off some or all EFI runtime
+  services.
+
+	noefi
+		turn off all EFI runtime services
+	reboot_type=k
+		turn off EFI reboot runtime service
+
+- If the EFI memory map has additional entries not in the E820 map,
+  you can include those entries in the kernels memory map of available
+  physical RAM by using the following kernel command line parameter.
+
+	add_efi_memmap
+		include EFI memory map of available physical RAM
diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst
new file mode 100644
index 000000000000..ae5c69e48b11
--- /dev/null
+++ b/Documentation/arch/x86/xstate.rst
@@ -0,0 +1,174 @@
+Using XSTATE features in user space applications
+================================================
+
+The x86 architecture supports floating-point extensions which are
+enumerated via CPUID. Applications consult CPUID and use XGETBV to
+evaluate which features have been enabled by the kernel XCR0.
+
+Up to AVX-512 and PKRU states, these features are automatically enabled by
+the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
+are enabled by XCR0 as well, but the first use of related instruction is
+trapped by the kernel because by default the required large XSTATE buffers
+are not allocated automatically.
+
+The purpose for dynamic features
+--------------------------------
+
+Legacy userspace libraries often have hard-coded, static sizes for
+alternate signal stacks, often using MINSIGSTKSZ which is typically 2KB.
+That stack must be able to store at *least* the signal frame that the
+kernel sets up before jumping into the signal handler. That signal frame
+must include an XSAVE buffer defined by the CPU.
+
+However, that means that the size of signal stacks is dynamic, not static,
+because different CPUs have differently-sized XSAVE buffers. A compiled-in
+size of 2KB with existing applications is too small for new CPU features
+like AMX. Instead of universally requiring larger stack, with the dynamic
+enabling, the kernel can enforce userspace applications to have
+properly-sized altstacks.
+
+Using dynamically enabled XSTATE features in user space applications
+--------------------------------------------------------------------
+
+The kernel provides an arch_prctl(2) based mechanism for applications to
+request the usage of such features. The arch_prctl(2) options related to
+this are:
+
+-ARCH_GET_XCOMP_SUPP
+
+ arch_prctl(ARCH_GET_XCOMP_SUPP, &features);
+
+ ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
+ type uint64_t. The second argument is a pointer to that storage.
+
+-ARCH_GET_XCOMP_PERM
+
+ arch_prctl(ARCH_GET_XCOMP_PERM, &features);
+
+ ARCH_GET_XCOMP_PERM stores the features for which the userspace process
+ has permission in userspace storage of type uint64_t. The second argument
+ is a pointer to that storage.
+
+-ARCH_REQ_XCOMP_PERM
+
+ arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
+
+ ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled
+ feature or a feature set. A feature set can be mapped to a facility, e.g.
+ AMX, and can require one or more XSTATE components to be enabled.
+
+ The feature argument is the number of the highest XSTATE component which
+ is required for a facility to work.
+
+When requesting permission for a feature, the kernel checks the
+availability. The kernel ensures that sigaltstacks in the process's tasks
+are large enough to accommodate the resulting large signal frame. It
+enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent
+sigaltstack(2) calls. If an installed sigaltstack is smaller than the
+resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also,
+sigaltstack(2) results in -ENOMEM if the requested altstack is too small
+for the permitted features.
+
+Permission, when granted, is valid per process. Permissions are inherited
+on fork(2) and cleared on exec(3).
+
+The first use of an instruction related to a dynamically enabled feature is
+trapped by the kernel. The trap handler checks whether the process has
+permission to use the feature. If the process has no permission then the
+kernel sends SIGILL to the application. If the process has permission then
+the handler allocates a larger xstate buffer for the task so the large
+state can be context switched. In the unlikely cases that the allocation
+fails, the kernel sends SIGSEGV.
+
+AMX TILE_DATA enabling example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Below is the example of how userspace applications enable
+TILE_DATA dynamically:
+
+  1. The application first needs to query the kernel for AMX
+     support::
+
+        #include <asm/prctl.h>
+        #include <sys/syscall.h>
+        #include <stdio.h>
+        #include <unistd.h>
+
+        #ifndef ARCH_GET_XCOMP_SUPP
+        #define ARCH_GET_XCOMP_SUPP  0x1021
+        #endif
+
+        #ifndef ARCH_XCOMP_TILECFG
+        #define ARCH_XCOMP_TILECFG   17
+        #endif
+
+        #ifndef ARCH_XCOMP_TILEDATA
+        #define ARCH_XCOMP_TILEDATA  18
+        #endif
+
+        #define MASK_XCOMP_TILE      ((1 << ARCH_XCOMP_TILECFG) | \
+                                      (1 << ARCH_XCOMP_TILEDATA))
+
+        unsigned long features;
+        long rc;
+
+        ...
+
+        rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features);
+
+        if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE)
+            printf("AMX is available.\n");
+
+  2. After that, determining support for AMX, an application must
+     explicitly ask permission to use it::
+
+        #ifndef ARCH_REQ_XCOMP_PERM
+        #define ARCH_REQ_XCOMP_PERM  0x1023
+        #endif
+
+        ...
+
+        rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA);
+
+        if (!rc)
+            printf("AMX is ready for use.\n");
+
+Note this example does not include the sigaltstack preparation.
+
+Dynamic features in signal frames
+---------------------------------
+
+Dynamcally enabled features are not written to the signal frame upon signal
+entry if the feature is in its initial configuration.  This differs from
+non-dynamic features which are always written regardless of their
+configuration.  Signal handlers can examine the XSAVE buffer's XSTATE_BV
+field to determine if a features was written.
+
+Dynamic features for virtual machines
+-------------------------------------
+
+The permission for the guest state component needs to be managed separately
+from the host, as they are exclusive to each other. A coupled of options
+are extended to control the guest permission:
+
+-ARCH_GET_XCOMP_GUEST_PERM
+
+ arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features);
+
+ ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. So it
+ provides the same semantics and functionality but for the guest
+ components.
+
+-ARCH_REQ_XCOMP_GUEST_PERM
+
+ arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr);
+
+ ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM. It has the
+ same semantics for the guest permission. While providing a similar
+ functionality, this comes with a constraint. Permission is frozen when the
+ first VCPU is created. Any attempt to change permission after that point
+ is going to be rejected. So, the permission has to be requested before the
+ first VCPU creation.
+
+Note that some VMMs may have already established a set of supported state
+components. These options are not presumed to support any particular VMM.
diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
new file mode 100644
index 000000000000..45aa9cceb4f1
--- /dev/null
+++ b/Documentation/arch/x86/zero-page.rst
@@ -0,0 +1,47 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+Zero Page
+=========
+The additional fields in struct boot_params as a part of 32-bit boot
+protocol of kernel. These should be filled by bootloader or 16-bit
+real-mode setup code of the kernel. References/settings to it mainly
+are in::
+
+  arch/x86/include/uapi/asm/bootparam.h
+
+===========	=====	=======================	=================================================
+Offset/Size	Proto	Name			Meaning
+
+000/040		ALL	screen_info		Text mode or frame buffer information
+						(struct screen_info)
+040/014		ALL	apm_bios_info		APM BIOS information (struct apm_bios_info)
+058/008		ALL	tboot_addr      	Physical address of tboot shared page
+060/010		ALL	ist_info		Intel SpeedStep (IST) BIOS support information
+						(struct ist_info)
+070/008		ALL	acpi_rsdp_addr		Physical address of ACPI RSDP table
+080/010		ALL	hd0_info		hd0 disk parameter, OBSOLETE!!
+090/010		ALL	hd1_info		hd1 disk parameter, OBSOLETE!!
+0A0/010		ALL	sys_desc_table		System description table (struct sys_desc_table),
+						OBSOLETE!!
+0B0/010		ALL	olpc_ofw_header		OLPC's OpenFirmware CIF and friends
+0C0/004		ALL	ext_ramdisk_image	ramdisk_image high 32bits
+0C4/004		ALL	ext_ramdisk_size	ramdisk_size high 32bits
+0C8/004		ALL	ext_cmd_line_ptr	cmd_line_ptr high 32bits
+13C/004		ALL	cc_blob_address		Physical address of Confidential Computing blob
+140/080		ALL	edid_info		Video mode setup (struct edid_info)
+1C0/020		ALL	efi_info		EFI 32 information (struct efi_info)
+1E0/004		ALL	alt_mem_k		Alternative mem check, in KB
+1E4/004		ALL	scratch			Scratch field for the kernel setup code
+1E8/001		ALL	e820_entries		Number of entries in e820_table (below)
+1E9/001		ALL	eddbuf_entries		Number of entries in eddbuf (below)
+1EA/001		ALL	edd_mbr_sig_buf_entries	Number of entries in edd_mbr_sig_buffer
+						(below)
+1EB/001		ALL     kbd_status      	Numlock is enabled
+1EC/001		ALL     secure_boot		Secure boot is enabled in the firmware
+1EF/001		ALL	sentinel		Used to detect broken bootloaders
+290/040		ALL	edd_mbr_sig_buffer	EDD MBR signatures
+2D0/A00		ALL	e820_table		E820 memory map table
+						(array of struct e820_entry)
+D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
+===========	=====	=======================	=================================================
diff --git a/Documentation/arch/xtensa/atomctl.rst b/Documentation/arch/xtensa/atomctl.rst
new file mode 100644
index 000000000000..1ecbd0ba9a2e
--- /dev/null
+++ b/Documentation/arch/xtensa/atomctl.rst
@@ -0,0 +1,51 @@
+===========================================
+Atomic Operation Control (ATOMCTL) Register
+===========================================
+
+We Have Atomic Operation Control (ATOMCTL) Register.
+This register determines the effect of using a S32C1I instruction
+with various combinations of:
+
+     1. With and without an Coherent Cache Controller which
+        can do Atomic Transactions to the memory internally.
+
+     2. With and without An Intelligent Memory Controller which
+        can do Atomic Transactions itself.
+
+The Core comes up with a default value of for the three types of cache ops::
+
+      0x28: (WB: Internal, WT: Internal, BY:Exception)
+
+On the FPGA Cards we typically simulate an Intelligent Memory controller
+which can implement  RCW transactions. For FPGA cards with an External
+Memory controller we let it to the atomic operations internally while
+doing a Cached (WB) transaction and use the Memory RCW for un-cached
+operations.
+
+For systems without an coherent cache controller, non-MX, we always
+use the memory controllers RCW, thought non-MX controlers likely
+support the Internal Operation.
+
+CUSTOMER-WARNING:
+   Virtually all customers buy their memory controllers from vendors that
+   don't support atomic RCW memory transactions and will likely want to
+   configure this register to not use RCW.
+
+Developers might find using RCW in Bypass mode convenient when testing
+with the cache being bypassed; for example studying cache alias problems.
+
+See Section 4.3.12.4 of ISA; Bits::
+
+                             WB     WT      BY
+                           5   4 | 3   2 | 1   0
+
+=========    ==================      ==================      ===============
+  2 Bit
+  Field
+  Values     WB - Write Back         WT - Write Thru         BY - Bypass
+=========    ==================      ==================      ===============
+    0        Exception               Exception               Exception
+    1        RCW Transaction         RCW Transaction         RCW Transaction
+    2        Internal Operation      Internal Operation      Reserved
+    3        Reserved                Reserved                Reserved
+=========    ==================      ==================      ===============
diff --git a/Documentation/arch/xtensa/booting.rst b/Documentation/arch/xtensa/booting.rst
new file mode 100644
index 000000000000..e1b83707e5b6
--- /dev/null
+++ b/Documentation/arch/xtensa/booting.rst
@@ -0,0 +1,22 @@
+=====================================
+Passing boot parameters to the kernel
+=====================================
+
+Boot parameters are represented as a TLV list in the memory. Please see
+arch/xtensa/include/asm/bootparam.h for definition of the bp_tag structure and
+tag value constants. First entry in the list must have type BP_TAG_FIRST, last
+entry must have type BP_TAG_LAST. The address of the first list entry is
+passed to the kernel in the register a2. The address type depends on MMU type:
+
+- For configurations without MMU, with region protection or with MPU the
+  address must be the physical address.
+- For configurations with region translarion MMU or with MMUv3 and CONFIG_MMU=n
+  the address must be a valid address in the current mapping. The kernel will
+  not change the mapping on its own.
+- For configurations with MMUv2 the address must be a virtual address in the
+  default virtual mapping (0xd0000000..0xffffffff).
+- For configurations with MMUv3 and CONFIG_MMU=y the address may be either a
+  virtual or physical address. In either case it must be within the default
+  virtual mapping. It is considered physical if it is within the range of
+  physical addresses covered by the default KSEG mapping (XCHAL_KSEG_PADDR..
+  XCHAL_KSEG_PADDR + XCHAL_KSEG_SIZE), otherwise it is considered virtual.
diff --git a/Documentation/arch/xtensa/features.rst b/Documentation/arch/xtensa/features.rst
new file mode 100644
index 000000000000..6b92c7bfa19d
--- /dev/null
+++ b/Documentation/arch/xtensa/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features xtensa
diff --git a/Documentation/arch/xtensa/index.rst b/Documentation/arch/xtensa/index.rst
new file mode 100644
index 000000000000..69952446a9be
--- /dev/null
+++ b/Documentation/arch/xtensa/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Xtensa Architecture
+===================
+
+.. toctree::
+   :maxdepth: 1
+
+   atomctl
+   booting
+   mmu
+
+   features
diff --git a/Documentation/arch/xtensa/mmu.rst b/Documentation/arch/xtensa/mmu.rst
new file mode 100644
index 000000000000..450573afa31a
--- /dev/null
+++ b/Documentation/arch/xtensa/mmu.rst
@@ -0,0 +1,198 @@
+=============================
+MMUv3 initialization sequence
+=============================
+
+The code in the initialize_mmu macro sets up MMUv3 memory mapping
+identically to MMUv2 fixed memory mapping. Depending on
+CONFIG_INITIALIZE_XTENSA_MMU_INSIDE_VMLINUX symbol this code is
+located in addresses it was linked for (symbol undefined), or not
+(symbol defined), so it needs to be position-independent.
+
+The code has the following assumptions:
+
+  - This code fragment is run only on an MMU v3.
+  - TLBs are in their reset state.
+  - ITLBCFG and DTLBCFG are zero (reset state).
+  - RASID is 0x04030201 (reset state).
+  - PS.RING is zero (reset state).
+  - LITBASE is zero (reset state, PC-relative literals); required to be PIC.
+
+TLB setup proceeds along the following steps.
+
+  Legend:
+
+    - VA = virtual address (two upper nibbles of it);
+    - PA = physical address (two upper nibbles of it);
+    - pc = physical range that contains this code;
+
+After step 2, we jump to virtual address in the range 0x40000000..0x5fffffff
+or 0x00000000..0x1fffffff, depending on whether the kernel was loaded below
+0x40000000 or above. That address corresponds to next instruction to execute
+in this code. After step 4, we jump to intended (linked) address of this code.
+The scheme below assumes that the kernel is loaded below 0x40000000.
+
+ ====== =====  =====  =====  =====   ====== =====  =====
+ -      Step0  Step1  Step2  Step3          Step4  Step5
+
+   VA      PA     PA     PA     PA     VA      PA     PA
+ ====== =====  =====  =====  =====   ====== =====  =====
+ E0..FF -> E0  -> E0  -> E0          F0..FF -> F0  -> F0
+ C0..DF -> C0  -> C0  -> C0          E0..EF -> F0  -> F0
+ A0..BF -> A0  -> A0  -> A0          D8..DF -> 00  -> 00
+ 80..9F -> 80  -> 80  -> 80          D0..D7 -> 00  -> 00
+ 60..7F -> 60  -> 60  -> 60
+ 40..5F -> 40         -> pc  -> pc   40..5F -> pc
+ 20..3F -> 20  -> 20  -> 20
+ 00..1F -> 00  -> 00  -> 00
+ ====== =====  =====  =====  =====   ====== =====  =====
+
+The default location of IO peripherals is above 0xf0000000. This may be changed
+using a "ranges" property in a device tree simple-bus node. See the Devicetree
+Specification, section 4.5 for details on the syntax and semantics of
+simple-bus nodes. The following limitations apply:
+
+1. Only top level simple-bus nodes are considered
+
+2. Only one (first) simple-bus node is considered
+
+3. Empty "ranges" properties are not supported
+
+4. Only the first triplet in the "ranges" property is considered
+
+5. The parent-bus-address value is rounded down to the nearest 256MB boundary
+
+6. The IO area covers the entire 256MB segment of parent-bus-address; the
+   "ranges" triplet length field is ignored
+
+
+MMUv3 address space layouts.
+============================
+
+Default MMUv2-compatible layout::
+
+                        Symbol                   VADDR       Size
+  +------------------+
+  | Userspace        |                           0x00000000  TASK_SIZE
+  +------------------+                           0x40000000
+  +------------------+
+  | Page table       |  XCHAL_PAGE_TABLE_VADDR   0x80000000  XCHAL_PAGE_TABLE_SIZE
+  +------------------+
+  | KASAN shadow map |  KASAN_SHADOW_START       0x80400000  KASAN_SHADOW_SIZE
+  +------------------+                           0x8e400000
+  +------------------+
+  | VMALLOC area     |  VMALLOC_START            0xc0000000  128MB - 64KB
+  +------------------+  VMALLOC_END
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_1           0xc8000000  DCACHE_WAY_SIZE
+  | remap area 1     |
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_2                       DCACHE_WAY_SIZE
+  | remap area 2     |
+  +------------------+
+  +------------------+
+  | KMAP area        |  PKMAP_BASE                           PTRS_PER_PTE *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  |                  |                                       (4MB * DCACHE_N_COLORS)
+  +------------------+
+  | Atomic KMAP area |  FIXADDR_START                        KM_TYPE_NR *
+  |                  |                                       NR_CPUS *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  +------------------+  FIXADDR_TOP              0xcffff000
+  +------------------+
+  | Cached KSEG      |  XCHAL_KSEG_CACHED_VADDR  0xd0000000  128MB
+  +------------------+
+  | Uncached KSEG    |  XCHAL_KSEG_BYPASS_VADDR  0xd8000000  128MB
+  +------------------+
+  | Cached KIO       |  XCHAL_KIO_CACHED_VADDR   0xe0000000  256MB
+  +------------------+
+  | Uncached KIO     |  XCHAL_KIO_BYPASS_VADDR   0xf0000000  256MB
+  +------------------+
+
+
+256MB cached + 256MB uncached layout::
+
+                        Symbol                   VADDR       Size
+  +------------------+
+  | Userspace        |                           0x00000000  TASK_SIZE
+  +------------------+                           0x40000000
+  +------------------+
+  | Page table       |  XCHAL_PAGE_TABLE_VADDR   0x80000000  XCHAL_PAGE_TABLE_SIZE
+  +------------------+
+  | KASAN shadow map |  KASAN_SHADOW_START       0x80400000  KASAN_SHADOW_SIZE
+  +------------------+                           0x8e400000
+  +------------------+
+  | VMALLOC area     |  VMALLOC_START            0xa0000000  128MB - 64KB
+  +------------------+  VMALLOC_END
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_1           0xa8000000  DCACHE_WAY_SIZE
+  | remap area 1     |
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_2                       DCACHE_WAY_SIZE
+  | remap area 2     |
+  +------------------+
+  +------------------+
+  | KMAP area        |  PKMAP_BASE                           PTRS_PER_PTE *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  |                  |                                       (4MB * DCACHE_N_COLORS)
+  +------------------+
+  | Atomic KMAP area |  FIXADDR_START                        KM_TYPE_NR *
+  |                  |                                       NR_CPUS *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  +------------------+  FIXADDR_TOP              0xaffff000
+  +------------------+
+  | Cached KSEG      |  XCHAL_KSEG_CACHED_VADDR  0xb0000000  256MB
+  +------------------+
+  | Uncached KSEG    |  XCHAL_KSEG_BYPASS_VADDR  0xc0000000  256MB
+  +------------------+
+  +------------------+
+  | Cached KIO       |  XCHAL_KIO_CACHED_VADDR   0xe0000000  256MB
+  +------------------+
+  | Uncached KIO     |  XCHAL_KIO_BYPASS_VADDR   0xf0000000  256MB
+  +------------------+
+
+
+512MB cached + 512MB uncached layout::
+
+                        Symbol                   VADDR       Size
+  +------------------+
+  | Userspace        |                           0x00000000  TASK_SIZE
+  +------------------+                           0x40000000
+  +------------------+
+  | Page table       |  XCHAL_PAGE_TABLE_VADDR   0x80000000  XCHAL_PAGE_TABLE_SIZE
+  +------------------+
+  | KASAN shadow map |  KASAN_SHADOW_START       0x80400000  KASAN_SHADOW_SIZE
+  +------------------+                           0x8e400000
+  +------------------+
+  | VMALLOC area     |  VMALLOC_START            0x90000000  128MB - 64KB
+  +------------------+  VMALLOC_END
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_1           0x98000000  DCACHE_WAY_SIZE
+  | remap area 1     |
+  +------------------+
+  | Cache aliasing   |  TLBTEMP_BASE_2                       DCACHE_WAY_SIZE
+  | remap area 2     |
+  +------------------+
+  +------------------+
+  | KMAP area        |  PKMAP_BASE                           PTRS_PER_PTE *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  |                  |                                       (4MB * DCACHE_N_COLORS)
+  +------------------+
+  | Atomic KMAP area |  FIXADDR_START                        KM_TYPE_NR *
+  |                  |                                       NR_CPUS *
+  |                  |                                       DCACHE_N_COLORS *
+  |                  |                                       PAGE_SIZE
+  +------------------+  FIXADDR_TOP              0x9ffff000
+  +------------------+
+  | Cached KSEG      |  XCHAL_KSEG_CACHED_VADDR  0xa0000000  512MB
+  +------------------+
+  | Uncached KSEG    |  XCHAL_KSEG_BYPASS_VADDR  0xc0000000  512MB
+  +------------------+
+  | Cached KIO       |  XCHAL_KIO_CACHED_VADDR   0xe0000000  256MB
+  +------------------+
+  | Uncached KIO     |  XCHAL_KIO_BYPASS_VADDR   0xf0000000  256MB
+  +------------------+