diff options
Diffstat (limited to 'Documentation/arch')
89 files changed, 15617 insertions, 0 deletions
diff --git a/Documentation/arch/arc/arc.rst b/Documentation/arch/arc/arc.rst new file mode 100644 index 000000000000..6c4d978f3f4e --- /dev/null +++ b/Documentation/arch/arc/arc.rst @@ -0,0 +1,85 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Linux kernel for ARC processors +******************************* + +Other sources of information +############################ + +Below are some resources where more information can be found on +ARC processors and relevant open source projects. + +- `<https://embarc.org>`_ - Community portal for open source on ARC. + Good place to start to find relevant FOSS projects, toolchain releases, + news items and more. + +- `<https://github.com/foss-for-synopsys-dwc-arc-processors>`_ - + Home for all development activities regarding open source projects for + ARC processors. Some of the projects are forks of various upstream projects, + where "work in progress" is hosted prior to submission to upstream projects. + Other projects are developed by Synopsys and made available to community + as open source for use on ARC Processors. + +- `Official Synopsys ARC Processors website + <https://www.synopsys.com/designware-ip/processor-solutions.html>`_ - + location, with access to some IP documentation (`Programmer's Reference + Manual, AKA PRM for ARC HS processors + <https://www.synopsys.com/dw/doc.php/ds/cc/programmers-reference-manual-ARC-HS.pdf>`_) + and free versions of some commercial tools (`Free nSIM + <https://www.synopsys.com/cgi-bin/dwarcnsim/req1.cgi>`_ and + `MetaWare Light Edition <https://www.synopsys.com/cgi-bin/arcmwtk_lite/reg1.cgi>`_). + Please note though, registration is required to access both the documentation and + the tools. + +Important note on ARC processors configurability +################################################ + +ARC processors are highly configurable and several configurable options +are supported in Linux. Some options are transparent to software +(i.e cache geometries, some can be detected at runtime and configured +and used accordingly, while some need to be explicitly selected or configured +in the kernel's configuration utility (AKA "make menuconfig"). + +However not all configurable options are supported when an ARC processor +is to run Linux. SoC design teams should refer to "Appendix E: +Configuration for ARC Linux" in the ARC HS Databook for configurability +guidelines. + +Following these guidelines and selecting valid configuration options +up front is critical to help prevent any unwanted issues during +SoC bringup and software development in general. + +Building the Linux kernel for ARC processors +############################################ + +The process of kernel building for ARC processors is the same as for any other +architecture and could be done in 2 ways: + +- Cross-compilation: process of compiling for ARC targets on a development + host with a different processor architecture (generally x86_64/amd64). +- Native compilation: process of compiling for ARC on a ARC platform + (hardware board or a simulator like QEMU) with complete development environment + (GNU toolchain, dtc, make etc) installed on the platform. + +In both cases, up-to-date GNU toolchain for ARC for the host is needed. +Synopsys offers prebuilt toolchain releases which can be used for this purpose, +available from: + +- Synopsys GNU toolchain releases: + `<https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/releases>`_ + +- Linux kernel compilers collection: + `<https://mirrors.edge.kernel.org/pub/tools/crosstool>`_ + +- Bootlin's toolchain collection: `<https://toolchains.bootlin.com>`_ + +Once the toolchain is installed in the system, make sure its "bin" folder +is added in your ``PATH`` environment variable. Then set ``ARCH=arc`` & +``CROSS_COMPILE=arc-linux`` (or whatever matches installed ARC toolchain prefix) +and then as usual ``make defconfig && make``. + +This will produce "vmlinux" file in the root of the kernel source tree +usable for loading on the target system via JTAG. +If you need to get an image usable with U-Boot bootloader, +type ``make uImage`` and ``uImage`` will be produced in ``arch/arc/boot`` +folder. diff --git a/Documentation/arch/arc/features.rst b/Documentation/arch/arc/features.rst new file mode 100644 index 000000000000..b793583d688a --- /dev/null +++ b/Documentation/arch/arc/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features arc diff --git a/Documentation/arch/arc/index.rst b/Documentation/arch/arc/index.rst new file mode 100644 index 000000000000..7b098d4a5e3e --- /dev/null +++ b/Documentation/arch/arc/index.rst @@ -0,0 +1,17 @@ +=================== +ARC architecture +=================== + +.. toctree:: + :maxdepth: 1 + + arc + + features + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/arch/ia64/aliasing.rst b/Documentation/arch/ia64/aliasing.rst new file mode 100644 index 000000000000..36a1e1d4842b --- /dev/null +++ b/Documentation/arch/ia64/aliasing.rst @@ -0,0 +1,246 @@ +================================== +Memory Attribute Aliasing on IA-64 +================================== + +Bjorn Helgaas <bjorn.helgaas@hp.com> + +May 4, 2006 + + +Memory Attributes +================= + + Itanium supports several attributes for virtual memory references. + The attribute is part of the virtual translation, i.e., it is + contained in the TLB entry. The ones of most interest to the Linux + kernel are: + + == ====================== + WB Write-back (cacheable) + UC Uncacheable + WC Write-coalescing + == ====================== + + System memory typically uses the WB attribute. The UC attribute is + used for memory-mapped I/O devices. The WC attribute is uncacheable + like UC is, but writes may be delayed and combined to increase + performance for things like frame buffers. + + The Itanium architecture requires that we avoid accessing the same + page with both a cacheable mapping and an uncacheable mapping[1]. + + The design of the chipset determines which attributes are supported + on which regions of the address space. For example, some chipsets + support either WB or UC access to main memory, while others support + only WB access. + +Memory Map +========== + + Platform firmware describes the physical memory map and the + supported attributes for each region. At boot-time, the kernel uses + the EFI GetMemoryMap() interface. ACPI can also describe memory + devices and the attributes they support, but Linux/ia64 currently + doesn't use this information. + + The kernel uses the efi_memmap table returned from GetMemoryMap() to + learn the attributes supported by each region of physical address + space. Unfortunately, this table does not completely describe the + address space because some machines omit some or all of the MMIO + regions from the map. + + The kernel maintains another table, kern_memmap, which describes the + memory Linux is actually using and the attribute for each region. + This contains only system memory; it does not contain MMIO space. + + The kern_memmap table typically contains only a subset of the system + memory described by the efi_memmap. Linux/ia64 can't use all memory + in the system because of constraints imposed by the identity mapping + scheme. + + The efi_memmap table is preserved unmodified because the original + boot-time information is required for kexec. + +Kernel Identity Mappings +======================== + + Linux/ia64 identity mappings are done with large pages, currently + either 16MB or 64MB, referred to as "granules." Cacheable mappings + are speculative[2], so the processor can read any location in the + page at any time, independent of the programmer's intentions. This + means that to avoid attribute aliasing, Linux can create a cacheable + identity mapping only when the entire granule supports cacheable + access. + + Therefore, kern_memmap contains only full granule-sized regions that + can referenced safely by an identity mapping. + + Uncacheable mappings are not speculative, so the processor will + generate UC accesses only to locations explicitly referenced by + software. This allows UC identity mappings to cover granules that + are only partially populated, or populated with a combination of UC + and WB regions. + +User Mappings +============= + + User mappings are typically done with 16K or 64K pages. The smaller + page size allows more flexibility because only 16K or 64K has to be + homogeneous with respect to memory attributes. + +Potential Attribute Aliasing Cases +================================== + + There are several ways the kernel creates new mappings: + +mmap of /dev/mem +---------------- + + This uses remap_pfn_range(), which creates user mappings. These + mappings may be either WB or UC. If the region being mapped + happens to be in kern_memmap, meaning that it may also be mapped + by a kernel identity mapping, the user mapping must use the same + attribute as the kernel mapping. + + If the region is not in kern_memmap, the user mapping should use + an attribute reported as being supported in the EFI memory map. + + Since the EFI memory map does not describe MMIO on some + machines, this should use an uncacheable mapping as a fallback. + +mmap of /sys/class/pci_bus/.../legacy_mem +----------------------------------------- + + This is very similar to mmap of /dev/mem, except that legacy_mem + only allows mmap of the one megabyte "legacy MMIO" area for a + specific PCI bus. Typically this is the first megabyte of + physical address space, but it may be different on machines with + several VGA devices. + + "X" uses this to access VGA frame buffers. Using legacy_mem + rather than /dev/mem allows multiple instances of X to talk to + different VGA cards. + + The /dev/mem mmap constraints apply. + +mmap of /proc/bus/pci/.../??.? +------------------------------ + + This is an MMIO mmap of PCI functions, which additionally may or + may not be requested as using the WC attribute. + + If WC is requested, and the region in kern_memmap is either WC + or UC, and the EFI memory map designates the region as WC, then + the WC mapping is allowed. + + Otherwise, the user mapping must use the same attribute as the + kernel mapping. + +read/write of /dev/mem +---------------------- + + This uses copy_from_user(), which implicitly uses a kernel + identity mapping. This is obviously safe for things in + kern_memmap. + + There may be corner cases of things that are not in kern_memmap, + but could be accessed this way. For example, registers in MMIO + space are not in kern_memmap, but could be accessed with a UC + mapping. This would not cause attribute aliasing. But + registers typically can be accessed only with four-byte or + eight-byte accesses, and the copy_from_user() path doesn't allow + any control over the access size, so this would be dangerous. + +ioremap() +--------- + + This returns a mapping for use inside the kernel. + + If the region is in kern_memmap, we should use the attribute + specified there. + + If the EFI memory map reports that the entire granule supports + WB, we should use that (granules that are partially reserved + or occupied by firmware do not appear in kern_memmap). + + If the granule contains non-WB memory, but we can cover the + region safely with kernel page table mappings, we can use + ioremap_page_range() as most other architectures do. + + Failing all of the above, we have to fall back to a UC mapping. + +Past Problem Cases +================== + +mmap of various MMIO regions from /dev/mem by "X" on Intel platforms +-------------------------------------------------------------------- + + The EFI memory map may not report these MMIO regions. + + These must be allowed so that X will work. This means that + when the EFI memory map is incomplete, every /dev/mem mmap must + succeed. It may create either WB or UC user mappings, depending + on whether the region is in kern_memmap or the EFI memory map. + +mmap of 0x0-0x9FFFF /dev/mem by "hwinfo" on HP sx1000 with VGA enabled +---------------------------------------------------------------------- + + The EFI memory map reports the following attributes: + + =============== ======= ================== + 0x00000-0x9FFFF WB only + 0xA0000-0xBFFFF UC only (VGA frame buffer) + 0xC0000-0xFFFFF WB only + =============== ======= ================== + + This mmap is done with user pages, not kernel identity mappings, + so it is safe to use WB mappings. + + The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000, + which uses a granule-sized UC mapping. This granule will cover some + WB-only memory, but since UC is non-speculative, the processor will + never generate an uncacheable reference to the WB-only areas unless + the driver explicitly touches them. + +mmap of 0x0-0xFFFFF legacy_mem by "X" +------------------------------------- + + If the EFI memory map reports that the entire range supports the + same attributes, we can allow the mmap (and we will prefer WB if + supported, as is the case with HP sx[12]000 machines with VGA + disabled). + + If EFI reports the range as partly WB and partly UC (as on sx[12]000 + machines with VGA enabled), we must fail the mmap because there's no + safe attribute to use. + + If EFI reports some of the range but not all (as on Intel firmware + that doesn't report the VGA frame buffer at all), we should fail the + mmap and force the user to map just the specific region of interest. + +mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled +------------------------------------------------------------------------ + + The EFI memory map reports the following attributes:: + + 0x00000-0xFFFFF WB only (no VGA MMIO hole) + + This is a special case of the previous case, and the mmap should + fail for the same reason as above. + +read of /sys/devices/.../rom +---------------------------- + + For VGA devices, this may cause an ioremap() of 0xC0000. This + used to be done with a UC mapping, because the VGA frame buffer + at 0xA0000 prevents use of a WB granule. The UC mapping causes + an MCA on HP sx[12]000 chipsets. + + We should use WB page table mappings to avoid covering the VGA + frame buffer. + +Notes +===== + + [1] SDM rev 2.2, vol 2, sec 4.4.1. + [2] SDM rev 2.2, vol 2, sec 4.4.6. diff --git a/Documentation/arch/ia64/efirtc.rst b/Documentation/arch/ia64/efirtc.rst new file mode 100644 index 000000000000..fd8328408301 --- /dev/null +++ b/Documentation/arch/ia64/efirtc.rst @@ -0,0 +1,144 @@ +========================== +EFI Real Time Clock driver +========================== + +S. Eranian <eranian@hpl.hp.com> + +March 2000 + +1. Introduction +=============== + +This document describes the efirtc.c driver has provided for +the IA-64 platform. + +The purpose of this driver is to supply an API for kernel and user applications +to get access to the Time Service offered by EFI version 0.92. + +EFI provides 4 calls one can make once the OS is booted: GetTime(), +SetTime(), GetWakeupTime(), SetWakeupTime() which are all supported by this +driver. We describe those calls as well the design of the driver in the +following sections. + +2. Design Decisions +=================== + +The original ideas was to provide a very simple driver to get access to, +at first, the time of day service. This is required in order to access, in a +portable way, the CMOS clock. A program like /sbin/hwclock uses such a clock +to initialize the system view of the time during boot. + +Because we wanted to minimize the impact on existing user-level apps using +the CMOS clock, we decided to expose an API that was very similar to the one +used today with the legacy RTC driver (driver/char/rtc.c). However, because +EFI provides a simpler services, not all ioctl() are available. Also +new ioctl()s have been introduced for things that EFI provides but not the +legacy. + +EFI uses a slightly different way of representing the time, noticeably +the reference date is different. Year is the using the full 4-digit format. +The Epoch is January 1st 1998. For backward compatibility reasons we don't +expose this new way of representing time. Instead we use something very +similar to the struct tm, i.e. struct rtc_time, as used by hwclock. +One of the reasons for doing it this way is to allow for EFI to still evolve +without necessarily impacting any of the user applications. The decoupling +enables flexibility and permits writing wrapper code is ncase things change. + +The driver exposes two interfaces, one via the device file and a set of +ioctl()s. The other is read-only via the /proc filesystem. + +As of today we don't offer a /proc/sys interface. + +To allow for a uniform interface between the legacy RTC and EFI time service, +we have created the include/linux/rtc.h header file to contain only the +"public" API of the two drivers. The specifics of the legacy RTC are still +in include/linux/mc146818rtc.h. + + +3. Time of day service +====================== + +The part of the driver gives access to the time of day service of EFI. +Two ioctl()s, compatible with the legacy RTC calls: + + Read the CMOS clock:: + + ioctl(d, RTC_RD_TIME, &rtc); + + Write the CMOS clock:: + + ioctl(d, RTC_SET_TIME, &rtc); + +The rtc is a pointer to a data structure defined in rtc.h which is close +to a struct tm:: + + struct rtc_time { + int tm_sec; + int tm_min; + int tm_hour; + int tm_mday; + int tm_mon; + int tm_year; + int tm_wday; + int tm_yday; + int tm_isdst; + }; + +The driver takes care of converting back an forth between the EFI time and +this format. + +Those two ioctl()s can be exercised with the hwclock command: + +For reading:: + + # /sbin/hwclock --show + Mon Mar 6 15:32:32 2000 -0.910248 seconds + +For setting:: + + # /sbin/hwclock --systohc + +Root privileges are required to be able to set the time of day. + +4. Wakeup Alarm service +======================= + +EFI provides an API by which one can program when a machine should wakeup, +i.e. reboot. This is very different from the alarm provided by the legacy +RTC which is some kind of interval timer alarm. For this reason we don't use +the same ioctl()s to get access to the service. Instead we have +introduced 2 news ioctl()s to the interface of an RTC. + +We have added 2 new ioctl()s that are specific to the EFI driver: + + Read the current state of the alarm:: + + ioctl(d, RTC_WKALM_RD, &wkt) + + Set the alarm or change its status:: + + ioctl(d, RTC_WKALM_SET, &wkt) + +The wkt structure encapsulates a struct rtc_time + 2 extra fields to get +status information:: + + struct rtc_wkalrm { + + unsigned char enabled; /* =1 if alarm is enabled */ + unsigned char pending; /* =1 if alarm is pending */ + + struct rtc_time time; + } + +As of today, none of the existing user-level apps supports this feature. +However writing such a program should be hard by simply using those two +ioctl(). + +Root privileges are required to be able to set the alarm. + +5. References +============= + +Checkout the following Web site for more information on EFI: + +http://developer.intel.com/technology/efi/ diff --git a/Documentation/arch/ia64/err_inject.rst b/Documentation/arch/ia64/err_inject.rst new file mode 100644 index 000000000000..900f71e93a29 --- /dev/null +++ b/Documentation/arch/ia64/err_inject.rst @@ -0,0 +1,1067 @@ +======================================== +IPF Machine Check (MC) error inject tool +======================================== + +IPF Machine Check (MC) error inject tool is used to inject MC +errors from Linux. The tool is a test bed for IPF MC work flow including +hardware correctable error handling, OS recoverable error handling, MC +event logging, etc. + +The tool includes two parts: a kernel driver and a user application +sample. The driver provides interface to PAL to inject error +and query error injection capabilities. The driver code is in +arch/ia64/kernel/err_inject.c. The application sample (shown below) +provides a combination of various errors and calls the driver's interface +(sysfs interface) to inject errors or query error injection capabilities. + +The tool can be used to test Intel IPF machine MC handling capabilities. +It's especially useful for people who can not access hardware MC injection +tool to inject error. It's also very useful to integrate with other +software test suits to do stressful testing on IPF. + +Below is a sample application as part of the whole tool. The sample +can be used as a working test tool. Or it can be expanded to include +more features. It also can be a integrated into a library or other user +application to have more thorough test. + +The sample application takes err.conf as error configuration input. GCC +compiles the code. After you install err_inject driver, you can run +this sample application to inject errors. + +Errata: Itanium 2 Processors Specification Update lists some errata against +the pal_mc_error_inject PAL procedure. The following err.conf has been tested +on latest Montecito PAL. + +err.conf:: + + #This is configuration file for err_inject_tool. + #The format of the each line is: + #cpu, loop, interval, err_type_info, err_struct_info, err_data_buffer + #where + # cpu: logical cpu number the error will be inject in. + # loop: times the error will be injected. + # interval: In second. every so often one error is injected. + # err_type_info, err_struct_info: PAL parameters. + # + #Note: All values are hex w/o or w/ 0x prefix. + + + #On cpu2, inject only total 0x10 errors, interval 5 seconds + #corrected, data cache, hier-2, physical addr(assigned by tool code). + #working on Montecito latest PAL. + 2, 10, 5, 4101, 95 + + #On cpu4, inject and consume total 0x10 errors, interval 5 seconds + #corrected, data cache, hier-2, physical addr(assigned by tool code). + #working on Montecito latest PAL. + 4, 10, 5, 4109, 95 + + #On cpu15, inject and consume total 0x10 errors, interval 5 seconds + #recoverable, DTR0, hier-2. + #working on Montecito latest PAL. + 0xf, 0x10, 5, 4249, 15 + +The sample application source code: + +err_injection_tool.c:: + + /* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for more + * details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Copyright (C) 2006 Intel Co + * Fenghua Yu <fenghua.yu@intel.com> + * + */ + #include <sys/types.h> + #include <sys/stat.h> + #include <fcntl.h> + #include <stdio.h> + #include <sched.h> + #include <unistd.h> + #include <stdlib.h> + #include <stdarg.h> + #include <string.h> + #include <errno.h> + #include <time.h> + #include <sys/ipc.h> + #include <sys/sem.h> + #include <sys/wait.h> + #include <sys/mman.h> + #include <sys/shm.h> + + #define MAX_FN_SIZE 256 + #define MAX_BUF_SIZE 256 + #define DATA_BUF_SIZE 256 + #define NR_CPUS 512 + #define MAX_TASK_NUM 2048 + #define MIN_INTERVAL 5 // seconds + #define ERR_DATA_BUFFER_SIZE 3 // Three 8-byte. + #define PARA_FIELD_NUM 5 + #define MASK_SIZE (NR_CPUS/64) + #define PATH_FORMAT "/sys/devices/system/cpu/cpu%d/err_inject/" + + int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask); + + int verbose; + #define vbprintf if (verbose) printf + + int log_info(int cpu, const char *fmt, ...) + { + FILE *log; + char fn[MAX_FN_SIZE]; + char buf[MAX_BUF_SIZE]; + va_list args; + + sprintf(fn, "%d.log", cpu); + log=fopen(fn, "a+"); + if (log==NULL) { + perror("Error open:"); + return -1; + } + + va_start(args, fmt); + vprintf(fmt, args); + memset(buf, 0, MAX_BUF_SIZE); + vsprintf(buf, fmt, args); + va_end(args); + + fwrite(buf, sizeof(buf), 1, log); + fclose(log); + + return 0; + } + + typedef unsigned long u64; + typedef unsigned int u32; + + typedef union err_type_info_u { + struct { + u64 mode : 3, /* 0-2 */ + err_inj : 3, /* 3-5 */ + err_sev : 2, /* 6-7 */ + err_struct : 5, /* 8-12 */ + struct_hier : 3, /* 13-15 */ + reserved : 48; /* 16-63 */ + } err_type_info_u; + u64 err_type_info; + } err_type_info_t; + + typedef union err_struct_info_u { + struct { + u64 siv : 1, /* 0 */ + c_t : 2, /* 1-2 */ + cl_p : 3, /* 3-5 */ + cl_id : 3, /* 6-8 */ + cl_dp : 1, /* 9 */ + reserved1 : 22, /* 10-31 */ + tiv : 1, /* 32 */ + trigger : 4, /* 33-36 */ + trigger_pl : 3, /* 37-39 */ + reserved2 : 24; /* 40-63 */ + } err_struct_info_cache; + struct { + u64 siv : 1, /* 0 */ + tt : 2, /* 1-2 */ + tc_tr : 2, /* 3-4 */ + tr_slot : 8, /* 5-12 */ + reserved1 : 19, /* 13-31 */ + tiv : 1, /* 32 */ + trigger : 4, /* 33-36 */ + trigger_pl : 3, /* 37-39 */ + reserved2 : 24; /* 40-63 */ + } err_struct_info_tlb; + struct { + u64 siv : 1, /* 0 */ + regfile_id : 4, /* 1-4 */ + reg_num : 7, /* 5-11 */ + reserved1 : 20, /* 12-31 */ + tiv : 1, /* 32 */ + trigger : 4, /* 33-36 */ + trigger_pl : 3, /* 37-39 */ + reserved2 : 24; /* 40-63 */ + } err_struct_info_register; + struct { + u64 reserved; + } err_struct_info_bus_processor_interconnect; + u64 err_struct_info; + } err_struct_info_t; + + typedef union err_data_buffer_u { + struct { + u64 trigger_addr; /* 0-63 */ + u64 inj_addr; /* 64-127 */ + u64 way : 5, /* 128-132 */ + index : 20, /* 133-152 */ + : 39; /* 153-191 */ + } err_data_buffer_cache; + struct { + u64 trigger_addr; /* 0-63 */ + u64 inj_addr; /* 64-127 */ + u64 way : 5, /* 128-132 */ + index : 20, /* 133-152 */ + reserved : 39; /* 153-191 */ + } err_data_buffer_tlb; + struct { + u64 trigger_addr; /* 0-63 */ + } err_data_buffer_register; + struct { + u64 reserved; /* 0-63 */ + } err_data_buffer_bus_processor_interconnect; + u64 err_data_buffer[ERR_DATA_BUFFER_SIZE]; + } err_data_buffer_t; + + typedef union capabilities_u { + struct { + u64 i : 1, + d : 1, + rv : 1, + tag : 1, + data : 1, + mesi : 1, + dp : 1, + reserved1 : 3, + pa : 1, + va : 1, + wi : 1, + reserved2 : 20, + trigger : 1, + trigger_pl : 1, + reserved3 : 30; + } capabilities_cache; + struct { + u64 d : 1, + i : 1, + rv : 1, + tc : 1, + tr : 1, + reserved1 : 27, + trigger : 1, + trigger_pl : 1, + reserved2 : 30; + } capabilities_tlb; + struct { + u64 gr_b0 : 1, + gr_b1 : 1, + fr : 1, + br : 1, + pr : 1, + ar : 1, + cr : 1, + rr : 1, + pkr : 1, + dbr : 1, + ibr : 1, + pmc : 1, + pmd : 1, + reserved1 : 3, + regnum : 1, + reserved2 : 15, + trigger : 1, + trigger_pl : 1, + reserved3 : 30; + } capabilities_register; + struct { + u64 reserved; + } capabilities_bus_processor_interconnect; + } capabilities_t; + + typedef struct resources_s { + u64 ibr0 : 1, + ibr2 : 1, + ibr4 : 1, + ibr6 : 1, + dbr0 : 1, + dbr2 : 1, + dbr4 : 1, + dbr6 : 1, + reserved : 48; + } resources_t; + + + long get_page_size(void) + { + long page_size=sysconf(_SC_PAGESIZE); + return page_size; + } + + #define PAGE_SIZE (get_page_size()==-1?0x4000:get_page_size()) + #define SHM_SIZE (2*PAGE_SIZE*NR_CPUS) + #define SHM_VA 0x2000000100000000 + + int shmid; + void *shmaddr; + + int create_shm(void) + { + key_t key; + char fn[MAX_FN_SIZE]; + + /* cpu0 is always existing */ + sprintf(fn, PATH_FORMAT, 0); + if ((key = ftok(fn, 's')) == -1) { + perror("ftok"); + return -1; + } + + shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT); + if (shmid == -1) { + if (errno==EEXIST) { + shmid = shmget(key, SHM_SIZE, 0); + if (shmid == -1) { + perror("shmget"); + return -1; + } + } + else { + perror("shmget"); + return -1; + } + } + vbprintf("shmid=%d", shmid); + + /* connect to the segment: */ + shmaddr = shmat(shmid, (void *)SHM_VA, 0); + if (shmaddr == (void*)-1) { + perror("shmat"); + return -1; + } + + memset(shmaddr, 0, SHM_SIZE); + mlock(shmaddr, SHM_SIZE); + + return 0; + } + + int free_shm() + { + munlock(shmaddr, SHM_SIZE); + shmdt(shmaddr); + semctl(shmid, 0, IPC_RMID); + + return 0; + } + + #ifdef _SEM_SEMUN_UNDEFINED + union semun + { + int val; + struct semid_ds *buf; + unsigned short int *array; + struct seminfo *__buf; + }; + #endif + + u32 mode=1; /* 1: physical mode; 2: virtual mode. */ + int one_lock=1; + key_t key[NR_CPUS]; + int semid[NR_CPUS]; + + int create_sem(int cpu) + { + union semun arg; + char fn[MAX_FN_SIZE]; + int sid; + + sprintf(fn, PATH_FORMAT, cpu); + sprintf(fn, "%s/%s", fn, "err_type_info"); + if ((key[cpu] = ftok(fn, 'e')) == -1) { + perror("ftok"); + return -1; + } + + if (semid[cpu]!=0) + return 0; + + /* clear old semaphore */ + if ((sid = semget(key[cpu], 1, 0)) != -1) + semctl(sid, 0, IPC_RMID); + + /* get one semaphore */ + if ((semid[cpu] = semget(key[cpu], 1, IPC_CREAT | IPC_EXCL)) == -1) { + perror("semget"); + printf("Please remove semaphore with key=0x%lx, then run the tool.\n", + (u64)key[cpu]); + return -1; + } + + vbprintf("semid[%d]=0x%lx, key[%d]=%lx\n",cpu,(u64)semid[cpu],cpu, + (u64)key[cpu]); + /* initialize the semaphore to 1: */ + arg.val = 1; + if (semctl(semid[cpu], 0, SETVAL, arg) == -1) { + perror("semctl"); + return -1; + } + + return 0; + } + + static int lock(int cpu) + { + struct sembuf lock; + + lock.sem_num = cpu; + lock.sem_op = 1; + semop(semid[cpu], &lock, 1); + + return 0; + } + + static int unlock(int cpu) + { + struct sembuf unlock; + + unlock.sem_num = cpu; + unlock.sem_op = -1; + semop(semid[cpu], &unlock, 1); + + return 0; + } + + void free_sem(int cpu) + { + semctl(semid[cpu], 0, IPC_RMID); + } + + int wr_multi(char *fn, unsigned long *data, int size) + { + int fd; + char buf[MAX_BUF_SIZE]; + int ret; + + if (size==1) + sprintf(buf, "%lx", *data); + else if (size==3) + sprintf(buf, "%lx,%lx,%lx", data[0], data[1], data[2]); + else { + fprintf(stderr,"write to file with wrong size!\n"); + return -1; + } + + fd=open(fn, O_RDWR); + if (!fd) { + perror("Error:"); + return -1; + } + ret=write(fd, buf, sizeof(buf)); + close(fd); + return ret; + } + + int wr(char *fn, unsigned long data) + { + return wr_multi(fn, &data, 1); + } + + int rd(char *fn, unsigned long *data) + { + int fd; + char buf[MAX_BUF_SIZE]; + + fd=open(fn, O_RDONLY); + if (fd<0) { + perror("Error:"); + return -1; + } + read(fd, buf, MAX_BUF_SIZE); + *data=strtoul(buf, NULL, 16); + close(fd); + return 0; + } + + int rd_status(char *path, int *status) + { + char fn[MAX_FN_SIZE]; + sprintf(fn, "%s/status", path); + if (rd(fn, (u64*)status)<0) { + perror("status reading error.\n"); + return -1; + } + + return 0; + } + + int rd_capabilities(char *path, u64 *capabilities) + { + char fn[MAX_FN_SIZE]; + sprintf(fn, "%s/capabilities", path); + if (rd(fn, capabilities)<0) { + perror("capabilities reading error.\n"); + return -1; + } + + return 0; + } + + int rd_all(char *path) + { + unsigned long err_type_info, err_struct_info, err_data_buffer; + int status; + unsigned long capabilities, resources; + char fn[MAX_FN_SIZE]; + + sprintf(fn, "%s/err_type_info", path); + if (rd(fn, &err_type_info)<0) { + perror("err_type_info reading error.\n"); + return -1; + } + printf("err_type_info=%lx\n", err_type_info); + + sprintf(fn, "%s/err_struct_info", path); + if (rd(fn, &err_struct_info)<0) { + perror("err_struct_info reading error.\n"); + return -1; + } + printf("err_struct_info=%lx\n", err_struct_info); + + sprintf(fn, "%s/err_data_buffer", path); + if (rd(fn, &err_data_buffer)<0) { + perror("err_data_buffer reading error.\n"); + return -1; + } + printf("err_data_buffer=%lx\n", err_data_buffer); + + sprintf(fn, "%s/status", path); + if (rd("status", (u64*)&status)<0) { + perror("status reading error.\n"); + return -1; + } + printf("status=%d\n", status); + + sprintf(fn, "%s/capabilities", path); + if (rd(fn,&capabilities)<0) { + perror("capabilities reading error.\n"); + return -1; + } + printf("capabilities=%lx\n", capabilities); + + sprintf(fn, "%s/resources", path); + if (rd(fn, &resources)<0) { + perror("resources reading error.\n"); + return -1; + } + printf("resources=%lx\n", resources); + + return 0; + } + + int query_capabilities(char *path, err_type_info_t err_type_info, + u64 *capabilities) + { + char fn[MAX_FN_SIZE]; + err_struct_info_t err_struct_info; + err_data_buffer_t err_data_buffer; + + err_struct_info.err_struct_info=0; + memset(err_data_buffer.err_data_buffer, -1, ERR_DATA_BUFFER_SIZE*8); + + sprintf(fn, "%s/err_type_info", path); + wr(fn, err_type_info.err_type_info); + sprintf(fn, "%s/err_struct_info", path); + wr(fn, 0x0); + sprintf(fn, "%s/err_data_buffer", path); + wr_multi(fn, err_data_buffer.err_data_buffer, ERR_DATA_BUFFER_SIZE); + + // Fire pal_mc_error_inject procedure. + sprintf(fn, "%s/call_start", path); + wr(fn, mode); + + if (rd_capabilities(path, capabilities)<0) + return -1; + + return 0; + } + + int query_all_capabilities() + { + int status; + err_type_info_t err_type_info; + int err_sev, err_struct, struct_hier; + int cap=0; + u64 capabilities; + char path[MAX_FN_SIZE]; + + err_type_info.err_type_info=0; // Initial + err_type_info.err_type_info_u.mode=0; // Query mode; + err_type_info.err_type_info_u.err_inj=0; + + printf("All capabilities implemented in pal_mc_error_inject:\n"); + sprintf(path, PATH_FORMAT ,0); + for (err_sev=0;err_sev<3;err_sev++) + for (err_struct=0;err_struct<5;err_struct++) + for (struct_hier=0;struct_hier<5;struct_hier++) + { + status=-1; + capabilities=0; + err_type_info.err_type_info_u.err_sev=err_sev; + err_type_info.err_type_info_u.err_struct=err_struct; + err_type_info.err_type_info_u.struct_hier=struct_hier; + + if (query_capabilities(path, err_type_info, &capabilities)<0) + continue; + + if (rd_status(path, &status)<0) + continue; + + if (status==0) { + cap=1; + printf("For err_sev=%d, err_struct=%d, struct_hier=%d: ", + err_sev, err_struct, struct_hier); + printf("capabilities 0x%lx\n", capabilities); + } + } + if (!cap) { + printf("No capabilities supported.\n"); + return 0; + } + + return 0; + } + + int err_inject(int cpu, char *path, err_type_info_t err_type_info, + err_struct_info_t err_struct_info, + err_data_buffer_t err_data_buffer) + { + int status; + char fn[MAX_FN_SIZE]; + + log_info(cpu, "err_type_info=%lx, err_struct_info=%lx, ", + err_type_info.err_type_info, + err_struct_info.err_struct_info); + log_info(cpu,"err_data_buffer=[%lx,%lx,%lx]\n", + err_data_buffer.err_data_buffer[0], + err_data_buffer.err_data_buffer[1], + err_data_buffer.err_data_buffer[2]); + sprintf(fn, "%s/err_type_info", path); + wr(fn, err_type_info.err_type_info); + sprintf(fn, "%s/err_struct_info", path); + wr(fn, err_struct_info.err_struct_info); + sprintf(fn, "%s/err_data_buffer", path); + wr_multi(fn, err_data_buffer.err_data_buffer, ERR_DATA_BUFFER_SIZE); + + // Fire pal_mc_error_inject procedure. + sprintf(fn, "%s/call_start", path); + wr(fn,mode); + + if (rd_status(path, &status)<0) { + vbprintf("fail: read status\n"); + return -100; + } + + if (status!=0) { + log_info(cpu, "fail: status=%d\n", status); + return status; + } + + return status; + } + + static int construct_data_buf(char *path, err_type_info_t err_type_info, + err_struct_info_t err_struct_info, + err_data_buffer_t *err_data_buffer, + void *va1) + { + char fn[MAX_FN_SIZE]; + u64 virt_addr=0, phys_addr=0; + + vbprintf("va1=%lx\n", (u64)va1); + memset(&err_data_buffer->err_data_buffer_cache, 0, ERR_DATA_BUFFER_SIZE*8); + + switch (err_type_info.err_type_info_u.err_struct) { + case 1: // Cache + switch (err_struct_info.err_struct_info_cache.cl_id) { + case 1: //Virtual addr + err_data_buffer->err_data_buffer_cache.inj_addr=(u64)va1; + break; + case 2: //Phys addr + sprintf(fn, "%s/virtual_to_phys", path); + virt_addr=(u64)va1; + if (wr(fn,virt_addr)<0) + return -1; + rd(fn, &phys_addr); + err_data_buffer->err_data_buffer_cache.inj_addr=phys_addr; + break; + default: + printf("Not supported cl_id\n"); + break; + } + break; + case 2: // TLB + break; + case 3: // Register file + break; + case 4: // Bus/system interconnect + default: + printf("Not supported err_struct\n"); + break; + } + + return 0; + } + + typedef struct { + u64 cpu; + u64 loop; + u64 interval; + u64 err_type_info; + u64 err_struct_info; + u64 err_data_buffer[ERR_DATA_BUFFER_SIZE]; + } parameters_t; + + parameters_t line_para; + int para; + + static int empty_data_buffer(u64 *err_data_buffer) + { + int empty=1; + int i; + + for (i=0;i<ERR_DATA_BUFFER_SIZE; i++) + if (err_data_buffer[i]!=-1) + empty=0; + + return empty; + } + + int err_inj() + { + err_type_info_t err_type_info; + err_struct_info_t err_struct_info; + err_data_buffer_t err_data_buffer; + int count; + FILE *fp; + unsigned long cpu, loop, interval, err_type_info_conf, err_struct_info_conf; + u64 err_data_buffer_conf[ERR_DATA_BUFFER_SIZE]; + int num; + int i; + char path[MAX_FN_SIZE]; + parameters_t parameters[MAX_TASK_NUM]={}; + pid_t child_pid[MAX_TASK_NUM]; + time_t current_time; + int status; + + if (!para) { + fp=fopen("err.conf", "r"); + if (fp==NULL) { + perror("Error open err.conf"); + return -1; + } + + num=0; + while (!feof(fp)) { + char buf[256]; + memset(buf,0,256); + fgets(buf, 256, fp); + count=sscanf(buf, "%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx\n", + &cpu, &loop, &interval,&err_type_info_conf, + &err_struct_info_conf, + &err_data_buffer_conf[0], + &err_data_buffer_conf[1], + &err_data_buffer_conf[2]); + if (count!=PARA_FIELD_NUM+3) { + err_data_buffer_conf[0]=-1; + err_data_buffer_conf[1]=-1; + err_data_buffer_conf[2]=-1; + count=sscanf(buf, "%lx, %lx, %lx, %lx, %lx\n", + &cpu, &loop, &interval,&err_type_info_conf, + &err_struct_info_conf); + if (count!=PARA_FIELD_NUM) + continue; + } + + parameters[num].cpu=cpu; + parameters[num].loop=loop; + parameters[num].interval= interval>MIN_INTERVAL + ?interval:MIN_INTERVAL; + parameters[num].err_type_info=err_type_info_conf; + parameters[num].err_struct_info=err_struct_info_conf; + memcpy(parameters[num++].err_data_buffer, + err_data_buffer_conf,ERR_DATA_BUFFER_SIZE*8) ; + + if (num>=MAX_TASK_NUM) + break; + } + } + else { + parameters[0].cpu=line_para.cpu; + parameters[0].loop=line_para.loop; + parameters[0].interval= line_para.interval>MIN_INTERVAL + ?line_para.interval:MIN_INTERVAL; + parameters[0].err_type_info=line_para.err_type_info; + parameters[0].err_struct_info=line_para.err_struct_info; + memcpy(parameters[0].err_data_buffer, + line_para.err_data_buffer,ERR_DATA_BUFFER_SIZE*8) ; + + num=1; + } + + /* Create semaphore: If one_lock, one semaphore for all processors. + Otherwise, one semaphore for each processor. */ + if (one_lock) { + if (create_sem(0)) { + printf("Can not create semaphore...exit\n"); + free_sem(0); + return -1; + } + } + else { + for (i=0;i<num;i++) { + if (create_sem(parameters[i].cpu)) { + printf("Can not create semaphore for cpu%d...exit\n",i); + free_sem(parameters[num].cpu); + return -1; + } + } + } + + /* Create a shm segment which will be used to inject/consume errors on.*/ + if (create_shm()==-1) { + printf("Error to create shm...exit\n"); + return -1; + } + + for (i=0;i<num;i++) { + pid_t pid; + + current_time=time(NULL); + log_info(parameters[i].cpu, "\nBegine at %s", ctime(¤t_time)); + log_info(parameters[i].cpu, "Configurations:\n"); + log_info(parameters[i].cpu,"On cpu%ld: loop=%lx, interval=%lx(s)", + parameters[i].cpu, + parameters[i].loop, + parameters[i].interval); + log_info(parameters[i].cpu," err_type_info=%lx,err_struct_info=%lx\n", + parameters[i].err_type_info, + parameters[i].err_struct_info); + + sprintf(path, PATH_FORMAT, (int)parameters[i].cpu); + err_type_info.err_type_info=parameters[i].err_type_info; + err_struct_info.err_struct_info=parameters[i].err_struct_info; + memcpy(err_data_buffer.err_data_buffer, + parameters[i].err_data_buffer, + ERR_DATA_BUFFER_SIZE*8); + + pid=fork(); + if (pid==0) { + unsigned long mask[MASK_SIZE]; + int j, k; + + void *va1, *va2; + + /* Allocate two memory areas va1 and va2 in shm */ + va1=shmaddr+parameters[i].cpu*PAGE_SIZE; + va2=shmaddr+parameters[i].cpu*PAGE_SIZE+PAGE_SIZE; + + vbprintf("va1=%lx, va2=%lx\n", (u64)va1, (u64)va2); + memset(va1, 0x1, PAGE_SIZE); + memset(va2, 0x2, PAGE_SIZE); + + if (empty_data_buffer(err_data_buffer.err_data_buffer)) + /* If not specified yet, construct data buffer + * with va1 + */ + construct_data_buf(path, err_type_info, + err_struct_info, &err_data_buffer,va1); + + for (j=0;j<MASK_SIZE;j++) + mask[j]=0; + + cpu=parameters[i].cpu; + k = cpu%64; + j = cpu/64; + mask[j] = 1UL << k; + + if (sched_setaffinity(0, MASK_SIZE*8, mask)==-1) { + perror("Error sched_setaffinity:"); + return -1; + } + + for (j=0; j<parameters[i].loop; j++) { + log_info(parameters[i].cpu,"Injection "); + log_info(parameters[i].cpu,"on cpu%ld: #%d/%ld ", + + parameters[i].cpu,j+1, parameters[i].loop); + + /* Hold the lock */ + if (one_lock) + lock(0); + else + /* Hold lock on this cpu */ + lock(parameters[i].cpu); + + if ((status=err_inject(parameters[i].cpu, + path, err_type_info, + err_struct_info, err_data_buffer)) + ==0) { + /* consume the error for "inject only"*/ + memcpy(va2, va1, PAGE_SIZE); + memcpy(va1, va2, PAGE_SIZE); + log_info(parameters[i].cpu, + "successful\n"); + } + else { + log_info(parameters[i].cpu,"fail:"); + log_info(parameters[i].cpu, + "status=%d\n", status); + unlock(parameters[i].cpu); + break; + } + if (one_lock) + /* Release the lock */ + unlock(0); + /* Release lock on this cpu */ + else + unlock(parameters[i].cpu); + + if (j < parameters[i].loop-1) + sleep(parameters[i].interval); + } + current_time=time(NULL); + log_info(parameters[i].cpu, "Done at %s", ctime(¤t_time)); + return 0; + } + else if (pid<0) { + perror("Error fork:"); + continue; + } + child_pid[i]=pid; + } + for (i=0;i<num;i++) + waitpid(child_pid[i], NULL, 0); + + if (one_lock) + free_sem(0); + else + for (i=0;i<num;i++) + free_sem(parameters[i].cpu); + + printf("All done.\n"); + + return 0; + } + + void help() + { + printf("err_inject_tool:\n"); + printf("\t-q: query all capabilities. default: off\n"); + printf("\t-m: procedure mode. 1: physical 2: virtual. default: 1\n"); + printf("\t-i: inject errors. default: off\n"); + printf("\t-l: one lock per cpu. default: one lock for all\n"); + printf("\t-e: error parameters:\n"); + printf("\t\tcpu,loop,interval,err_type_info,err_struct_info[,err_data_buffer[0],err_data_buffer[1],err_data_buffer[2]]\n"); + printf("\t\t cpu: logical cpu number the error will be inject in.\n"); + printf("\t\t loop: times the error will be injected.\n"); + printf("\t\t interval: In second. every so often one error is injected.\n"); + printf("\t\t err_type_info, err_struct_info: PAL parameters.\n"); + printf("\t\t err_data_buffer: PAL parameter. Optional. If not present,\n"); + printf("\t\t it's constructed by tool automatically. Be\n"); + printf("\t\t careful to provide err_data_buffer and make\n"); + printf("\t\t sure it's working with the environment.\n"); + printf("\t Note:no space between error parameters.\n"); + printf("\t default: Take error parameters from err.conf instead of command line.\n"); + printf("\t-v: verbose. default: off\n"); + printf("\t-h: help\n\n"); + printf("The tool will take err.conf file as "); + printf("input to inject single or multiple errors "); + printf("on one or multiple cpus in parallel.\n"); + } + + int main(int argc, char **argv) + { + char c; + int do_err_inj=0; + int do_query_all=0; + int count; + u32 m; + + /* Default one lock for all cpu's */ + one_lock=1; + while ((c = getopt(argc, argv, "m:iqvhle:")) != EOF) + switch (c) { + case 'm': /* Procedure mode. 1: phys 2: virt */ + count=sscanf(optarg, "%x", &m); + if (count!=1 || (m!=1 && m!=2)) { + printf("Wrong mode number.\n"); + help(); + return -1; + } + mode=m; + break; + case 'i': /* Inject errors */ + do_err_inj=1; + break; + case 'q': /* Query */ + do_query_all=1; + break; + case 'v': /* Verbose */ + verbose=1; + break; + case 'l': /* One lock per cpu */ + one_lock=0; + break; + case 'e': /* error arguments */ + /* Take parameters: + * #cpu, loop, interval, err_type_info, err_struct_info[, err_data_buffer] + * err_data_buffer is optional. Recommend not to specify + * err_data_buffer. Better to use tool to generate it. + */ + count=sscanf(optarg, + "%lx, %lx, %lx, %lx, %lx, %lx, %lx, %lx\n", + &line_para.cpu, + &line_para.loop, + &line_para.interval, + &line_para.err_type_info, + &line_para.err_struct_info, + &line_para.err_data_buffer[0], + &line_para.err_data_buffer[1], + &line_para.err_data_buffer[2]); + if (count!=PARA_FIELD_NUM+3) { + line_para.err_data_buffer[0]=-1, + line_para.err_data_buffer[1]=-1, + line_para.err_data_buffer[2]=-1; + count=sscanf(optarg, "%lx, %lx, %lx, %lx, %lx\n", + &line_para.cpu, + &line_para.loop, + &line_para.interval, + &line_para.err_type_info, + &line_para.err_struct_info); + if (count!=PARA_FIELD_NUM) { + printf("Wrong error arguments.\n"); + help(); + return -1; + } + } + para=1; + break; + continue; + break; + case 'h': + help(); + return 0; + default: + break; + } + + if (do_query_all) + query_all_capabilities(); + if (do_err_inj) + err_inj(); + + if (!do_query_all && !do_err_inj) + help(); + + return 0; + } diff --git a/Documentation/arch/ia64/features.rst b/Documentation/arch/ia64/features.rst new file mode 100644 index 000000000000..d7226fdcf5f8 --- /dev/null +++ b/Documentation/arch/ia64/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features ia64 diff --git a/Documentation/arch/ia64/fsys.rst b/Documentation/arch/ia64/fsys.rst new file mode 100644 index 000000000000..a702d2cc94b6 --- /dev/null +++ b/Documentation/arch/ia64/fsys.rst @@ -0,0 +1,303 @@ +=================================== +Light-weight System Calls for IA-64 +=================================== + + Started: 13-Jan-2003 + + Last update: 27-Sep-2003 + + David Mosberger-Tang + <davidm@hpl.hp.com> + +Using the "epc" instruction effectively introduces a new mode of +execution to the ia64 linux kernel. We call this mode the +"fsys-mode". To recap, the normal states of execution are: + + - kernel mode: + Both the register stack and the memory stack have been + switched over to kernel memory. The user-level state is saved + in a pt-regs structure at the top of the kernel memory stack. + + - user mode: + Both the register stack and the kernel stack are in + user memory. The user-level state is contained in the + CPU registers. + + - bank 0 interruption-handling mode: + This is the non-interruptible state which all + interruption-handlers start execution in. The user-level + state remains in the CPU registers and some kernel state may + be stored in bank 0 of registers r16-r31. + +In contrast, fsys-mode has the following special properties: + + - execution is at privilege level 0 (most-privileged) + + - CPU registers may contain a mixture of user-level and kernel-level + state (it is the responsibility of the kernel to ensure that no + security-sensitive kernel-level state is leaked back to + user-level) + + - execution is interruptible and preemptible (an fsys-mode handler + can disable interrupts and avoid all other interruption-sources + to avoid preemption) + + - neither the memory-stack nor the register-stack can be trusted while + in fsys-mode (they point to the user-level stacks, which may + be invalid, or completely bogus addresses) + +In summary, fsys-mode is much more similar to running in user-mode +than it is to running in kernel-mode. Of course, given that the +privilege level is at level 0, this means that fsys-mode requires some +care (see below). + + +How to tell fsys-mode +===================== + +Linux operates in fsys-mode when (a) the privilege level is 0 (most +privileged) and (b) the stacks have NOT been switched to kernel memory +yet. For convenience, the header file <asm-ia64/ptrace.h> provides +three macros:: + + user_mode(regs) + user_stack(task,regs) + fsys_mode(task,regs) + +The "regs" argument is a pointer to a pt_regs structure. The "task" +argument is a pointer to the task structure to which the "regs" +pointer belongs to. user_mode() returns TRUE if the CPU state pointed +to by "regs" was executing in user mode (privilege level 3). +user_stack() returns TRUE if the state pointed to by "regs" was +executing on the user-level stack(s). Finally, fsys_mode() returns +TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. +The fsys_mode() macro is equivalent to the expression:: + + !user_mode(regs) && user_stack(task,regs) + +How to write an fsyscall handler +================================ + +The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers +(fsyscall_table). This table contains one entry for each system call. +By default, a system call is handled by fsys_fallback_syscall(). This +routine takes care of entering (full) kernel mode and calling the +normal Linux system call handler. For performance-critical system +calls, it is possible to write a hand-tuned fsyscall_handler. For +example, fsys.S contains fsys_getpid(), which is a hand-tuned version +of the getpid() system call. + +The entry and exit-state of an fsyscall handler is as follows: + +Machine state on entry to fsyscall handler +------------------------------------------ + + ========= =============================================================== + r10 0 + r11 saved ar.pfs (a user-level value) + r15 system call number + r16 "current" task pointer (in normal kernel-mode, this is in r13) + r32-r39 system call arguments + b6 return address (a user-level value) + ar.pfs previous frame-state (a user-level value) + PSR.be cleared to zero (i.e., little-endian byte order is in effect) + - all other registers may contain values passed in from user-mode + ========= =============================================================== + +Required machine state on exit to fsyscall handler +-------------------------------------------------- + + ========= =========================================================== + r11 saved ar.pfs (as passed into the fsyscall handler) + r15 system call number (as passed into the fsyscall handler) + r32-r39 system call arguments (as passed into the fsyscall handler) + b6 return address (as passed into the fsyscall handler) + ar.pfs previous frame-state (as passed into the fsyscall handler) + ========= =========================================================== + +Fsyscall handlers can execute with very little overhead, but with that +speed comes a set of restrictions: + + * Fsyscall-handlers MUST check for any pending work in the flags + member of the thread-info structure and if any of the + TIF_ALLWORK_MASK flags are set, the handler needs to fall back on + doing a full system call (by calling fsys_fallback_syscall). + + * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, + r15, b6, and ar.pfs) because they will be needed in case of a + system call restart. Of course, all "preserved" registers also + must be preserved, in accordance to the normal calling conventions. + + * Fsyscall-handlers MUST check argument registers for containing a + NaT value before using them in any way that could trigger a + NaT-consumption fault. If a system call argument is found to + contain a NaT value, an fsyscall-handler may return immediately + with r8=EINVAL, r10=-1. + + * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform + any other operation that would trigger mandatory RSE + (register-stack engine) traffic. + + * Fsyscall-handlers MUST NOT write to any stacked registers because + it is not safe to assume that user-level called a handler with the + proper number of arguments. + + * Fsyscall-handlers need to be careful when accessing per-CPU variables: + unless proper safe-guards are taken (e.g., interruptions are avoided), + execution may be pre-empted and resumed on another CPU at any given + time. + + * Fsyscall-handlers must be careful not to leak sensitive kernel' + information back to user-level. In particular, before returning to + user-level, care needs to be taken to clear any scratch registers + that could contain sensitive information (note that the current + task pointer is not considered sensitive: it's already exposed + through ar.k6). + + * Fsyscall-handlers MUST NOT access user-memory without first + validating access-permission (this can be done typically via + probe.r.fault and/or probe.w.fault) and without guarding against + memory access exceptions (this can be done with the EX() macros + defined by asmmacro.h). + +The above restrictions may seem draconian, but remember that it's +possible to trade off some of the restrictions by paying a slightly +higher overhead. For example, if an fsyscall-handler could benefit +from the shadow register bank, it could temporarily disable PSR.i and +PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as +needed. In other words, following the above rules yields extremely +fast system call execution (while fully preserving system call +semantics), but there is also a lot of flexibility in handling more +complicated cases. + +Signal handling +=============== + +The delivery of (asynchronous) signals must be delayed until fsys-mode +is exited. This is accomplished with the help of the lower-privilege +transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() +checks whether the interrupted task was in fsys-mode and, if so, sets +PSR.lp and returns immediately. When fsys-mode is exited via the +"br.ret" instruction that lowers the privilege level, a trap will +occur. The trap handler clears PSR.lp again and returns immediately. +The kernel exit path then checks for and delivers any pending signals. + +PSR Handling +============ + +The "epc" instruction doesn't change the contents of PSR at all. This +is in contrast to a regular interruption, which clears almost all +bits. Because of that, some care needs to be taken to ensure things +work as expected. The following discussion describes how each PSR bit +is handled. + +======= ======================================================================= +PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used + to ensure the CPU is in little-endian mode before the first + load/store instruction is executed. PSR.be is normally NOT + restored upon return from an fsys-mode handler. In other + words, user-level code must not rely on PSR.be being preserved + across a system call. +PSR.up Unchanged. +PSR.ac Unchanged. +PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed. +PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed. +PSR.pk Unchanged. +PSR.dt Unchanged. +PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers! +PSR.sp Unchanged. +PSR.pp Unchanged. +PSR.di Unchanged. +PSR.si Unchanged. +PSR.db Unchanged. The kernel prevents user-level from setting a hardware + breakpoint that triggers at any privilege level other than + 3 (user-mode). +PSR.lp Unchanged. +PSR.tb Lazy redirect. If a taken-branch trap occurs while in + fsys-mode, the trap-handler modifies the saved machine state + such that execution resumes in the gate page at + syscall_via_break(), with privilege level 3. Note: the + taken branch would occur on the branch invoking the + fsyscall-handler, at which point, by definition, a syscall + restart is still safe. If the system call number is invalid, + the fsys-mode handler will return directly to user-level. This + return will trigger a taken-branch trap, but since the trap is + taken _after_ restoring the privilege level, the CPU has already + left fsys-mode, so no special treatment is needed. +PSR.rt Unchanged. +PSR.cpl Cleared to 0. +PSR.is Unchanged (guaranteed to be 0 on entry to the gate page). +PSR.mc Unchanged. +PSR.it Unchanged (guaranteed to be 1). +PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit. +PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to + be taken. The trap handler then modifies the saved machine + state such that execution resumes in the gate page at + syscall_via_break(), with privilege level 3. +PSR.ri Unchanged. +PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode + handler performed a speculative load that gets NaTted. If so, this + would be the normal & expected behavior, so no special treatment is + needed. +PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed. + Doing so requires clearing PSR.i and PSR.ic as well. +PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit. +======= ======================================================================= + +Using fast system calls +======================= + +To use fast system calls, userspace applications need simply call +__kernel_syscall_via_epc(). For example + +-- example fgettimeofday() call -- + +-- fgettimeofday.S -- + +:: + + #include <asm/asmmacro.h> + + GLOBAL_ENTRY(fgettimeofday) + .prologue + .save ar.pfs, r11 + mov r11 = ar.pfs + .body + + mov r2 = 0xa000000000020660;; // gate address + // found by inspection of System.map for the + // __kernel_syscall_via_epc() function. See + // below for how to do this for real. + + mov b7 = r2 + mov r15 = 1087 // gettimeofday syscall + ;; + br.call.sptk.many b6 = b7 + ;; + + .restore sp + + mov ar.pfs = r11 + br.ret.sptk.many rp;; // return to caller + END(fgettimeofday) + +-- end fgettimeofday.S -- + +In reality, getting the gate address is accomplished by two extra +values passed via the ELF auxiliary vector (include/asm-ia64/elf.h) + + * AT_SYSINFO : is the address of __kernel_syscall_via_epc() + * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO + +The ELF DSO is a pre-linked library that is mapped in by the kernel at +the gate page. It is a proper ELF shared object so, with a dynamic +loader that recognises the library, you should be able to make calls to +the exported functions within it as with any other shared library. +AT_SYSINFO points into the kernel DSO at the +__kernel_syscall_via_epc() function for historical reasons (it was +used before the kernel DSO) and as a convenience. diff --git a/Documentation/arch/ia64/ia64.rst b/Documentation/arch/ia64/ia64.rst new file mode 100644 index 000000000000..b725019a9492 --- /dev/null +++ b/Documentation/arch/ia64/ia64.rst @@ -0,0 +1,49 @@ +=========================================== +Linux kernel release for the IA-64 Platform +=========================================== + + These are the release notes for Linux since version 2.4 for IA-64 + platform. This document provides information specific to IA-64 + ONLY, to get additional information about the Linux kernel also + read the original Linux README provided with the kernel. + +Installing the Kernel +===================== + + - IA-64 kernel installation is the same as the other platforms, see + original README for details. + + +Software Requirements +===================== + + Compiling and running this kernel requires an IA-64 compliant GCC + compiler. And various software packages also compiled with an + IA-64 compliant GCC compiler. + + +Configuring the Kernel +====================== + + Configuration is the same, see original README for details. + + +Compiling the Kernel: + + - Compiling this kernel doesn't differ from other platform so read + the original README for details BUT make sure you have an IA-64 + compliant GCC compiler. + +IA-64 Specifics +=============== + + - General issues: + + * Hardly any performance tuning has been done. Obvious targets + include the library routines (IP checksum, etc.). Less + obvious targets include making sure we don't flush the TLB + needlessly, etc. + + * SMP locks cleanup/optimization + + * IA32 support. Currently experimental. It mostly works. diff --git a/Documentation/arch/ia64/index.rst b/Documentation/arch/ia64/index.rst new file mode 100644 index 000000000000..761f2154dfa2 --- /dev/null +++ b/Documentation/arch/ia64/index.rst @@ -0,0 +1,19 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +IA-64 Architecture +================== + +.. toctree:: + :maxdepth: 1 + + ia64 + aliasing + efirtc + err_inject + fsys + irq-redir + mca + serial + + features diff --git a/Documentation/arch/ia64/irq-redir.rst b/Documentation/arch/ia64/irq-redir.rst new file mode 100644 index 000000000000..6bbbbe4f73ef --- /dev/null +++ b/Documentation/arch/ia64/irq-redir.rst @@ -0,0 +1,80 @@ +============================== +IRQ affinity on IA64 platforms +============================== + +07.01.2002, Erich Focht <efocht@ess.nec.de> + + +By writing to /proc/irq/IRQ#/smp_affinity the interrupt routing can be +controlled. The behavior on IA64 platforms is slightly different from +that described in Documentation/core-api/irq/irq-affinity.rst for i386 systems. + +Because of the usage of SAPIC mode and physical destination mode the +IRQ target is one particular CPU and cannot be a mask of several +CPUs. Only the first non-zero bit is taken into account. + + +Usage examples +============== + +The target CPU has to be specified as a hexadecimal CPU mask. The +first non-zero bit is the selected CPU. This format has been kept for +compatibility reasons with i386. + +Set the delivery mode of interrupt 41 to fixed and route the +interrupts to CPU #3 (logical CPU number) (2^3=0x08):: + + echo "8" >/proc/irq/41/smp_affinity + +Set the default route for IRQ number 41 to CPU 6 in lowest priority +delivery mode (redirectable):: + + echo "r 40" >/proc/irq/41/smp_affinity + +The output of the command:: + + cat /proc/irq/IRQ#/smp_affinity + +gives the target CPU mask for the specified interrupt vector. If the CPU +mask is preceded by the character "r", the interrupt is redirectable +(i.e. lowest priority mode routing is used), otherwise its route is +fixed. + + + +Initialization and default behavior +=================================== + +If the platform features IRQ redirection (info provided by SAL) all +IO-SAPIC interrupts are initialized with CPU#0 as their default target +and the routing is the so called "lowest priority mode" (actually +fixed SAPIC mode with hint). The XTP chipset registers are used as hints +for the IRQ routing. Currently in Linux XTP registers can have three +values: + + - minimal for an idle task, + - normal if any other task runs, + - maximal if the CPU is going to be switched off. + +The IRQ is routed to the CPU with lowest XTP register value, the +search begins at the default CPU. Therefore most of the interrupts +will be handled by CPU #0. + +If the platform doesn't feature interrupt redirection IOSAPIC fixed +routing is used. The target CPUs are distributed in a round robin +manner. IRQs will be routed only to the selected target CPUs. Check +with:: + + cat /proc/interrupts + + + +Comments +======== + +On large (multi-node) systems it is recommended to route the IRQs to +the node to which the corresponding device is connected. +For systems like the NEC AzusA we get IRQ node-affinity for free. This +is because usually the chipsets on each node redirect the interrupts +only to their own CPUs (as they cannot see the XTP registers on the +other nodes). diff --git a/Documentation/arch/ia64/mca.rst b/Documentation/arch/ia64/mca.rst new file mode 100644 index 000000000000..08270bba44a4 --- /dev/null +++ b/Documentation/arch/ia64/mca.rst @@ -0,0 +1,198 @@ +============================================================= +An ad-hoc collection of notes on IA64 MCA and INIT processing +============================================================= + +Feel free to update it with notes about any area that is not clear. + +--- + +MCA/INIT are completely asynchronous. They can occur at any time, when +the OS is in any state. Including when one of the cpus is already +holding a spinlock. Trying to get any lock from MCA/INIT state is +asking for deadlock. Also the state of structures that are protected +by locks is indeterminate, including linked lists. + +--- + +The complicated ia64 MCA process. All of this is mandated by Intel's +specification for ia64 SAL, error recovery and unwind, it is not as +if we have a choice here. + +* MCA occurs on one cpu, usually due to a double bit memory error. + This is the monarch cpu. + +* SAL sends an MCA rendezvous interrupt (which is a normal interrupt) + to all the other cpus, the slaves. + +* Slave cpus that receive the MCA interrupt call down into SAL, they + end up spinning disabled while the MCA is being serviced. + +* If any slave cpu was already spinning disabled when the MCA occurred + then it cannot service the MCA interrupt. SAL waits ~20 seconds then + sends an unmaskable INIT event to the slave cpus that have not + already rendezvoused. + +* Because MCA/INIT can be delivered at any time, including when the cpu + is down in PAL in physical mode, the registers at the time of the + event are _completely_ undefined. In particular the MCA/INIT + handlers cannot rely on the thread pointer, PAL physical mode can + (and does) modify TP. It is allowed to do that as long as it resets + TP on return. However MCA/INIT events expose us to these PAL + internal TP changes. Hence curr_task(). + +* If an MCA/INIT event occurs while the kernel was running (not user + space) and the kernel has called PAL then the MCA/INIT handler cannot + assume that the kernel stack is in a fit state to be used. Mainly + because PAL may or may not maintain the stack pointer internally. + Because the MCA/INIT handlers cannot trust the kernel stack, they + have to use their own, per-cpu stacks. The MCA/INIT stacks are + preformatted with just enough task state to let the relevant handlers + do their job. + +* Unlike most other architectures, the ia64 struct task is embedded in + the kernel stack[1]. So switching to a new kernel stack means that + we switch to a new task as well. Because various bits of the kernel + assume that current points into the struct task, switching to a new + stack also means a new value for current. + +* Once all slaves have rendezvoused and are spinning disabled, the + monarch is entered. The monarch now tries to diagnose the problem + and decide if it can recover or not. + +* Part of the monarch's job is to look at the state of all the other + tasks. The only way to do that on ia64 is to call the unwinder, + as mandated by Intel. + +* The starting point for the unwind depends on whether a task is + running or not. That is, whether it is on a cpu or is blocked. The + monarch has to determine whether or not a task is on a cpu before it + knows how to start unwinding it. The tasks that received an MCA or + INIT event are no longer running, they have been converted to blocked + tasks. But (and its a big but), the cpus that received the MCA + rendezvous interrupt are still running on their normal kernel stacks! + +* To distinguish between these two cases, the monarch must know which + tasks are on a cpu and which are not. Hence each slave cpu that + switches to an MCA/INIT stack, registers its new stack using + set_curr_task(), so the monarch can tell that the _original_ task is + no longer running on that cpu. That gives us a decent chance of + getting a valid backtrace of the _original_ task. + +* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a + nested error, we want diagnostics on the MCA/INIT handler that + failed, not on the task that was originally running. Again this + requires set_curr_task() so the MCA/INIT handlers can register their + own stack as running on that cpu. Then a recursive error gets a + trace of the failing handler's "task". + +[1] + My (Keith Owens) original design called for ia64 to separate its + struct task and the kernel stacks. Then the MCA/INIT data would be + chained stacks like i386 interrupt stacks. But that required + radical surgery on the rest of ia64, plus extra hard wired TLB + entries with its associated performance degradation. David + Mosberger vetoed that approach. Which meant that separate kernel + stacks meant separate "tasks" for the MCA/INIT handlers. + +--- + +INIT is less complicated than MCA. Pressing the nmi button or using +the equivalent command on the management console sends INIT to all +cpus. SAL picks one of the cpus as the monarch and the rest are +slaves. All the OS INIT handlers are entered at approximately the same +time. The OS monarch prints the state of all tasks and returns, after +which the slaves return and the system resumes. + +At least that is what is supposed to happen. Alas there are broken +versions of SAL out there. Some drive all the cpus as monarchs. Some +drive them all as slaves. Some drive one cpu as monarch, wait for that +cpu to return from the OS then drive the rest as slaves. Some versions +of SAL cannot even cope with returning from the OS, they spin inside +SAL on resume. The OS INIT code has workarounds for some of these +broken SAL symptoms, but some simply cannot be fixed from the OS side. + +--- + +The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer +violations. Unfortunately MCA/INIT start off as massive layer +violations (can occur at _any_ time) and they build from there. + +At least ia64 makes an attempt at recovering from hardware errors, but +it is a difficult problem because of the asynchronous nature of these +errors. When processing an unmaskable interrupt we sometimes need +special code to cope with our inability to take any locks. + +--- + +How is ia64 MCA/INIT different from x86 NMI? + +* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to + all cpus. + +* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 + per cpu. + +* x86 has a separate struct task which points to one of multiple kernel + stacks. ia64 has the struct task embedded in the single kernel + stack, so switching stack means switching task. + +* x86 does not call the BIOS so the NMI handler does not have to worry + about any registers having changed. MCA/INIT can occur while the cpu + is in PAL in physical mode, with undefined registers and an undefined + kernel stack. + +* i386 backtrace is not very sensitive to whether a process is running + or not. ia64 unwind is very, very sensitive to whether a process is + running or not. + +--- + +What happens when MCA/INIT is delivered what a cpu is running user +space code? + +The user mode registers are stored in the RSE area of the MCA/INIT on +entry to the OS and are restored from there on return to SAL, so user +mode registers are preserved across a recoverable MCA/INIT. Since the +OS has no idea what unwind data is available for the user space stack, +MCA/INIT never tries to backtrace user space. Which means that the OS +does not bother making the user space process look like a blocked task, +i.e. the OS does not copy pt_regs and switch_stack to the user space +stack. Also the OS has no idea how big the user space RSE and memory +stacks are, which makes it too risky to copy the saved state to a user +mode stack. + +--- + +How do we get a backtrace on the tasks that were running when MCA/INIT +was delivered? + +mca.c:::ia64_mca_modify_original_stack(). That identifies and +verifies the original kernel stack, copies the dirty registers from +the MCA/INIT stack's RSE to the original stack's RSE, copies the +skeleton struct pt_regs and switch_stack to the original stack, fills +in the skeleton structures from the PAL minstate area and updates the +original stack's thread.ksp. That makes the original stack look +exactly like any other blocked task, i.e. it now appears to be +sleeping. To get a backtrace, just start with thread.ksp for the +original task and unwind like any other sleeping task. + +--- + +How do we identify the tasks that were running when MCA/INIT was +delivered? + +If the previous task has been verified and converted to a blocked +state, then sos->prev_task on the MCA/INIT stack is updated to point to +the previous task. You can look at that field in dumps or debuggers. +To help distinguish between the handler and the original tasks, +handlers have _TIF_MCA_INIT set in thread_info.flags. + +The sos data is always in the MCA/INIT handler stack, at offset +MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it +as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct +ia64_sal_os_state), with 16 byte alignment for all structures. + +Also the comm field of the MCA/INIT task is modified to include the pid +of the original task, for humans to use. For example, a comm field of +'MCA 12159' means that pid 12159 was running when the MCA was +delivered. diff --git a/Documentation/arch/ia64/serial.rst b/Documentation/arch/ia64/serial.rst new file mode 100644 index 000000000000..1de70c305a79 --- /dev/null +++ b/Documentation/arch/ia64/serial.rst @@ -0,0 +1,165 @@ +============== +Serial Devices +============== + +Serial Device Naming +==================== + + As of 2.6.10, serial devices on ia64 are named based on the + order of ACPI and PCI enumeration. The first device in the + ACPI namespace (if any) becomes /dev/ttyS0, the second becomes + /dev/ttyS1, etc., and PCI devices are named sequentially + starting after the ACPI devices. + + Prior to 2.6.10, there were confusing exceptions to this: + + - Firmware on some machines (mostly from HP) provides an HCDP + table[1] that tells the kernel about devices that can be used + as a serial console. If the user specified "console=ttyS0" + or the EFI ConOut path contained only UART devices, the + kernel registered the device described by the HCDP as + /dev/ttyS0. + + - If there was no HCDP, we assumed there were UARTs at the + legacy COM port addresses (I/O ports 0x3f8 and 0x2f8), so + the kernel registered those as /dev/ttyS0 and /dev/ttyS1. + + Any additional ACPI or PCI devices were registered sequentially + after /dev/ttyS0 as they were discovered. + + With an HCDP, device names changed depending on EFI configuration + and "console=" arguments. Without an HCDP, device names didn't + change, but we registered devices that might not really exist. + + For example, an HP rx1600 with a single built-in serial port + (described in the ACPI namespace) plus an MP[2] (a PCI device) has + these ports: + + ========== ========== ============ ============ ======= + Type MMIO pre-2.6.10 pre-2.6.10 2.6.10+ + address + (EFI console (EFI console + on builtin) on MP port) + ========== ========== ============ ============ ======= + builtin 0xff5e0000 ttyS0 ttyS1 ttyS0 + MP UPS 0xf8031000 ttyS1 ttyS2 ttyS1 + MP Console 0xf8030000 ttyS2 ttyS0 ttyS2 + MP 2 0xf8030010 ttyS3 ttyS3 ttyS3 + MP 3 0xf8030038 ttyS4 ttyS4 ttyS4 + ========== ========== ============ ============ ======= + +Console Selection +================= + + EFI knows what your console devices are, but it doesn't tell the + kernel quite enough to actually locate them. The DIG64 HCDP + table[1] does tell the kernel where potential serial console + devices are, but not all firmware supplies it. Also, EFI supports + multiple simultaneous consoles and doesn't tell the kernel which + should be the "primary" one. + + So how do you tell Linux which console device to use? + + - If your firmware supplies the HCDP, it is simplest to + configure EFI with a single device (either a UART or a VGA + card) as the console. Then you don't need to tell Linux + anything; the kernel will automatically use the EFI console. + + (This works only in 2.6.6 or later; prior to that you had + to specify "console=ttyS0" to get a serial console.) + + - Without an HCDP, Linux defaults to a VGA console unless you + specify a "console=" argument. + + NOTE: Don't assume that a serial console device will be /dev/ttyS0. + It might be ttyS1, ttyS2, etc. Make sure you have the appropriate + entries in /etc/inittab (for getty) and /etc/securetty (to allow + root login). + +Early Serial Console +==================== + + The kernel can't start using a serial console until it knows where + the device lives. Normally this happens when the driver enumerates + all the serial devices, which can happen a minute or more after the + kernel starts booting. + + 2.6.10 and later kernels have an "early uart" driver that works + very early in the boot process. The kernel will automatically use + this if the user supplies an argument like "console=uart,io,0x3f8", + or if the EFI console path contains only a UART device and the + firmware supplies an HCDP. + +Troubleshooting Serial Console Problems +======================================= + + No kernel output after elilo prints "Uncompressing Linux... done": + + - You specified "console=ttyS0" but Linux changed the device + to which ttyS0 refers. Configure exactly one EFI console + device[3] and remove the "console=" option. + + - The EFI console path contains both a VGA device and a UART. + EFI and elilo use both, but Linux defaults to VGA. Remove + the VGA device from the EFI console path[3]. + + - Multiple UARTs selected as EFI console devices. EFI and + elilo use all selected devices, but Linux uses only one. + Make sure only one UART is selected in the EFI console + path[3]. + + - You're connected to an HP MP port[2] but have a non-MP UART + selected as EFI console device. EFI uses the MP as a + console device even when it isn't explicitly selected. + Either move the console cable to the non-MP UART, or change + the EFI console path[3] to the MP UART. + + Long pause (60+ seconds) between "Uncompressing Linux... done" and + start of kernel output: + + - No early console because you used "console=ttyS<n>". Remove + the "console=" option if your firmware supplies an HCDP. + + - If you don't have an HCDP, the kernel doesn't know where + your console lives until the driver discovers serial + devices. Use "console=uart,io,0x3f8" (or appropriate + address for your machine). + + Kernel and init script output works fine, but no "login:" prompt: + + - Add getty entry to /etc/inittab for console tty. Look for + the "Adding console on ttyS<n>" message that tells you which + device is the console. + + "login:" prompt, but can't login as root: + + - Add entry to /etc/securetty for console tty. + + No ACPI serial devices found in 2.6.17 or later: + + - Turn on CONFIG_PNP and CONFIG_PNPACPI. Prior to 2.6.17, ACPI + serial devices were discovered by 8250_acpi. In 2.6.17, + 8250_acpi was replaced by the combination of 8250_pnp and + CONFIG_PNPACPI. + + + +[1] + http://www.dig64.org/specifications/agreement + The table was originally defined as the "HCDP" for "Headless + Console/Debug Port." The current version is the "PCDP" for + "Primary Console and Debug Port Devices." + +[2] + The HP MP (management processor) is a PCI device that provides + several UARTs. One of the UARTs is often used as a console; the + EFI Boot Manager identifies it as "Acpi(HWP0002,700)/Pci(...)/Uart". + The external connection is usually a 25-pin connector, and a + special dongle converts that to three 9-pin connectors, one of + which is labelled "Console." + +[3] + EFI console devices are configured using the EFI Boot Manager + "Boot option maintenance" menu. You may have to interrupt the + boot sequence to use this menu, and you will have to reset the + box after changing console configuration. diff --git a/Documentation/arch/index.rst b/Documentation/arch/index.rst new file mode 100644 index 000000000000..80ee31016584 --- /dev/null +++ b/Documentation/arch/index.rst @@ -0,0 +1,28 @@ +.. SPDX-License-Identifier: GPL-2.0 + +CPU Architectures +================= + +These books provide programming details about architecture-specific +implementation. + +.. toctree:: + :maxdepth: 2 + + arc/index + ../arm/index + ../arm64/index + ia64/index + ../loongarch/index + m68k/index + ../mips/index + nios2/index + openrisc/index + parisc/index + ../powerpc/index + ../riscv/index + ../s390/index + sh/index + sparc/index + x86/index + xtensa/index diff --git a/Documentation/arch/m68k/buddha-driver.rst b/Documentation/arch/m68k/buddha-driver.rst new file mode 100644 index 000000000000..20e401413991 --- /dev/null +++ b/Documentation/arch/m68k/buddha-driver.rst @@ -0,0 +1,209 @@ +===================================== +Amiga Buddha and Catweasel IDE Driver +===================================== + +The Amiga Buddha and Catweasel IDE Driver (part of ide.c) was written by +Geert Uytterhoeven based on the following specifications: + +------------------------------------------------------------------------ + +Register map of the Buddha IDE controller and the +Buddha-part of the Catweasel Zorro-II version + +The Autoconfiguration has been implemented just as Commodore +described in their manuals, no tricks have been used (for +example leaving some address lines out of the equations...). +If you want to configure the board yourself (for example let +a Linux kernel configure the card), look at the Commodore +Docs. Reading the nibbles should give this information:: + + Vendor number: 4626 ($1212) + product number: 0 (42 for Catweasel Z-II) + Serial number: 0 + Rom-vector: $1000 + +The card should be a Z-II board, size 64K, not for freemem +list, Rom-Vektor is valid, no second Autoconfig-board on the +same card, no space preference, supports "Shutup_forever". + +Setting the base address should be done in two steps, just +as the Amiga Kickstart does: The lower nibble of the 8-Bit +address is written to $4a, then the whole Byte is written to +$48, while it doesn't matter how often you're writing to $4a +as long as $48 is not touched. After $48 has been written, +the whole card disappears from $e8 and is mapped to the new +address just written. Make sure $4a is written before $48, +otherwise your chance is only 1:16 to find the board :-). + +The local memory-map is even active when mapped to $e8: + +============== =========================================== +$0-$7e Autokonfig-space, see Z-II docs. + +$80-$7fd reserved + +$7fe Speed-select Register: Read & Write + (description see further down) + +$800-$8ff IDE-Select 0 (Port 0, Register set 0) + +$900-$9ff IDE-Select 1 (Port 0, Register set 1) + +$a00-$aff IDE-Select 2 (Port 1, Register set 0) + +$b00-$bff IDE-Select 3 (Port 1, Register set 1) + +$c00-$cff IDE-Select 4 (Port 2, Register set 0, + Catweasel only!) + +$d00-$dff IDE-Select 5 (Port 3, Register set 1, + Catweasel only!) + +$e00-$eff local expansion port, on Catweasel Z-II the + Catweasel registers are also mapped here. + Never touch, use multidisk.device! + +$f00 read only, Byte-access: Bit 7 shows the + level of the IRQ-line of IDE port 0. + +$f01-$f3f mirror of $f00 + +$f40 read only, Byte-access: Bit 7 shows the + level of the IRQ-line of IDE port 1. + +$f41-$f7f mirror of $f40 + +$f80 read only, Byte-access: Bit 7 shows the + level of the IRQ-line of IDE port 2. + (Catweasel only!) + +$f81-$fbf mirror of $f80 + +$fc0 write-only: Writing any value to this + register enables IRQs to be passed from the + IDE ports to the Zorro bus. This mechanism + has been implemented to be compatible with + harddisks that are either defective or have + a buggy firmware and pull the IRQ line up + while starting up. If interrupts would + always be passed to the bus, the computer + might not start up. Once enabled, this flag + can not be disabled again. The level of the + flag can not be determined by software + (what for? Write to me if it's necessary!). + +$fc1-$fff mirror of $fc0 + +$1000-$ffff Buddha-Rom with offset $1000 in the rom + chip. The addresses $0 to $fff of the rom + chip cannot be read. Rom is Byte-wide and + mapped to even addresses. +============== =========================================== + +The IDE ports issue an INT2. You can read the level of the +IRQ-lines of the IDE-ports by reading from the three (two +for Buddha-only) registers $f00, $f40 and $f80. This way +more than one I/O request can be handled and you can easily +determine what driver has to serve the INT2. Buddha and +Catweasel expansion boards can issue an INT6. A separate +memory map is available for the I/O module and the sysop's +I/O module. + +The IDE ports are fed by the address lines A2 to A4, just as +the Amiga 1200 and Amiga 4000 IDE ports are. This way +existing drivers can be easily ported to Buddha. A move.l +polls two words out of the same address of IDE port since +every word is mirrored once. movem is not possible, but +it's not necessary either, because you can only speedup +68000 systems with this technique. A 68020 system with +fastmem is faster with move.l. + +If you're using the mirrored registers of the IDE-ports with +A6=1, the Buddha doesn't care about the speed that you have +selected in the speed register (see further down). With +A6=1 (for example $840 for port 0, register set 0), a 780ns +access is being made. These registers should be used for a +command access to the harddisk/CD-Rom, since command +accesses are Byte-wide and have to be made slower according +to the ATA-X3T9 manual. + +Now for the speed-register: The register is byte-wide, and +only the upper three bits are used (Bits 7 to 5). Bit 4 +must always be set to 1 to be compatible with later Buddha +versions (if I'll ever update this one). I presume that +I'll never use the lower four bits, but they have to be set +to 1 by definition. + +The values in this table have to be shifted 5 bits to the +left and or'd with $1f (this sets the lower 5 bits). + +All the timings have in common: Select and IOR/IOW rise at +the same time. IOR and IOW have a propagation delay of +about 30ns to the clocks on the Zorro bus, that's why the +values are no multiple of 71. One clock-cycle is 71ns long +(exactly 70,5 at 14,18 Mhz on PAL systems). + +value 0 (Default after reset) + 497ns Select (7 clock cycles) , IOR/IOW after 172ns (2 clock cycles) + (same timing as the Amiga 1200 does on it's IDE port without + accelerator card) + +value 1 + 639ns Select (9 clock cycles), IOR/IOW after 243ns (3 clock cycles) + +value 2 + 781ns Select (11 clock cycles), IOR/IOW after 314ns (4 clock cycles) + +value 3 + 355ns Select (5 clock cycles), IOR/IOW after 101ns (1 clock cycle) + +value 4 + 355ns Select (5 clock cycles), IOR/IOW after 172ns (2 clock cycles) + +value 5 + 355ns Select (5 clock cycles), IOR/IOW after 243ns (3 clock cycles) + +value 6 + 1065ns Select (15 clock cycles), IOR/IOW after 314ns (4 clock cycles) + +value 7 + 355ns Select, (5 clock cycles), IOR/IOW after 101ns (1 clock cycle) + +When accessing IDE registers with A6=1 (for example $84x), +the timing will always be mode 0 8-bit compatible, no matter +what you have selected in the speed register: + +781ns select, IOR/IOW after 4 clock cycles (=314ns) aktive. + +All the timings with a very short select-signal (the 355ns +fast accesses) depend on the accelerator card used in the +system: Sometimes two more clock cycles are inserted by the +bus interface, making the whole access 497ns long. This +doesn't affect the reliability of the controller nor the +performance of the card, since this doesn't happen very +often. + +All the timings are calculated and only confirmed by +measurements that allowed me to count the clock cycles. If +the system is clocked by an oscillator other than 28,37516 +Mhz (for example the NTSC-frequency 28,63636 Mhz), each +clock cycle is shortened to a bit less than 70ns (not worth +mentioning). You could think of a small performance boost +by overclocking the system, but you would either need a +multisync monitor, or a graphics card, and your internal +diskdrive would go crazy, that's why you shouldn't tune your +Amiga this way. + +Giving you the possibility to write software that is +compatible with both the Buddha and the Catweasel Z-II, The +Buddha acts just like a Catweasel Z-II with no device +connected to the third IDE-port. The IRQ-register $f80 +always shows a "no IRQ here" on the Buddha, and accesses to +the third IDE port are going into data's Nirwana on the +Buddha. + +Jens Schönfeld february 19th, 1997 + +updated may 27th, 1997 + +eMail: sysop@nostlgic.tng.oche.de diff --git a/Documentation/arch/m68k/features.rst b/Documentation/arch/m68k/features.rst new file mode 100644 index 000000000000..5107a2119472 --- /dev/null +++ b/Documentation/arch/m68k/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features m68k diff --git a/Documentation/arch/m68k/index.rst b/Documentation/arch/m68k/index.rst new file mode 100644 index 000000000000..0f890dbb5fe2 --- /dev/null +++ b/Documentation/arch/m68k/index.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +m68k Architecture +================= + +.. toctree:: + :maxdepth: 2 + + kernel-options + buddha-driver + + features + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/arch/m68k/kernel-options.rst b/Documentation/arch/m68k/kernel-options.rst new file mode 100644 index 000000000000..2008a20b4329 --- /dev/null +++ b/Documentation/arch/m68k/kernel-options.rst @@ -0,0 +1,911 @@ +=================================== +Command Line Options for Linux/m68k +=================================== + +Last Update: 2 May 1999 + +Linux/m68k version: 2.2.6 + +Author: Roman.Hodek@informatik.uni-erlangen.de (Roman Hodek) + +Update: jds@kom.auc.dk (Jes Sorensen) and faq@linux-m68k.org (Chris Lawrence) + +0) Introduction +=============== + +Often I've been asked which command line options the Linux/m68k +kernel understands, or how the exact syntax for the ... option is, or +... about the option ... . I hope, this document supplies all the +answers... + +Note that some options might be outdated, their descriptions being +incomplete or missing. Please update the information and send in the +patches. + + +1) Overview of the Kernel's Option Processing +============================================= + +The kernel knows three kinds of options on its command line: + + 1) kernel options + 2) environment settings + 3) arguments for init + +To which of these classes an argument belongs is determined as +follows: If the option is known to the kernel itself, i.e. if the name +(the part before the '=') or, in some cases, the whole argument string +is known to the kernel, it belongs to class 1. Otherwise, if the +argument contains an '=', it is of class 2, and the definition is put +into init's environment. All other arguments are passed to init as +command line options. + +This document describes the valid kernel options for Linux/m68k in +the version mentioned at the start of this file. Later revisions may +add new such options, and some may be missing in older versions. + +In general, the value (the part after the '=') of an option is a +list of values separated by commas. The interpretation of these values +is up to the driver that "owns" the option. This association of +options with drivers is also the reason that some are further +subdivided. + + +2) General Kernel Options +========================= + +2.1) root= +---------- + +:Syntax: root=/dev/<device> +:or: root=<hex_number> + +This tells the kernel which device it should mount as the root +filesystem. The device must be a block device with a valid filesystem +on it. + +The first syntax gives the device by name. These names are converted +into a major/minor number internally in the kernel in an unusual way. +Normally, this "conversion" is done by the device files in /dev, but +this isn't possible here, because the root filesystem (with /dev) +isn't mounted yet... So the kernel parses the name itself, with some +hardcoded name to number mappings. The name must always be a +combination of two or three letters, followed by a decimal number. +Valid names are:: + + /dev/ram: -> 0x0100 (initial ramdisk) + /dev/hda: -> 0x0300 (first IDE disk) + /dev/hdb: -> 0x0340 (second IDE disk) + /dev/sda: -> 0x0800 (first SCSI disk) + /dev/sdb: -> 0x0810 (second SCSI disk) + /dev/sdc: -> 0x0820 (third SCSI disk) + /dev/sdd: -> 0x0830 (forth SCSI disk) + /dev/sde: -> 0x0840 (fifth SCSI disk) + /dev/fd : -> 0x0200 (floppy disk) + +The name must be followed by a decimal number, that stands for the +partition number. Internally, the value of the number is just +added to the device number mentioned in the table above. The +exceptions are /dev/ram and /dev/fd, where /dev/ram refers to an +initial ramdisk loaded by your bootstrap program (please consult the +instructions for your bootstrap program to find out how to load an +initial ramdisk). As of kernel version 2.0.18 you must specify +/dev/ram as the root device if you want to boot from an initial +ramdisk. For the floppy devices, /dev/fd, the number stands for the +floppy drive number (there are no partitions on floppy disks). I.e., +/dev/fd0 stands for the first drive, /dev/fd1 for the second, and so +on. Since the number is just added, you can also force the disk format +by adding a number greater than 3. If you look into your /dev +directory, use can see the /dev/fd0D720 has major 2 and minor 16. You +can specify this device for the root FS by writing "root=/dev/fd16" on +the kernel command line. + +[Strange and maybe uninteresting stuff ON] + +This unusual translation of device names has some strange +consequences: If, for example, you have a symbolic link from /dev/fd +to /dev/fd0D720 as an abbreviation for floppy driver #0 in DD format, +you cannot use this name for specifying the root device, because the +kernel cannot see this symlink before mounting the root FS and it +isn't in the table above. If you use it, the root device will not be +set at all, without an error message. Another example: You cannot use a +partition on e.g. the sixth SCSI disk as the root filesystem, if you +want to specify it by name. This is, because only the devices up to +/dev/sde are in the table above, but not /dev/sdf. Although, you can +use the sixth SCSI disk for the root FS, but you have to specify the +device by number... (see below). Or, even more strange, you can use the +fact that there is no range checking of the partition number, and your +knowledge that each disk uses 16 minors, and write "root=/dev/sde17" +(for /dev/sdf1). + +[Strange and maybe uninteresting stuff OFF] + +If the device containing your root partition isn't in the table +above, you can also specify it by major and minor numbers. These are +written in hex, with no prefix and no separator between. E.g., if you +have a CD with contents appropriate as a root filesystem in the first +SCSI CD-ROM drive, you boot from it by "root=0b00". Here, hex "0b" = +decimal 11 is the major of SCSI CD-ROMs, and the minor 0 stands for +the first of these. You can find out all valid major numbers by +looking into include/linux/major.h. + +In addition to major and minor numbers, if the device containing your +root partition uses a partition table format with unique partition +identifiers, then you may use them. For instance, +"root=PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF". It is also +possible to reference another partition on the same device using a +known partition UUID as the starting point. For example, +if partition 5 of the device has the UUID of +00112233-4455-6677-8899-AABBCCDDEEFF then partition 3 may be found as +follows: + + PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF/PARTNROFF=-2 + +Authoritative information can be found in +"Documentation/admin-guide/kernel-parameters.rst". + + +2.2) ro, rw +----------- + +:Syntax: ro +:or: rw + +These two options tell the kernel whether it should mount the root +filesystem read-only or read-write. The default is read-only, except +for ramdisks, which default to read-write. + + +2.3) debug +---------- + +:Syntax: debug + +This raises the kernel log level to 10 (the default is 7). This is the +same level as set by the "dmesg" command, just that the maximum level +selectable by dmesg is 8. + + +2.4) debug= +----------- + +:Syntax: debug=<device> + +This option causes certain kernel messages be printed to the selected +debugging device. This can aid debugging the kernel, since the +messages can be captured and analyzed on some other machine. Which +devices are possible depends on the machine type. There are no checks +for the validity of the device name. If the device isn't implemented, +nothing happens. + +Messages logged this way are in general stack dumps after kernel +memory faults or bad kernel traps, and kernel panics. To be exact: all +messages of level 0 (panic messages) and all messages printed while +the log level is 8 or more (their level doesn't matter). Before stack +dumps, the kernel sets the log level to 10 automatically. A level of +at least 8 can also be set by the "debug" command line option (see +2.3) and at run time with "dmesg -n 8". + +Devices possible for Amiga: + + - "ser": + built-in serial port; parameters: 9600bps, 8N1 + - "mem": + Save the messages to a reserved area in chip mem. After + rebooting, they can be read under AmigaOS with the tool + 'dmesg'. + +Devices possible for Atari: + + - "ser1": + ST-MFP serial port ("Modem1"); parameters: 9600bps, 8N1 + - "ser2": + SCC channel B serial port ("Modem2"); parameters: 9600bps, 8N1 + - "ser" : + default serial port + This is "ser2" for a Falcon, and "ser1" for any other machine + - "midi": + The MIDI port; parameters: 31250bps, 8N1 + - "par" : + parallel port + + The printing routine for this implements a timeout for the + case there's no printer connected (else the kernel would + lock up). The timeout is not exact, but usually a few + seconds. + + +2.6) ramdisk_size= +------------------ + +:Syntax: ramdisk_size=<size> + +This option instructs the kernel to set up a ramdisk of the given +size in KBytes. Do not use this option if the ramdisk contents are +passed by bootstrap! In this case, the size is selected automatically +and should not be overwritten. + +The only application is for root filesystems on floppy disks, that +should be loaded into memory. To do that, select the corresponding +size of the disk as ramdisk size, and set the root device to the disk +drive (with "root="). + + +2.7) swap= + + I can't find any sign of this option in 2.2.6. + +2.8) buff= +----------- + + I can't find any sign of this option in 2.2.6. + + +3) General Device Options (Amiga and Atari) +=========================================== + +3.1) ether= +----------- + +:Syntax: ether=[<irq>[,<base_addr>[,<mem_start>[,<mem_end>]]]],<dev-name> + +<dev-name> is the name of a net driver, as specified in +drivers/net/Space.c in the Linux source. Most prominent are eth0, ... +eth3, sl0, ... sl3, ppp0, ..., ppp3, dummy, and lo. + +The non-ethernet drivers (sl, ppp, dummy, lo) obviously ignore the +settings by this options. Also, the existing ethernet drivers for +Linux/m68k (ariadne, a2065, hydra) don't use them because Zorro boards +are really Plug-'n-Play, so the "ether=" option is useless altogether +for Linux/m68k. + + +3.2) hd= +-------- + +:Syntax: hd=<cylinders>,<heads>,<sectors> + +This option sets the disk geometry of an IDE disk. The first hd= +option is for the first IDE disk, the second for the second one. +(I.e., you can give this option twice.) In most cases, you won't have +to use this option, since the kernel can obtain the geometry data +itself. It exists just for the case that this fails for one of your +disks. + + +3.3) max_scsi_luns= +------------------- + +:Syntax: max_scsi_luns=<n> + +Sets the maximum number of LUNs (logical units) of SCSI devices to +be scanned. Valid values for <n> are between 1 and 8. Default is 8 if +"Probe all LUNs on each SCSI device" was selected during the kernel +configuration, else 1. + + +3.4) st= +-------- + +:Syntax: st=<buffer_size>,[<write_thres>,[<max_buffers>]] + +Sets several parameters of the SCSI tape driver. <buffer_size> is +the number of 512-byte buffers reserved for tape operations for each +device. <write_thres> sets the number of blocks which must be filled +to start an actual write operation to the tape. Maximum value is the +total number of buffers. <max_buffer> limits the total number of +buffers allocated for all tape devices. + + +3.5) dmasound= +-------------- + +:Syntax: dmasound=[<buffers>,<buffer-size>[,<catch-radius>]] + +This option controls some configurations of the Linux/m68k DMA sound +driver (Amiga and Atari): <buffers> is the number of buffers you want +to use (minimum 4, default 4), <buffer-size> is the size of each +buffer in kilobytes (minimum 4, default 32) and <catch-radius> says +how much percent of error will be tolerated when setting a frequency +(maximum 10, default 0). For example with 3% you can play 8000Hz +AU-Files on the Falcon with its hardware frequency of 8195Hz and thus +don't need to expand the sound. + + + +4) Options for Atari Only +========================= + +4.1) video= +----------- + +:Syntax: video=<fbname>:<sub-options...> + +The <fbname> parameter specifies the name of the frame buffer, +eg. most atari users will want to specify `atafb` here. The +<sub-options> is a comma-separated list of the sub-options listed +below. + +NB: + Please notice that this option was renamed from `atavideo` to + `video` during the development of the 1.3.x kernels, thus you + might need to update your boot-scripts if upgrading to 2.x from + an 1.2.x kernel. + +NBB: + The behavior of video= was changed in 2.1.57 so the recommended + option is to specify the name of the frame buffer. + +4.1.1) Video Mode +----------------- + +This sub-option may be any of the predefined video modes, as listed +in atari/atafb.c in the Linux/m68k source tree. The kernel will +activate the given video mode at boot time and make it the default +mode, if the hardware allows. Currently defined names are: + + - stlow : 320x200x4 + - stmid, default5 : 640x200x2 + - sthigh, default4: 640x400x1 + - ttlow : 320x480x8, TT only + - ttmid, default1 : 640x480x4, TT only + - tthigh, default2: 1280x960x1, TT only + - vga2 : 640x480x1, Falcon only + - vga4 : 640x480x2, Falcon only + - vga16, default3 : 640x480x4, Falcon only + - vga256 : 640x480x8, Falcon only + - falh2 : 896x608x1, Falcon only + - falh16 : 896x608x4, Falcon only + +If no video mode is given on the command line, the kernel tries the +modes names "default<n>" in turn, until one is possible with the +hardware in use. + +A video mode setting doesn't make sense, if the external driver is +activated by a "external:" sub-option. + +4.1.2) inverse +-------------- + +Invert the display. This affects only text consoles. +Usually, the background is chosen to be black. With this +option, you can make the background white. + +4.1.3) font +----------- + +:Syntax: font:<fontname> + +Specify the font to use in text modes. Currently you can choose only +between `VGA8x8`, `VGA8x16` and `PEARL8x8`. `VGA8x8` is default, if the +vertical size of the display is less than 400 pixel rows. Otherwise, the +`VGA8x16` font is the default. + +4.1.4) `hwscroll_` +------------------ + +:Syntax: `hwscroll_<n>` + +The number of additional lines of video memory to reserve for +speeding up the scrolling ("hardware scrolling"). Hardware scrolling +is possible only if the kernel can set the video base address in steps +fine enough. This is true for STE, MegaSTE, TT, and Falcon. It is not +possible with plain STs and graphics cards (The former because the +base address must be on a 256 byte boundary there, the latter because +the kernel doesn't know how to set the base address at all.) + +By default, <n> is set to the number of visible text lines on the +display. Thus, the amount of video memory is doubled, compared to no +hardware scrolling. You can turn off the hardware scrolling altogether +by setting <n> to 0. + +4.1.5) internal: +---------------- + +:Syntax: internal:<xres>;<yres>[;<xres_max>;<yres_max>;<offset>] + +This option specifies the capabilities of some extended internal video +hardware, like e.g. OverScan. <xres> and <yres> give the (extended) +dimensions of the screen. + +If your OverScan needs a black border, you have to write the last +three arguments of the "internal:". <xres_max> is the maximum line +length the hardware allows, <yres_max> the maximum number of lines. +<offset> is the offset of the visible part of the screen memory to its +physical start, in bytes. + +Often, extended interval video hardware has to be activated somehow. +For this, see the "sw_*" options below. + +4.1.6) external: +---------------- + +:Syntax: + external:<xres>;<yres>;<depth>;<org>;<scrmem>[;<scrlen>[;<vgabase> + [;<colw>[;<coltype>[;<xres_virtual>]]]]] + +.. I had to break this line... + +This is probably the most complicated parameter... It specifies that +you have some external video hardware (a graphics board), and how to +use it under Linux/m68k. The kernel cannot know more about the hardware +than you tell it here! The kernel also is unable to set or change any +video modes, since it doesn't know about any board internal. So, you +have to switch to that video mode before you start Linux, and cannot +switch to another mode once Linux has started. + +The first 3 parameters of this sub-option should be obvious: <xres>, +<yres> and <depth> give the dimensions of the screen and the number of +planes (depth). The depth is the logarithm to base 2 of the number +of colors possible. (Or, the other way round: The number of colors is +2^depth). + +You have to tell the kernel furthermore how the video memory is +organized. This is done by a letter as <org> parameter: + + 'n': + "normal planes", i.e. one whole plane after another + 'i': + "interleaved planes", i.e. 16 bit of the first plane, than 16 bit + of the next, and so on... This mode is used only with the + built-in Atari video modes, I think there is no card that + supports this mode. + 'p': + "packed pixels", i.e. <depth> consecutive bits stand for all + planes of one pixel; this is the most common mode for 8 planes + (256 colors) on graphic cards + 't': + "true color" (more or less packed pixels, but without a color + lookup table); usually depth is 24 + +For monochrome modes (i.e., <depth> is 1), the <org> letter has a +different meaning: + + 'n': + normal colors, i.e. 0=white, 1=black + 'i': + inverted colors, i.e. 0=black, 1=white + +The next important information about the video hardware is the base +address of the video memory. That is given in the <scrmem> parameter, +as a hexadecimal number with a "0x" prefix. You have to find out this +address in the documentation of your hardware. + +The next parameter, <scrlen>, tells the kernel about the size of the +video memory. If it's missing, the size is calculated from <xres>, +<yres>, and <depth>. For now, it is not useful to write a value here. +It would be used only for hardware scrolling (which isn't possible +with the external driver, because the kernel cannot set the video base +address), or for virtual resolutions under X (which the X server +doesn't support yet). So, it's currently best to leave this field +empty, either by ending the "external:" after the video address or by +writing two consecutive semicolons, if you want to give a <vgabase> +(it is allowed to leave this parameter empty). + +The <vgabase> parameter is optional. If it is not given, the kernel +cannot read or write any color registers of the video hardware, and +thus you have to set appropriate colors before you start Linux. But if +your card is somehow VGA compatible, you can tell the kernel the base +address of the VGA register set, so it can change the color lookup +table. You have to look up this address in your board's documentation. +To avoid misunderstandings: <vgabase> is the _base_ address, i.e. a 4k +aligned address. For read/writing the color registers, the kernel +uses the addresses vgabase+0x3c7...vgabase+0x3c9. The <vgabase> +parameter is written in hexadecimal with a "0x" prefix, just as +<scrmem>. + +<colw> is meaningful only if <vgabase> is specified. It tells the +kernel how wide each of the color register is, i.e. the number of bits +per single color (red/green/blue). Default is 6, another quite usual +value is 8. + +Also <coltype> is used together with <vgabase>. It tells the kernel +about the color register model of your gfx board. Currently, the types +"vga" (which is also the default) and "mv300" (SANG MV300) are +implemented. + +Parameter <xres_virtual> is required for ProMST or ET4000 cards where +the physical linelength differs from the visible length. With ProMST, +xres_virtual must be set to 2048. For ET4000, xres_virtual depends on the +initialisation of the video-card. +If you're missing a corresponding yres_virtual: the external part is legacy, +therefore we don't support hardware-dependent functions like hardware-scroll, +panning or blanking. + +4.1.7) eclock: +-------------- + +The external pixel clock attached to the Falcon VIDEL shifter. This +currently works only with the ScreenWonder! + +4.1.8) monitorcap: +------------------- + +:Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax> + +This describes the capabilities of a multisync monitor. Don't use it +with a fixed-frequency monitor! For now, only the Falcon frame buffer +uses the settings of "monitorcap:". + +<vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies +your monitor can work with, in Hz. <hmin> and <hmax> are the same for +the horizontal frequency, in kHz. + + The defaults are 58;62;31;32 (VGA compatible). + + The defaults for TV/SC1224/SC1435 cover both PAL and NTSC standards. + +4.1.9) keep +------------ + +If this option is given, the framebuffer device doesn't do any video +mode calculations and settings on its own. The only Atari fb device +that does this currently is the Falcon. + +What you reach with this: Settings for unknown video extensions +aren't overridden by the driver, so you can still use the mode found +when booting, when the driver doesn't know to set this mode itself. +But this also means, that you can't switch video modes anymore... + +An example where you may want to use "keep" is the ScreenBlaster for +the Falcon. + + +4.2) atamouse= +-------------- + +:Syntax: atamouse=<x-threshold>,[<y-threshold>] + +With this option, you can set the mouse movement reporting threshold. +This is the number of pixels of mouse movement that have to accumulate +before the IKBD sends a new mouse packet to the kernel. Higher values +reduce the mouse interrupt load and thus reduce the chance of keyboard +overruns. Lower values give a slightly faster mouse responses and +slightly better mouse tracking. + +You can set the threshold in x and y separately, but usually this is +of little practical use. If there's just one number in the option, it +is used for both dimensions. The default value is 2 for both +thresholds. + + +4.3) ataflop= +------------- + +:Syntax: ataflop=<drive type>[,<trackbuffering>[,<steprateA>[,<steprateB>]]] + + The drive type may be 0, 1, or 2, for DD, HD, and ED, resp. This + setting affects how many buffers are reserved and which formats are + probed (see also below). The default is 1 (HD). Only one drive type + can be selected. If you have two disk drives, select the "better" + type. + + The second parameter <trackbuffer> tells the kernel whether to use + track buffering (1) or not (0). The default is machine-dependent: + no for the Medusa and yes for all others. + + With the two following parameters, you can change the default + steprate used for drive A and B, resp. + + +4.4) atascsi= +------------- + +:Syntax: atascsi=<can_queue>[,<cmd_per_lun>[,<scat-gat>[,<host-id>[,<tagged>]]]] + +This option sets some parameters for the Atari native SCSI driver. +Generally, any number of arguments can be omitted from the end. And +for each of the numbers, a negative value means "use default". The +defaults depend on whether TT-style or Falcon-style SCSI is used. +Below, defaults are noted as n/m, where the first value refers to +TT-SCSI and the latter to Falcon-SCSI. If an illegal value is given +for one parameter, an error message is printed and that one setting is +ignored (others aren't affected). + + <can_queue>: + This is the maximum number of SCSI commands queued internally to the + Atari SCSI driver. A value of 1 effectively turns off the driver + internal multitasking (if it causes problems). Legal values are >= + 1. <can_queue> can be as high as you like, but values greater than + <cmd_per_lun> times the number of SCSI targets (LUNs) you have + don't make sense. Default: 16/8. + + <cmd_per_lun>: + Maximum number of SCSI commands issued to the driver for one + logical unit (LUN, usually one SCSI target). Legal values start + from 1. If tagged queuing (see below) is not used, values greater + than 2 don't make sense, but waste memory. Otherwise, the maximum + is the number of command tags available to the driver (currently + 32). Default: 8/1. (Note: Values > 1 seem to cause problems on a + Falcon, cause not yet known.) + + The <cmd_per_lun> value at a great part determines the amount of + memory SCSI reserves for itself. The formula is rather + complicated, but I can give you some hints: + + no scatter-gather: + cmd_per_lun * 232 bytes + full scatter-gather: + cmd_per_lun * approx. 17 Kbytes + + <scat-gat>: + Size of the scatter-gather table, i.e. the number of requests + consecutive on the disk that can be merged into one SCSI command. + Legal values are between 0 and 255. Default: 255/0. Note: This + value is forced to 0 on a Falcon, since scatter-gather isn't + possible with the ST-DMA. Not using scatter-gather hurts + performance significantly. + + <host-id>: + The SCSI ID to be used by the initiator (your Atari). This is + usually 7, the highest possible ID. Every ID on the SCSI bus must + be unique. Default: determined at run time: If the NV-RAM checksum + is valid, and bit 7 in byte 30 of the NV-RAM is set, the lower 3 + bits of this byte are used as the host ID. (This method is defined + by Atari and also used by some TOS HD drivers.) If the above + isn't given, the default ID is 7. (both, TT and Falcon). + + <tagged>: + 0 means turn off tagged queuing support, all other values > 0 mean + use tagged queuing for targets that support it. Default: currently + off, but this may change when tagged queuing handling has been + proved to be reliable. + + Tagged queuing means that more than one command can be issued to + one LUN, and the SCSI device itself orders the requests so they + can be performed in optimal order. Not all SCSI devices support + tagged queuing (:-(). + +4.5 switches= +------------- + +:Syntax: switches=<list of switches> + +With this option you can switch some hardware lines that are often +used to enable/disable certain hardware extensions. Examples are +OverScan, overclocking, ... + +The <list of switches> is a comma-separated list of the following +items: + + ikbd: + set RTS of the keyboard ACIA high + midi: + set RTS of the MIDI ACIA high + snd6: + set bit 6 of the PSG port A + snd7: + set bit 6 of the PSG port A + +It doesn't make sense to mention a switch more than once (no +difference to only once), but you can give as many switches as you +want to enable different features. The switch lines are set as early +as possible during kernel initialization (even before determining the +present hardware.) + +All of the items can also be prefixed with `ov_`, i.e. `ov_ikbd`, +`ov_midi`, ... These options are meant for switching on an OverScan +video extension. The difference to the bare option is that the +switch-on is done after video initialization, and somehow synchronized +to the HBLANK. A speciality is that ov_ikbd and ov_midi are switched +off before rebooting, so that OverScan is disabled and TOS boots +correctly. + +If you give an option both, with and without the `ov_` prefix, the +earlier initialization (`ov_`-less) takes precedence. But the +switching-off on reset still happens in this case. + +5) Options for Amiga Only: +========================== + +5.1) video= +----------- + +:Syntax: video=<fbname>:<sub-options...> + +The <fbname> parameter specifies the name of the frame buffer, valid +options are `amifb`, `cyber`, 'virge', `retz3` and `clgen`, provided +that the respective frame buffer devices have been compiled into the +kernel (or compiled as loadable modules). The behavior of the <fbname> +option was changed in 2.1.57 so it is now recommended to specify this +option. + +The <sub-options> is a comma-separated list of the sub-options listed +below. This option is organized similar to the Atari version of the +"video"-option (4.1), but knows fewer sub-options. + +5.1.1) video mode +----------------- + +Again, similar to the video mode for the Atari (see 4.1.1). Predefined +modes depend on the used frame buffer device. + +OCS, ECS and AGA machines all use the color frame buffer. The following +predefined video modes are available: + +NTSC modes: + - ntsc : 640x200, 15 kHz, 60 Hz + - ntsc-lace : 640x400, 15 kHz, 60 Hz interlaced + +PAL modes: + - pal : 640x256, 15 kHz, 50 Hz + - pal-lace : 640x512, 15 kHz, 50 Hz interlaced + +ECS modes: + - multiscan : 640x480, 29 kHz, 57 Hz + - multiscan-lace : 640x960, 29 kHz, 57 Hz interlaced + - euro36 : 640x200, 15 kHz, 72 Hz + - euro36-lace : 640x400, 15 kHz, 72 Hz interlaced + - euro72 : 640x400, 29 kHz, 68 Hz + - euro72-lace : 640x800, 29 kHz, 68 Hz interlaced + - super72 : 800x300, 23 kHz, 70 Hz + - super72-lace : 800x600, 23 kHz, 70 Hz interlaced + - dblntsc-ff : 640x400, 27 kHz, 57 Hz + - dblntsc-lace : 640x800, 27 kHz, 57 Hz interlaced + - dblpal-ff : 640x512, 27 kHz, 47 Hz + - dblpal-lace : 640x1024, 27 kHz, 47 Hz interlaced + - dblntsc : 640x200, 27 kHz, 57 Hz doublescan + - dblpal : 640x256, 27 kHz, 47 Hz doublescan + +VGA modes: + - vga : 640x480, 31 kHz, 60 Hz + - vga70 : 640x400, 31 kHz, 70 Hz + +Please notice that the ECS and VGA modes require either an ECS or AGA +chipset, and that these modes are limited to 2-bit color for the ECS +chipset and 8-bit color for the AGA chipset. + +5.1.2) depth +------------ + +:Syntax: depth:<nr. of bit-planes> + +Specify the number of bit-planes for the selected video-mode. + +5.1.3) inverse +-------------- + +Use inverted display (black on white). Functionally the same as the +"inverse" sub-option for the Atari. + +5.1.4) font +----------- + +:Syntax: font:<fontname> + +Specify the font to use in text modes. Functionally the same as the +"font" sub-option for the Atari, except that `PEARL8x8` is used instead +of `VGA8x8` if the vertical size of the display is less than 400 pixel +rows. + +5.1.5) monitorcap: +------------------- + +:Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax> + +This describes the capabilities of a multisync monitor. For now, only +the color frame buffer uses the settings of "monitorcap:". + +<vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies +your monitor can work with, in Hz. <hmin> and <hmax> are the same for +the horizontal frequency, in kHz. + +The defaults are 50;90;15;38 (Generic Amiga multisync monitor). + + +5.2) fd_def_df0= +---------------- + +:Syntax: fd_def_df0=<value> + +Sets the df0 value for "silent" floppy drives. The value should be in +hexadecimal with "0x" prefix. + + +5.3) wd33c93= +------------- + +:Syntax: wd33c93=<sub-options...> + +These options affect the A590/A2091, A3000 and GVP Series II SCSI +controllers. + +The <sub-options> is a comma-separated list of the sub-options listed +below. + +5.3.1) nosync +------------- + +:Syntax: nosync:bitmask + +bitmask is a byte where the 1st 7 bits correspond with the 7 +possible SCSI devices. Set a bit to prevent sync negotiation on that +device. To maintain backwards compatibility, a command-line such as +"wd33c93=255" will be automatically translated to +"wd33c93=nosync:0xff". The default is to disable sync negotiation for +all devices, eg. nosync:0xff. + +5.3.2) period +------------- + +:Syntax: period:ns + +`ns` is the minimum # of nanoseconds in a SCSI data transfer +period. Default is 500; acceptable values are 250 - 1000. + +5.3.3) disconnect +----------------- + +:Syntax: disconnect:x + +Specify x = 0 to never allow disconnects, 2 to always allow them. +x = 1 does 'adaptive' disconnects, which is the default and generally +the best choice. + +5.3.4) debug +------------ + +:Syntax: debug:x + +If `DEBUGGING_ON` is defined, x is a bit mask that causes various +types of debug output to printed - see the DB_xxx defines in +wd33c93.h. + +5.3.5) clock +------------ + +:Syntax: clock:x + +x = clock input in MHz for WD33c93 chip. Normal values would be from +8 through 20. The default value depends on your hostadapter(s), +default for the A3000 internal controller is 14, for the A2091 it's 8 +and for the GVP hostadapters it's either 8 or 14, depending on the +hostadapter and the SCSI-clock jumper present on some GVP +hostadapters. + +5.3.6) next +----------- + +No argument. Used to separate blocks of keywords when there's more +than one wd33c93-based host adapter in the system. + +5.3.7) nodma +------------ + +:Syntax: nodma:x + +If x is 1 (or if the option is just written as "nodma"), the WD33c93 +controller will not use DMA (= direct memory access) to access the +Amiga's memory. This is useful for some systems (like A3000's and +A4000's with the A3640 accelerator, revision 3.0) that have problems +using DMA to chip memory. The default is 0, i.e. to use DMA if +possible. + + +5.4) gvp11= +----------- + +:Syntax: gvp11=<addr-mask> + +The earlier versions of the GVP driver did not handle DMA +address-mask settings correctly which made it necessary for some +people to use this option, in order to get their GVP controller +running under Linux. These problems have hopefully been solved and the +use of this option is now highly unrecommended! + +Incorrect use can lead to unpredictable behavior, so please only use +this option if you *know* what you are doing and have a reason to do +so. In any case if you experience problems and need to use this +option, please inform us about it by mailing to the Linux/68k kernel +mailing list. + +The address mask set by this option specifies which addresses are +valid for DMA with the GVP Series II SCSI controller. An address is +valid, if no bits are set except the bits that are set in the mask, +too. + +Some versions of the GVP can only DMA into a 24 bit address range, +some can address a 25 bit address range while others can use the whole +32 bit address range for DMA. The correct setting depends on your +controller and should be autodetected by the driver. An example is the +24 bit region which is specified by a mask of 0x00fffffe. diff --git a/Documentation/arch/nios2/features.rst b/Documentation/arch/nios2/features.rst new file mode 100644 index 000000000000..8449e63f69b2 --- /dev/null +++ b/Documentation/arch/nios2/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features nios2 diff --git a/Documentation/arch/nios2/index.rst b/Documentation/arch/nios2/index.rst new file mode 100644 index 000000000000..4468fe1a1037 --- /dev/null +++ b/Documentation/arch/nios2/index.rst @@ -0,0 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +Nios II Specific Documentation +============================== + +.. toctree:: + :maxdepth: 2 + :numbered: + + nios2 + features diff --git a/Documentation/arch/nios2/nios2.rst b/Documentation/arch/nios2/nios2.rst new file mode 100644 index 000000000000..43da3f7cee76 --- /dev/null +++ b/Documentation/arch/nios2/nios2.rst @@ -0,0 +1,24 @@ +================================= +Linux on the Nios II architecture +================================= + +This is a port of Linux to Nios II (nios2) processor. + +In order to compile for Nios II, you need a version of GCC with support for the generic +system call ABI. Please see this link for more information on how compiling and booting +software for the Nios II platform: +http://www.rocketboards.org/foswiki/Documentation/NiosIILinuxUserManual + +For reference, please see the following link: +http://www.altera.com/literature/lit-nio2.jsp + +What is Nios II? +================ +Nios II is a 32-bit embedded-processor architecture designed specifically for the +Altera family of FPGAs. In order to support Linux, Nios II needs to be configured +with MMU and hardware multiplier enabled. + +Nios II ABI +=========== +Please refer to chapter "Application Binary Interface" in Nios II Processor Reference +Handbook. diff --git a/Documentation/arch/openrisc/features.rst b/Documentation/arch/openrisc/features.rst new file mode 100644 index 000000000000..3f7c40d219f2 --- /dev/null +++ b/Documentation/arch/openrisc/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features openrisc diff --git a/Documentation/arch/openrisc/index.rst b/Documentation/arch/openrisc/index.rst new file mode 100644 index 000000000000..6879f998b87a --- /dev/null +++ b/Documentation/arch/openrisc/index.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +OpenRISC Architecture +===================== + +.. toctree:: + :maxdepth: 2 + + openrisc_port + todo + + features + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/arch/openrisc/openrisc_port.rst b/Documentation/arch/openrisc/openrisc_port.rst new file mode 100644 index 000000000000..657ac4af7be6 --- /dev/null +++ b/Documentation/arch/openrisc/openrisc_port.rst @@ -0,0 +1,121 @@ +============== +OpenRISC Linux +============== + +This is a port of Linux to the OpenRISC class of microprocessors; the initial +target architecture, specifically, is the 32-bit OpenRISC 1000 family (or1k). + +For information about OpenRISC processors and ongoing development: + + ======= ============================= + website https://openrisc.io + email openrisc@lists.librecores.org + ======= ============================= + +--------------------------------------------------------------------- + +Build instructions for OpenRISC toolchain and Linux +=================================================== + +In order to build and run Linux for OpenRISC, you'll need at least a basic +toolchain and, perhaps, the architectural simulator. Steps to get these bits +in place are outlined here. + +1) Toolchain + +Toolchain binaries can be obtained from openrisc.io or our github releases page. +Instructions for building the different toolchains can be found on openrisc.io +or Stafford's toolchain build and release scripts. + + ========== ================================================= + binaries https://github.com/openrisc/or1k-gcc/releases + toolchains https://openrisc.io/software + building https://github.com/stffrdhrn/or1k-toolchain-build + ========== ================================================= + +2) Building + +Build the Linux kernel as usual:: + + make ARCH=openrisc CROSS_COMPILE="or1k-linux-" defconfig + make ARCH=openrisc CROSS_COMPILE="or1k-linux-" + +3) Running on FPGA (optional) + +The OpenRISC community typically uses FuseSoC to manage building and programming +an SoC into an FPGA. The below is an example of programming a De0 Nano +development board with the OpenRISC SoC. During the build FPGA RTL is code +downloaded from the FuseSoC IP cores repository and built using the FPGA vendor +tools. Binaries are loaded onto the board with openocd. + +:: + + git clone https://github.com/olofk/fusesoc + cd fusesoc + sudo pip install -e . + + fusesoc init + fusesoc build de0_nano + fusesoc pgm de0_nano + + openocd -f interface/altera-usb-blaster.cfg \ + -f board/or1k_generic.cfg + + telnet localhost 4444 + > init + > halt; load_image vmlinux ; reset + +4) Running on a Simulator (optional) + +QEMU is a processor emulator which we recommend for simulating the OpenRISC +platform. Please follow the OpenRISC instructions on the QEMU website to get +Linux running on QEMU. You can build QEMU yourself, but your Linux distribution +likely provides binary packages to support OpenRISC. + + ============= ====================================================== + qemu openrisc https://wiki.qemu.org/Documentation/Platforms/OpenRISC + ============= ====================================================== + +--------------------------------------------------------------------- + +Terminology +=========== + +In the code, the following particles are used on symbols to limit the scope +to more or less specific processor implementations: + +========= ======================================= +openrisc: the OpenRISC class of processors +or1k: the OpenRISC 1000 family of processors +or1200: the OpenRISC 1200 processor +========= ======================================= + +--------------------------------------------------------------------- + +History +======== + +18-11-2003 Matjaz Breskvar (phoenix@bsemi.com) + initial port of linux to OpenRISC/or32 architecture. + all the core stuff is implemented and seams usable. + +08-12-2003 Matjaz Breskvar (phoenix@bsemi.com) + complete change of TLB miss handling. + rewrite of exceptions handling. + fully functional sash-3.6 in default initrd. + a much improved version with changes all around. + +10-04-2004 Matjaz Breskvar (phoenix@bsemi.com) + alot of bugfixes all over. + ethernet support, functional http and telnet servers. + running many standard linux apps. + +26-06-2004 Matjaz Breskvar (phoenix@bsemi.com) + port to 2.6.x + +30-11-2004 Matjaz Breskvar (phoenix@bsemi.com) + lots of bugfixes and enhancments. + added opencores framebuffer driver. + +09-10-2010 Jonas Bonn (jonas@southpole.se) + major rewrite to bring up to par with upstream Linux 2.6.36 diff --git a/Documentation/arch/openrisc/todo.rst b/Documentation/arch/openrisc/todo.rst new file mode 100644 index 000000000000..420b18b87eda --- /dev/null +++ b/Documentation/arch/openrisc/todo.rst @@ -0,0 +1,15 @@ +==== +TODO +==== + +The OpenRISC Linux port is fully functional and has been tracking upstream +since 2.6.35. There are, however, remaining items to be completed within +the coming months. Here's a list of known-to-be-less-than-stellar items +that are due for investigation shortly, i.e. our TODO list: + +- Implement the rest of the DMA API... dma_map_sg, etc. + +- Finish the renaming cleanup... there are references to or32 in the code + which was an older name for the architecture. The name we've settled on is + or1k and this change is slowly trickling through the stack. For the time + being, or32 is equivalent to or1k. diff --git a/Documentation/arch/parisc/debugging.rst b/Documentation/arch/parisc/debugging.rst new file mode 100644 index 000000000000..de1b60402c5b --- /dev/null +++ b/Documentation/arch/parisc/debugging.rst @@ -0,0 +1,46 @@ +================= +PA-RISC Debugging +================= + +okay, here are some hints for debugging the lower-level parts of +linux/parisc. + + +1. Absolute addresses +===================== + +A lot of the assembly code currently runs in real mode, which means +absolute addresses are used instead of virtual addresses as in the +rest of the kernel. To translate an absolute address to a virtual +address you can lookup in System.map, add __PAGE_OFFSET (0x10000000 +currently). + + +2. HPMCs +======== + +When real-mode code tries to access non-existent memory, you'll get +an HPMC instead of a kernel oops. To debug an HPMC, try to find +the System Responder/Requestor addresses. The System Requestor +address should match (one of the) processor HPAs (high addresses in +the I/O range); the System Responder address is the address real-mode +code tried to access. + +Typical values for the System Responder address are addresses larger +than __PAGE_OFFSET (0x10000000) which mean a virtual address didn't +get translated to a physical address before real-mode code tried to +access it. + + +3. Q bit fun +============ + +Certain, very critical code has to clear the Q bit in the PSW. What +happens when the Q bit is cleared is the CPU does not update the +registers interruption handlers read to find out where the machine +was interrupted - so if you get an interruption between the instruction +that clears the Q bit and the RFI that sets it again you don't know +where exactly it happened. If you're lucky the IAOQ will point to the +instruction that cleared the Q bit, if you're not it points anywhere +at all. Usually Q bit problems will show themselves in unexplainable +system hangs or running off the end of physical memory. diff --git a/Documentation/arch/parisc/features.rst b/Documentation/arch/parisc/features.rst new file mode 100644 index 000000000000..501d7c450037 --- /dev/null +++ b/Documentation/arch/parisc/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features parisc diff --git a/Documentation/arch/parisc/index.rst b/Documentation/arch/parisc/index.rst new file mode 100644 index 000000000000..240685751825 --- /dev/null +++ b/Documentation/arch/parisc/index.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +PA-RISC Architecture +==================== + +.. toctree:: + :maxdepth: 2 + + debugging + registers + + features + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/arch/parisc/registers.rst b/Documentation/arch/parisc/registers.rst new file mode 100644 index 000000000000..59c8ecf3e856 --- /dev/null +++ b/Documentation/arch/parisc/registers.rst @@ -0,0 +1,154 @@ +================================ +Register Usage for Linux/PA-RISC +================================ + +[ an asterisk is used for planned usage which is currently unimplemented ] + +General Registers as specified by ABI +===================================== + +Control Registers +----------------- + +=============================== =============================================== +CR 0 (Recovery Counter) used for ptrace +CR 1-CR 7(undefined) unused +CR 8 (Protection ID) per-process value* +CR 9, 12, 13 (PIDS) unused +CR10 (CCR) lazy FPU saving* +CR11 as specified by ABI (SAR) +CR14 (interruption vector) initialized to fault_vector +CR15 (EIEM) initialized to all ones* +CR16 (Interval Timer) read for cycle count/write starts Interval Tmr +CR17-CR22 interruption parameters +CR19 Interrupt Instruction Register +CR20 Interrupt Space Register +CR21 Interrupt Offset Register +CR22 Interrupt PSW +CR23 (EIRR) read for pending interrupts/write clears bits +CR24 (TR 0) Kernel Space Page Directory Pointer +CR25 (TR 1) User Space Page Directory Pointer +CR26 (TR 2) not used +CR27 (TR 3) Thread descriptor pointer +CR28 (TR 4) not used +CR29 (TR 5) not used +CR30 (TR 6) current / 0 +CR31 (TR 7) Temporary register, used in various places +=============================== =============================================== + +Space Registers (kernel mode) +----------------------------- + +=============================== =============================================== +SR0 temporary space register +SR4-SR7 set to 0 +SR1 temporary space register +SR2 kernel should not clobber this +SR3 used for userspace accesses (current process) +=============================== =============================================== + +Space Registers (user mode) +--------------------------- + +=============================== =============================================== +SR0 temporary space register +SR1 temporary space register +SR2 holds space of linux gateway page +SR3 holds user address space value while in kernel +SR4-SR7 Defines short address space for user/kernel +=============================== =============================================== + + +Processor Status Word +--------------------- + +=============================== =============================================== +W (64-bit addresses) 0 +E (Little-endian) 0 +S (Secure Interval Timer) 0 +T (Taken Branch Trap) 0 +H (Higher-privilege trap) 0 +L (Lower-privilege trap) 0 +N (Nullify next instruction) used by C code +X (Data memory break disable) 0 +B (Taken Branch) used by C code +C (code address translation) 1, 0 while executing real-mode code +V (divide step correction) used by C code +M (HPMC mask) 0, 1 while executing HPMC handler* +C/B (carry/borrow bits) used by C code +O (ordered references) 1* +F (performance monitor) 0 +R (Recovery Counter trap) 0 +Q (collect interruption state) 1 (0 in code directly preceding an rfi) +P (Protection Identifiers) 1* +D (Data address translation) 1, 0 while executing real-mode code +I (external interrupt mask) used by cli()/sti() macros +=============================== =============================================== + +"Invisible" Registers +--------------------- + +=============================== =============================================== +PSW default W value 0 +PSW default E value 0 +Shadow Registers used by interruption handler code +TOC enable bit 1 +=============================== =============================================== + +------------------------------------------------------------------------- + +The PA-RISC architecture defines 7 registers as "shadow registers". +Those are used in RETURN FROM INTERRUPTION AND RESTORE instruction to reduce +the state save and restore time by eliminating the need for general register +(GR) saves and restores in interruption handlers. +Shadow registers are the GRs 1, 8, 9, 16, 17, 24, and 25. + +------------------------------------------------------------------------- + +Register usage notes, originally from John Marvin, with some additional +notes from Randolph Chung. + +For the general registers: + +r1,r2,r19-r26,r28,r29 & r31 can be used without saving them first. And of +course, you need to save them if you care about them, before calling +another procedure. Some of the above registers do have special meanings +that you should be aware of: + + r1: + The addil instruction is hardwired to place its result in r1, + so if you use that instruction be aware of that. + + r2: + This is the return pointer. In general you don't want to + use this, since you need the pointer to get back to your + caller. However, it is grouped with this set of registers + since the caller can't rely on the value being the same + when you return, i.e. you can copy r2 to another register + and return through that register after trashing r2, and + that should not cause a problem for the calling routine. + + r19-r22: + these are generally regarded as temporary registers. + Note that in 64 bit they are arg7-arg4. + + r23-r26: + these are arg3-arg0, i.e. you can use them if you + don't care about the values that were passed in anymore. + + r28,r29: + are ret0 and ret1. They are what you pass return values + in. r28 is the primary return. When returning small structures + r29 may also be used to pass data back to the caller. + + r30: + stack pointer + + r31: + the ble instruction puts the return pointer in here. + + + r3-r18,r27,r30 need to be saved and restored. r3-r18 are just + general purpose registers. r27 is the data pointer, and is + used to make references to global variables easier. r30 is + the stack pointer. diff --git a/Documentation/arch/sh/booting.rst b/Documentation/arch/sh/booting.rst new file mode 100644 index 000000000000..d851c49a01bf --- /dev/null +++ b/Documentation/arch/sh/booting.rst @@ -0,0 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +DeviceTree Booting +------------------ + + Device-tree compatible SH bootloaders are expected to provide the physical + address of the device tree blob in r4. Since legacy bootloaders did not + guarantee any particular initial register state, kernels built to + inter-operate with old bootloaders must either use a builtin DTB or + select a legacy board option (something other than CONFIG_SH_DEVICE_TREE) + that does not use device tree. Support for the latter is being phased out + in favor of device tree. diff --git a/Documentation/arch/sh/features.rst b/Documentation/arch/sh/features.rst new file mode 100644 index 000000000000..f722af3b6c99 --- /dev/null +++ b/Documentation/arch/sh/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features sh diff --git a/Documentation/arch/sh/index.rst b/Documentation/arch/sh/index.rst new file mode 100644 index 000000000000..c64776738cf6 --- /dev/null +++ b/Documentation/arch/sh/index.rst @@ -0,0 +1,56 @@ +======================= +SuperH Interfaces Guide +======================= + +:Author: Paul Mundt + +.. toctree:: + :maxdepth: 1 + + booting + new-machine + register-banks + + features + +Memory Management +================= + +SH-4 +---- + +Store Queue API +~~~~~~~~~~~~~~~ + +.. kernel-doc:: arch/sh/kernel/cpu/sh4/sq.c + :export: + +Machine Specific Interfaces +=========================== + +mach-dreamcast +-------------- + +.. kernel-doc:: arch/sh/boards/mach-dreamcast/rtc.c + :internal: + +mach-x3proto +------------ + +.. kernel-doc:: arch/sh/boards/mach-x3proto/ilsel.c + :export: + +Busses +====== + +SuperHyway +---------- + +.. kernel-doc:: drivers/sh/superhyway/superhyway.c + :export: + +Maple +----- + +.. kernel-doc:: drivers/sh/maple/maple.c + :export: diff --git a/Documentation/arch/sh/new-machine.rst b/Documentation/arch/sh/new-machine.rst new file mode 100644 index 000000000000..e501c52b3b30 --- /dev/null +++ b/Documentation/arch/sh/new-machine.rst @@ -0,0 +1,277 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +Adding a new board to LinuxSH +============================= + + Paul Mundt <lethal@linux-sh.org> + +This document attempts to outline what steps are necessary to add support +for new boards to the LinuxSH port under the new 2.5 and 2.6 kernels. This +also attempts to outline some of the noticeable changes between the 2.4 +and the 2.5/2.6 SH backend. + +1. New Directory Structure +========================== + +The first thing to note is the new directory structure. Under 2.4, most +of the board-specific code (with the exception of stboards) ended up +in arch/sh/kernel/ directly, with board-specific headers ending up in +include/asm-sh/. For the new kernel, things are broken out by board type, +companion chip type, and CPU type. Looking at a tree view of this directory +hierarchy looks like the following: + +Board-specific code:: + + . + |-- arch + | `-- sh + | `-- boards + | |-- adx + | | `-- board-specific files + | |-- bigsur + | | `-- board-specific files + | | + | ... more boards here ... + | + `-- include + `-- asm-sh + |-- adx + | `-- board-specific headers + |-- bigsur + | `-- board-specific headers + | + .. more boards here ... + +Next, for companion chips:: + + . + `-- arch + `-- sh + `-- cchips + `-- hd6446x + `-- hd64461 + `-- cchip-specific files + +... and so on. Headers for the companion chips are treated the same way as +board-specific headers. Thus, include/asm-sh/hd64461 is home to all of the +hd64461-specific headers. + +Finally, CPU family support is also abstracted:: + + . + |-- arch + | `-- sh + | |-- kernel + | | `-- cpu + | | |-- sh2 + | | | `-- SH-2 generic files + | | |-- sh3 + | | | `-- SH-3 generic files + | | `-- sh4 + | | `-- SH-4 generic files + | `-- mm + | `-- This is also broken out per CPU family, so each family can + | have their own set of cache/tlb functions. + | + `-- include + `-- asm-sh + |-- cpu-sh2 + | `-- SH-2 specific headers + |-- cpu-sh3 + | `-- SH-3 specific headers + `-- cpu-sh4 + `-- SH-4 specific headers + +It should be noted that CPU subtypes are _not_ abstracted. Thus, these still +need to be dealt with by the CPU family specific code. + +2. Adding a New Board +===================== + +The first thing to determine is whether the board you are adding will be +isolated, or whether it will be part of a family of boards that can mostly +share the same board-specific code with minor differences. + +In the first case, this is just a matter of making a directory for your +board in arch/sh/boards/ and adding rules to hook your board in with the +build system (more on this in the next section). However, for board families +it makes more sense to have a common top-level arch/sh/boards/ directory +and then populate that with sub-directories for each member of the family. +Both the Solution Engine and the hp6xx boards are an example of this. + +After you have setup your new arch/sh/boards/ directory, remember that you +should also add a directory in include/asm-sh for headers localized to this +board (if there are going to be more than one). In order to interoperate +seamlessly with the build system, it's best to have this directory the same +as the arch/sh/boards/ directory name, though if your board is again part of +a family, the build system has ways of dealing with this (via incdir-y +overloading), and you can feel free to name the directory after the family +member itself. + +There are a few things that each board is required to have, both in the +arch/sh/boards and the include/asm-sh/ hierarchy. In order to better +explain this, we use some examples for adding an imaginary board. For +setup code, we're required at the very least to provide definitions for +get_system_type() and platform_setup(). For our imaginary board, this +might look something like:: + + /* + * arch/sh/boards/vapor/setup.c - Setup code for imaginary board + */ + #include <linux/init.h> + + const char *get_system_type(void) + { + return "FooTech Vaporboard"; + } + + int __init platform_setup(void) + { + /* + * If our hardware actually existed, we would do real + * setup here. Though it's also sane to leave this empty + * if there's no real init work that has to be done for + * this board. + */ + + /* Start-up imaginary PCI ... */ + + /* And whatever else ... */ + + return 0; + } + +Our new imaginary board will also have to tie into the machvec in order for it +to be of any use. + +machvec functions fall into a number of categories: + + - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc). + - I/O mapping functions (ioport_map, ioport_unmap, etc). + - a 'heartbeat' function. + - PCI and IRQ initialization routines. + - Consistent allocators (for boards that need special allocators, + particularly for allocating out of some board-specific SRAM for DMA + handles). + +There are machvec functions added and removed over time, so always be sure to +consult include/asm-sh/machvec.h for the current state of the machvec. + +The kernel will automatically wrap in generic routines for undefined function +pointers in the machvec at boot time, as machvec functions are referenced +unconditionally throughout most of the tree. Some boards have incredibly +sparse machvecs (such as the dreamcast and sh03), whereas others must define +virtually everything (rts7751r2d). + +Adding a new machine is relatively trivial (using vapor as an example): + +If the board-specific definitions are quite minimalistic, as is the case for +the vast majority of boards, simply having a single board-specific header is +sufficient. + + - add a new file include/asm-sh/vapor.h which contains prototypes for + any machine specific IO functions prefixed with the machine name, for + example vapor_inb. These will be needed when filling out the machine + vector. + + Note that these prototypes are generated automatically by setting + __IO_PREFIX to something sensible. A typical example would be:: + + #define __IO_PREFIX vapor + #include <asm/io_generic.h> + + somewhere in the board-specific header. Any boards being ported that still + have a legacy io.h should remove it entirely and switch to the new model. + + - Add machine vector definitions to the board's setup.c. At a bare minimum, + this must be defined as something like:: + + struct sh_machine_vector mv_vapor __initmv = { + .mv_name = "vapor", + }; + ALIAS_MV(vapor) + + - finally add a file arch/sh/boards/vapor/io.c, which contains definitions of + the machine specific io functions (if there are enough to warrant it). + +3. Hooking into the Build System +================================ + +Now that we have the corresponding directories setup, and all of the +board-specific code is in place, it's time to look at how to get the +whole mess to fit into the build system. + +Large portions of the build system are now entirely dynamic, and merely +require the proper entry here and there in order to get things done. + +The first thing to do is to add an entry to arch/sh/Kconfig, under the +"System type" menu:: + + config SH_VAPOR + bool "Vapor" + help + select Vapor if configuring for a FooTech Vaporboard. + +next, this has to be added into arch/sh/Makefile. All boards require a +machdir-y entry in order to be built. This entry needs to be the name of +the board directory as it appears in arch/sh/boards, even if it is in a +sub-directory (in which case, all parent directories below arch/sh/boards/ +need to be listed). For our new board, this entry can look like:: + + machdir-$(CONFIG_SH_VAPOR) += vapor + +provided that we've placed everything in the arch/sh/boards/vapor/ directory. + +Next, the build system assumes that your include/asm-sh directory will also +be named the same. If this is not the case (as is the case with multiple +boards belonging to a common family), then the directory name needs to be +implicitly appended to incdir-y. The existing code manages this for the +Solution Engine and hp6xx boards, so see these for an example. + +Once that is taken care of, it's time to add an entry for the mach type. +This is done by adding an entry to the end of the arch/sh/tools/mach-types +list. The method for doing this is self explanatory, and so we won't waste +space restating it here. After this is done, you will be able to use +implicit checks for your board if you need this somewhere throughout the +common code, such as:: + + /* Make sure we're on the FooTech Vaporboard */ + if (!mach_is_vapor()) + return -ENODEV; + +also note that the mach_is_boardname() check will be implicitly forced to +lowercase, regardless of the fact that the mach-types entries are all +uppercase. You can read the script if you really care, but it's pretty ugly, +so you probably don't want to do that. + +Now all that's left to do is providing a defconfig for your new board. This +way, other people who end up with this board can simply use this config +for reference instead of trying to guess what settings are supposed to be +used on it. + +Also, as soon as you have copied over a sample .config for your new board +(assume arch/sh/configs/vapor_defconfig), you can also use this directly as a +build target, and it will be implicitly listed as such in the help text. + +Looking at the 'make help' output, you should now see something like: + +Architecture specific targets (sh): + + ======================= ============================================= + zImage Compressed kernel image (arch/sh/boot/zImage) + adx_defconfig Build for adx + cqreek_defconfig Build for cqreek + dreamcast_defconfig Build for dreamcast + ... + vapor_defconfig Build for vapor + ======================= ============================================= + +which then allows you to do:: + + $ make ARCH=sh CROSS_COMPILE=sh4-linux- vapor_defconfig vmlinux + +which will in turn copy the defconfig for this board, run it through +oldconfig (prompting you for any new options since the time of creation), +and start you on your way to having a functional kernel for your new +board. diff --git a/Documentation/arch/sh/register-banks.rst b/Documentation/arch/sh/register-banks.rst new file mode 100644 index 000000000000..2bef5c8fcbbc --- /dev/null +++ b/Documentation/arch/sh/register-banks.rst @@ -0,0 +1,40 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +Notes on register bank usage in the kernel +========================================== + +Introduction +------------ + +The SH-3 and SH-4 CPU families traditionally include a single partial register +bank (selected by SR.RB, only r0 ... r7 are banked), whereas other families +may have more full-featured banking or simply no such capabilities at all. + +SR.RB banking +------------- + +In the case of this type of banking, banked registers are mapped directly to +r0 ... r7 if SR.RB is set to the bank we are interested in, otherwise ldc/stc +can still be used to reference the banked registers (as r0_bank ... r7_bank) +when in the context of another bank. The developer must keep the SR.RB value +in mind when writing code that utilizes these banked registers, for obvious +reasons. Userspace is also not able to poke at the bank1 values, so these can +be used rather effectively as scratch registers by the kernel. + +Presently the kernel uses several of these registers. + + - r0_bank, r1_bank (referenced as k0 and k1, used for scratch + registers when doing exception handling). + + - r2_bank (used to track the EXPEVT/INTEVT code) + + - Used by do_IRQ() and friends for doing irq mapping based off + of the interrupt exception vector jump table offset + + - r6_bank (global interrupt mask) + + - The SR.IMASK interrupt handler makes use of this to set the + interrupt priority level (used by local_irq_enable()) + + - r7_bank (current) diff --git a/Documentation/arch/sparc/adi.rst b/Documentation/arch/sparc/adi.rst new file mode 100644 index 000000000000..dbcd8b6e7bc3 --- /dev/null +++ b/Documentation/arch/sparc/adi.rst @@ -0,0 +1,286 @@ +================================ +Application Data Integrity (ADI) +================================ + +SPARC M7 processor adds the Application Data Integrity (ADI) feature. +ADI allows a task to set version tags on any subset of its address +space. Once ADI is enabled and version tags are set for ranges of +address space of a task, the processor will compare the tag in pointers +to memory in these ranges to the version set by the application +previously. Access to memory is granted only if the tag in given pointer +matches the tag set by the application. In case of mismatch, processor +raises an exception. + +Following steps must be taken by a task to enable ADI fully: + +1. Set the user mode PSTATE.mcde bit. This acts as master switch for + the task's entire address space to enable/disable ADI for the task. + +2. Set TTE.mcd bit on any TLB entries that correspond to the range of + addresses ADI is being enabled on. MMU checks the version tag only + on the pages that have TTE.mcd bit set. + +3. Set the version tag for virtual addresses using stxa instruction + and one of the MCD specific ASIs. Each stxa instruction sets the + given tag for one ADI block size number of bytes. This step must + be repeated for entire page to set tags for entire page. + +ADI block size for the platform is provided by the hypervisor to kernel +in machine description tables. Hypervisor also provides the number of +top bits in the virtual address that specify the version tag. Once +version tag has been set for a memory location, the tag is stored in the +physical memory and the same tag must be present in the ADI version tag +bits of the virtual address being presented to the MMU. For example on +SPARC M7 processor, MMU uses bits 63-60 for version tags and ADI block +size is same as cacheline size which is 64 bytes. A task that sets ADI +version to, say 10, on a range of memory, must access that memory using +virtual addresses that contain 0xa in bits 63-60. + +ADI is enabled on a set of pages using mprotect() with PROT_ADI flag. +When ADI is enabled on a set of pages by a task for the first time, +kernel sets the PSTATE.mcde bit for the task. Version tags for memory +addresses are set with an stxa instruction on the addresses using +ASI_MCD_PRIMARY or ASI_MCD_ST_BLKINIT_PRIMARY. ADI block size is +provided by the hypervisor to the kernel. Kernel returns the value of +ADI block size to userspace using auxiliary vector along with other ADI +info. Following auxiliary vectors are provided by the kernel: + + ============ =========================================== + AT_ADI_BLKSZ ADI block size. This is the granularity and + alignment, in bytes, of ADI versioning. + AT_ADI_NBITS Number of ADI version bits in the VA + ============ =========================================== + + +IMPORTANT NOTES +=============== + +- Version tag values of 0x0 and 0xf are reserved. These values match any + tag in virtual address and never generate a mismatch exception. + +- Version tags are set on virtual addresses from userspace even though + tags are stored in physical memory. Tags are set on a physical page + after it has been allocated to a task and a pte has been created for + it. + +- When a task frees a memory page it had set version tags on, the page + goes back to free page pool. When this page is re-allocated to a task, + kernel clears the page using block initialization ASI which clears the + version tags as well for the page. If a page allocated to a task is + freed and allocated back to the same task, old version tags set by the + task on that page will no longer be present. + +- ADI tag mismatches are not detected for non-faulting loads. + +- Kernel does not set any tags for user pages and it is entirely a + task's responsibility to set any version tags. Kernel does ensure the + version tags are preserved if a page is swapped out to the disk and + swapped back in. It also preserves that version tags if a page is + migrated. + +- ADI works for any size pages. A userspace task need not be aware of + page size when using ADI. It can simply select a virtual address + range, enable ADI on the range using mprotect() and set version tags + for the entire range. mprotect() ensures range is aligned to page size + and is a multiple of page size. + +- ADI tags can only be set on writable memory. For example, ADI tags can + not be set on read-only mappings. + + + +ADI related traps +================= + +With ADI enabled, following new traps may occur: + +Disrupting memory corruption +---------------------------- + + When a store accesses a memory location that has TTE.mcd=1, + the task is running with ADI enabled (PSTATE.mcde=1), and the ADI + tag in the address used (bits 63:60) does not match the tag set on + the corresponding cacheline, a memory corruption trap occurs. By + default, it is a disrupting trap and is sent to the hypervisor + first. Hypervisor creates a sun4v error report and sends a + resumable error (TT=0x7e) trap to the kernel. The kernel sends + a SIGSEGV to the task that resulted in this trap with the following + info:: + + siginfo.si_signo = SIGSEGV; + siginfo.errno = 0; + siginfo.si_code = SEGV_ADIDERR; + siginfo.si_addr = addr; /* PC where first mismatch occurred */ + siginfo.si_trapno = 0; + + +Precise memory corruption +------------------------- + + When a store accesses a memory location that has TTE.mcd=1, + the task is running with ADI enabled (PSTATE.mcde=1), and the ADI + tag in the address used (bits 63:60) does not match the tag set on + the corresponding cacheline, a memory corruption trap occurs. If + MCD precise exception is enabled (MCDPERR=1), a precise + exception is sent to the kernel with TT=0x1a. The kernel sends + a SIGSEGV to the task that resulted in this trap with the following + info:: + + siginfo.si_signo = SIGSEGV; + siginfo.errno = 0; + siginfo.si_code = SEGV_ADIPERR; + siginfo.si_addr = addr; /* address that caused trap */ + siginfo.si_trapno = 0; + + NOTE: + ADI tag mismatch on a load always results in precise trap. + + +MCD disabled +------------ + + When a task has not enabled ADI and attempts to set ADI version + on a memory address, processor sends an MCD disabled trap. This + trap is handled by hypervisor first and the hypervisor vectors this + trap through to the kernel as Data Access Exception trap with + fault type set to 0xa (invalid ASI). When this occurs, the kernel + sends the task SIGSEGV signal with following info:: + + siginfo.si_signo = SIGSEGV; + siginfo.errno = 0; + siginfo.si_code = SEGV_ACCADI; + siginfo.si_addr = addr; /* address that caused trap */ + siginfo.si_trapno = 0; + + +Sample program to use ADI +------------------------- + +Following sample program is meant to illustrate how to use the ADI +functionality:: + + #include <unistd.h> + #include <stdio.h> + #include <stdlib.h> + #include <elf.h> + #include <sys/ipc.h> + #include <sys/shm.h> + #include <sys/mman.h> + #include <asm/asi.h> + + #ifndef AT_ADI_BLKSZ + #define AT_ADI_BLKSZ 48 + #endif + #ifndef AT_ADI_NBITS + #define AT_ADI_NBITS 49 + #endif + + #ifndef PROT_ADI + #define PROT_ADI 0x10 + #endif + + #define BUFFER_SIZE 32*1024*1024UL + + main(int argc, char* argv[], char* envp[]) + { + unsigned long i, mcde, adi_blksz, adi_nbits; + char *shmaddr, *tmp_addr, *end, *veraddr, *clraddr; + int shmid, version; + Elf64_auxv_t *auxv; + + adi_blksz = 0; + + while(*envp++ != NULL); + for (auxv = (Elf64_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++) { + switch (auxv->a_type) { + case AT_ADI_BLKSZ: + adi_blksz = auxv->a_un.a_val; + break; + case AT_ADI_NBITS: + adi_nbits = auxv->a_un.a_val; + break; + } + } + if (adi_blksz == 0) { + fprintf(stderr, "Oops! ADI is not supported\n"); + exit(1); + } + + printf("ADI capabilities:\n"); + printf("\tBlock size = %ld\n", adi_blksz); + printf("\tNumber of bits = %ld\n", adi_nbits); + + if ((shmid = shmget(2, BUFFER_SIZE, + IPC_CREAT | SHM_R | SHM_W)) < 0) { + perror("shmget failed"); + exit(1); + } + + shmaddr = shmat(shmid, NULL, 0); + if (shmaddr == (char *)-1) { + perror("shm attach failed"); + shmctl(shmid, IPC_RMID, NULL); + exit(1); + } + + if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE|PROT_ADI)) { + perror("mprotect failed"); + goto err_out; + } + + /* Set the ADI version tag on the shm segment + */ + version = 10; + tmp_addr = shmaddr; + end = shmaddr + BUFFER_SIZE; + while (tmp_addr < end) { + asm volatile( + "stxa %1, [%0]0x90\n\t" + : + : "r" (tmp_addr), "r" (version)); + tmp_addr += adi_blksz; + } + asm volatile("membar #Sync\n\t"); + + /* Create a versioned address from the normal address by placing + * version tag in the upper adi_nbits bits + */ + tmp_addr = (void *) ((unsigned long)shmaddr << adi_nbits); + tmp_addr = (void *) ((unsigned long)tmp_addr >> adi_nbits); + veraddr = (void *) (((unsigned long)version << (64-adi_nbits)) + | (unsigned long)tmp_addr); + + printf("Starting the writes:\n"); + for (i = 0; i < BUFFER_SIZE; i++) { + veraddr[i] = (char)(i); + if (!(i % (1024 * 1024))) + printf("."); + } + printf("\n"); + + printf("Verifying data..."); + fflush(stdout); + for (i = 0; i < BUFFER_SIZE; i++) + if (veraddr[i] != (char)i) + printf("\nIndex %lu mismatched\n", i); + printf("Done.\n"); + + /* Disable ADI and clean up + */ + if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE)) { + perror("mprotect failed"); + goto err_out; + } + + if (shmdt((const void *)shmaddr) != 0) + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + + exit(0); + + err_out: + if (shmdt((const void *)shmaddr) != 0) + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(1); + } diff --git a/Documentation/arch/sparc/console.rst b/Documentation/arch/sparc/console.rst new file mode 100644 index 000000000000..73132db83ece --- /dev/null +++ b/Documentation/arch/sparc/console.rst @@ -0,0 +1,9 @@ +Steps for sending 'break' on sunhv console +========================================== + +On Baremetal: + 1. press Esc + 'B' + +On LDOM: + 1. press Ctrl + ']' + 2. telnet> send break diff --git a/Documentation/arch/sparc/features.rst b/Documentation/arch/sparc/features.rst new file mode 100644 index 000000000000..c0c92468b0fe --- /dev/null +++ b/Documentation/arch/sparc/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features sparc diff --git a/Documentation/arch/sparc/index.rst b/Documentation/arch/sparc/index.rst new file mode 100644 index 000000000000..ae884224eec2 --- /dev/null +++ b/Documentation/arch/sparc/index.rst @@ -0,0 +1,13 @@ +================== +Sparc Architecture +================== + +.. toctree:: + :maxdepth: 1 + + console + adi + + oradax/oracle-dax + + features diff --git a/Documentation/arch/sparc/oradax/dax-hv-api.txt b/Documentation/arch/sparc/oradax/dax-hv-api.txt new file mode 100644 index 000000000000..7ecd0bf4957b --- /dev/null +++ b/Documentation/arch/sparc/oradax/dax-hv-api.txt @@ -0,0 +1,1433 @@ +Excerpt from UltraSPARC Virtual Machine Specification +Compiled from version 3.0.20+15 +Publication date 2017-09-25 08:21 +Copyright © 2008, 2015 Oracle and/or its affiliates. All rights reserved. +Extracted via "pdftotext -f 547 -l 572 -layout sun4v_20170925.pdf" +Authors: + Charles Kunzman + Sam Glidden + Mark Cianchetti + + +Chapter 36. Coprocessor services + The following APIs provide access via the Hypervisor to hardware assisted data processing functionality. + These APIs may only be provided by certain platforms, and may not be available to all virtual machines + even on supported platforms. Restrictions on the use of these APIs may be imposed in order to support + live-migration and other system management activities. + +36.1. Data Analytics Accelerator + The Data Analytics Accelerator (DAX) functionality is a collection of hardware coprocessors that provide + high speed processoring of database-centric operations. The coprocessors may support one or more of + the following data query operations: search, extraction, compression, decompression, and translation. The + functionality offered may vary by virtual machine implementation. + + The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual device + compatibility property. Functionality is accessed through the submission of Command Control Blocks + (CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the status + of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has a + separate Completion Area and, unless execution order is specifically restricted through the use of serial- + conditional flags, the execution order of submitted CCBs is arbitrary. Likewise, the time to completion + for a given CCB is never guaranteed. + + Guest software may implement a software timeout on CCB operations, and if the timeout is exceeded, the + operation may be cancelled or killed via the ccb_kill API function. It is recommended for guest software + to implement a software timeout to account for certain RAS errors which may result in lost CCBs. It is + recommended such implementation use the ccb_info API function to check the status of a CCB prior to + killing it in order to determine if the CCB is still in queue, or may have been lost due to a RAS error. + + There is no fixed limit on the number of outstanding CCBs guest software may have queued in the virtual + machine, however, internal resource limitations within the virtual machine can cause CCB submissions + to be temporarily rejected with EWOULDBLOCK. In such cases, guests should continue to attempt + submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would + not be a guarantee that a future submission would succeed. + + The availablility of DAX coprocessor command service is indicated by the presence of the DAX virtual + device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device + node”). + +36.1.1. DAX Compatibility Property + The query functionality may vary based on the compatibility property of the virtual device: + +36.1.1.1. "ORCL,sun4v-dax" Device Compatibility + Available CCB commands: + + • No-op/Sync + + • Extract + + • Scan Value + + • Inverted Scan Value + + • Scan Range + + + 509 + Coprocessor services + + + • Inverted Scan Range + + • Translate + + • Inverted Translate + + • Select + + See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats. + + Only version 0 CCBs are available. + +36.1.1.2. "ORCL,sun4v-dax-fc" Device Compatibility + "ORCL,sun4v-dax-fc" is compatible with the "ORCL,sun4v-dax" interface, and includes additional CCB + bit fields and controls. + +36.1.1.3. "ORCL,sun4v-dax2" Device Compatibility + Available CCB commands: + + • No-op/Sync + + • Extract + + • Scan Value + + • Inverted Scan Value + + • Scan Range + + • Inverted Scan Range + + • Translate + + • Inverted Translate + + • Select + + See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats. + + Version 0 and 1 CCBs are available. Only version 0 CCBs may use Huffman encoded data, whereas only + version 1 CCBs may use OZIP. + +36.1.2. DAX Virtual Device Interrupts + The DAX virtual device has multiple interrupts associated with it which may be used by the guest if + desired. The number of device interrupts available to the guest is indicated in the virtual device node of the + guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). If the device + node indicates N interrupts available, the guest may use any value from 0 to N - 1 (inclusive) in a CCB + interrupt number field. Using values outside this range will result in the CCB being rejected for an invalid + field value. + + The interrupts may be bound and managed using the standard sun4v device interrupts API (Chapter 16, + Device interrupt services). Sysino interrupts are not available for DAX devices. + +36.2. Coprocessor Control Block (CCB) + CCBs are either 64 or 128 bytes long, depending on the operation type. The exact contents of the CCB + are command specific, but all CCBs contain at least one memory buffer address. All memory locations + + + 510 + Coprocessor services + + +referenced by a CCB must be pinned in memory until the CCB either completes execution or is killed +via the ccb_kill API call. Changes in virtual address mappings occurring after CCB submission are not +guaranteed to be visible, and as such all virtual address updates need to be synchronized with CCB +execution. + +All CCBs begin with a common 32-bit header. + +Table 36.1. CCB Header Format +Bits Field Description +[31:28] CCB version. For API version 2.0: set to 1 if CCB uses OZIP encoding; set to 0 if the CCB + uses Huffman encoding; otherwise either 0 or 1. For API version 1.0: always set to 0. +[27] When API version 2.0 is negotiated, this is the Pipeline Flag [512]. It is reserved in + API version 1.0 +[26] Long CCB flag [512] +[25] Conditional synchronization flag [512] +[24] Serial synchronization flag +[23:16] CCB operation code: + 0x00 No Operation (No-op) or Sync + 0x01 Extract + 0x02 Scan Value + 0x12 Inverted Scan Value + 0x03 Scan Range + 0x13 Inverted Scan Range + 0x04 Translate + 0x14 Inverted Translate + 0x05 Select +[15:13] Reserved +[12:11] Table address type + 0b'00 No address + 0b'01 Alternate context virtual address + 0b'10 Real address + 0b'11 Primary context virtual address +[10:8] Output/Destination address type + 0b'000 No address + 0b'001 Alternate context virtual address + 0b'010 Real address + 0b'011 Primary context virtual address + 0b'100 Reserved + 0b'101 Reserved + 0b'110 Reserved + 0b'111 Reserved +[7:5] Secondary source address type + + + 511 + Coprocessor services + + +Bits Field Description + 0b'000 No address + 0b'001 Alternate context virtual address + 0b'010 Real address + 0b'011 Primary context virtual address + 0b'100 Reserved + 0b'101 Reserved + 0b'110 Reserved + 0b'111 Reserved +[4:2] Primary source address type + 0b'000 No address + 0b'001 Alternate context virtual address + 0b'010 Real address + 0b'011 Primary context virtual address + 0b'100 Reserved + 0b'101 Reserved + 0b'110 Reserved + 0b'111 Reserved +[1:0] Completion area address type + 0b'00 No address + 0b'01 Alternate context virtual address + 0b'10 Real address + 0b'11 Primary context virtual address + +The Long CCB flag indicates whether the submitted CCB is 64 or 128 bytes long; value is 0 for 64 bytes +and 1 for 128 bytes. + +The Serial and Conditional flags allow simple relative ordering between CCBs. Any CCB with the Serial +flag set will execute sequentially relative to any previous CCB that is also marked as Serial in the same +CCB submission. CCBs without the Serial flag set execute independently, even if they are between CCBs +with the Serial flag set. CCBs marked solely with the Serial flag will execute upon the completion of the +previous Serial CCB, regardless of the completion status of that CCB. The Conditional flag allows CCBs +to conditionally execute based on the successful execution of the closest CCB marked with the Serial flag. +A CCB may only be conditional on exactly one CCB, however, a CCB may be marked both Conditional +and Serial to allow execution chaining. The flags do NOT allow fan-out chaining, where multiple CCBs +execute in parallel based on the completion of another CCB. + +The Pipeline flag is an optimization that directs the output of one CCB (the "source" CCB) directly to +the input of the next CCB (the "target" CCB). The target CCB thus does not need to read the input from +memory. The Pipeline flag is advisory and may be dropped. + +Both the Pipeline and Serial bits must be set in the source CCB. The Conditional bit must be set in the +target CCB. Exactly one CCB must be made conditional on the source CCB; either 0 or 2 target CCBs +is invalid. However, Pipelines can be extended beyond two CCBs: the sequence would start with a CCB +with both the Pipeline and Serial bits set, proceed through CCBs with the Pipeline, Serial, and Conditional +bits set, and terminate at a CCB that has the Conditional bit set, but not the Pipeline bit. + + + 512 + Coprocessor services + + + The input of the target CCB must start within 64 bytes of the output of the source CCB or the pipeline flag + will be ignored. All CCBs in a pipeline must be submitted in the same call to ccb_submit. + + The various address type fields indicate how the various address values used in the CCB should be + interpreted by the virtual machine. Not all of the types specified are used by every CCB format. Types + which are not applicable to the given CCB command should be indicated as type 0 (No address). Virtual + addresses used in the CCB must have translation entries present in either the TLB or a configured TSB + for the submitting virtual processor. Virtual addresses which cannot be translated by the virtual machine + will result in the CCB submission being rejected, with the causal virtual address indicated. The CCB + may be resubmitted after inserting the translation, or the address may be translated by guest software and + resubmitted using the real address translation. + +36.2.1. Query CCB Command Formats +36.2.1.1. Supported Data Formats, Elements Sizes and Offsets + Data for query commands may be encoded in multiple possible formats. The data query commands use a + common set of values to indicate the encoding formats of the data being processed. Some encoding formats + require multiple data streams for processing, requiring the specification of both primary data formats (the + encoded data) and secondary data streams (meta-data for the encoded data). + +36.2.1.1.1. Primary Input Format + + The primary input format code is a 4-bit field when it is used. There are 10 primary input formats available. + The packed formats are not endian neutral. Code values not listed below are reserved. + + Code Format Description + 0x0 Fixed width byte packed Up to 16 bytes + 0x1 Fixed width bit packed Up to 15 bits (CCB version 0) or 23 bits (CCB version + 1); bits are read most significant bit to least significant bit + within a byte + 0x2 Variable width byte packed Data stream of lengths must be provided as a secondary + input + 0x4 Fixed width byte packed with run Up to 16 bytes; data stream of run lengths must be + length encoding provided as a secondary input + 0x5 Fixed width bit packed with run Up to 15 bits (CCB version 0) or 23 bits (CCB version + length encoding 1); bits are read most significant bit to least significant bit + within a byte; data stream of run lengths must be provided + as a secondary input + 0x8 Fixed width byte packed with Up to 16 bytes before the encoding; compressed stream + Huffman (CCB version 0) or bits are read most significant bit to least significant bit + OZIP (CCB version 1) encoding within a byte; pointer to the encoding table must be + provided + 0x9 Fixed width bit packed with Up to 15 bits (CCB version 0) or 23 bits (CCB version + Huffman (CCB version 0) or 1); compressed stream bits are read most significant bit to + OZIP (CCB version 1) encoding least significant bit within a byte; pointer to the encoding + table must be provided + 0xA Variable width byte packed with Up to 16 bytes before the encoding; compressed stream + Huffman (CCB version 0) or bits are read most significant bit to least significant bit + OZIP (CCB version 1) encoding within a byte; data stream of lengths must be provided as + a secondary input; pointer to the encoding table must be + provided + + + 513 + Coprocessor services + + + Code Format Description + 0xC Fixed width byte packed with Up to 16 bytes before the encoding; compressed stream + run length encoding, followed by bits are read most significant bit to least significant bit + Huffman (CCB version 0) or within a byte; data stream of run lengths must be provided + OZIP (CCB version 1) encoding as a secondary input; pointer to the encoding table must + be provided + 0xD Fixed width bit packed with Up to 15 bits (CCB version 0) or 23 bits(CCB version 1) + run length encoding, followed by before the encoding; compressed stream bits are read most + Huffman (CCB version 0) or significant bit to least significant bit within a byte; data + OZIP (CCB version 1) encoding stream of run lengths must be provided as a secondary + input; pointer to the encoding table must be provided + + If OZIP encoding is used, there must be no reserved bytes in the table. + +36.2.1.1.2. Primary Input Element Size + + For primary input data streams with fixed size elements, the element size must be indicated in the CCB + command. The size is encoded as the number of bits or bytes, minus one. The valid value range for this + field depends on the input format selected, as listed in the table above. + +36.2.1.1.3. Secondary Input Format + + For primary input data streams which require a secondary input stream, the secondary input stream is + always encoded in a fixed width, bit-packed format. The bits are read from most significant bit to least + significant bit within a byte. There are two encoding options for the secondary input stream data elements, + depending on whether the value of 0 is needed: + + Secondary Input Description + Format Code + 0 Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates + to 2, etc) + 1 Element is stored as value + +36.2.1.1.4. Secondary Input Element Size + + Secondary input element size is encoded as a two bit field: + + Secondary Input Size Description + Code + 0x0 1 bit + 0x1 2 bits + 0x2 4 bits + 0x3 8 bits + +36.2.1.1.5. Input Element Offsets + + Bit-wise input data streams may have any alignment within the base addressed byte. The offset, specified + from most significant bit to least significant bit, is provided as a fixed 3 bit field for each input type. A + value of 0 indicates that the first input element begins at the most significant bit in the first byte, and a + value of 7 indicates it begins with the least significant bit. + + This field should be zero for any byte-wise primary input data streams. + + + 514 + Coprocessor services + + +36.2.1.1.6. Output Format + + Query commands support multiple sizes and encodings for output data streams. There are four possible + output encodings, and up to four supported element sizes per encoding. Not all output encodings are + supported for every command. The format is indicated by a 4-bit field in the CCB: + + Output Format Code Description + 0x0 Byte aligned, 1 byte elements + 0x1 Byte aligned, 2 byte elements + 0x2 Byte aligned, 4 byte elements + 0x3 Byte aligned, 8 byte elements + 0x4 16 byte aligned, 16 byte elements + 0x5 Reserved + 0x6 Reserved + 0x7 Reserved + 0x8 Packed vector of single bit elements + 0x9 Reserved + 0xA Reserved + 0xB Reserved + 0xC Reserved + 0xD 2 byte elements where each element is the index value of a bit, + from an bit vector, which was 1. + 0xE 4 byte elements where each element is the index value of a bit, + from an bit vector, which was 1. + 0xF Reserved + +36.2.1.1.7. Application Data Integrity (ADI) + + On platforms which support ADI, the ADI version number may be specified for each separate memory + access type used in the CCB command. ADI checking only occurs when reading data. When writing data, + the specified ADI version number overwrites any existing ADI value in memory. + + An ADI version value of 0 or 0xF indicates the ADI checking is disabled for that data access, even if it is + enabled in memory. By setting the appropriate flag in CCB_SUBMIT (Section 36.3.1, “ccb_submit”) it is + also an option to disable ADI checking for all inputs accessed via virtual address for all CCBs submitted + during that hypercall invocation. + + The ADI value is only guaranteed to be checked on the first 64 bytes of each data access. Mismatches on + subsequent data chunks may not be detected, so guest software should be careful to use page size checking + to protect against buffer overruns. + +36.2.1.1.8. Page size checking + + All data accesses used in CCB commands must be bounded within a single memory page. When addresses + are provided using a virtual address, the page size for checking is extracted from the TTE for that virtual + address. When using real addresses, the guest must supply the page size in the same field as the address + value. The page size must be one of the sizes supported by the underlying virtual machine. Using a value + that is not supported may result in the CCB submission being rejected or the generation of a CCB parsing + error in the completion area. + + + 515 + Coprocessor services + + +36.2.1.2. Extract command + + Converts an input vector in one format to an output vector in another format. All input format types are + supported. + + The only supported output format is a padded, byte-aligned output stream, using output codes 0x0 - 0x4. + When the specified output element size is larger than the extracted input element size, zeros are padded to + the extracted input element. First, if the decompressed input size is not a whole number of bytes, 0 bits are + padded to the most significant bit side till the next byte boundary. Next, if the output element size is larger + than the byte padded input element, bytes of value 0 are added based on the Padding Direction bit in the + CCB. If the output element size is smaller than the byte-padded input element size, the input element is + truncated by dropped from the least significant byte side until the selected output size is reached. + + The return value of the CCB completion area is invalid. The “number of elements processed” field in the + CCB completion area will be valid. + + The extract CCB is a 64-byte “short format” CCB. + + The extract CCB command format can be specified by the following packed C structure for a big-endian + machine: + + + struct extract_ccb { + uint32_t header; + uint32_t control; + uint64_t completion; + uint64_t primary_input; + uint64_t data_access_control; + uint64_t secondary_input; + uint64_t reserved; + uint64_t output; + uint64_t table; + }; + + + The exact field offsets, sizes, and composition are as follows: + + Offset Size Field Description + 0 4 CCB header (Table 36.1, “CCB Header Format”) + 4 4 Command control + Bits Field Description + [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input + Format”) + [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary + Input Element Size”) + [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary + Input Format”) + [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + + + 516 + Coprocessor services + + +Offset Size Field Description + Bits Field Description + [15:14] Secondary Input Element Size (see Section 36.2.1.1.4, + “Secondary Input Element Size” + [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”) + [9] Padding Direction selector: A value of 1 causes padding bytes + to be added to the left side of output elements. A value of 0 + causes padding bytes to be added to the right side of output + elements. + [8:0] Reserved +8 8 Completion + Bits Field Description + [63:60] ADI version (see Section 36.2.1.1.7, “Application Data + Integrity (ADI)”) + [59] If set to 1, a virtual device interrupt will be generated using + the device interrupt number specified in the lower bits of this + completion word. If 0, the lower bits of this completion word + are ignored. + [58:6] Completion area address bits [58:6]. Address type is + determined by CCB header. + [5:0] Virtual device interrupt number for completion interrupt, if + enabled. +16 8 Primary Input + Bits Field Description + [63:60] ADI version (see Section 36.2.1.1.7, “Application Data + Integrity (ADI)”) + [59:56] If using real address, these bits should be filled in with the + page size code for the page boundary checking the guest wants + the virtual machine to use when accessing this data stream + (checking is only guaranteed to be performed when using API + version 1.1 and later). If using a virtual address, this field will + be used as as primary input address bits [59:56]. + [55:0] Primary input address bits [55:0]. Address type is determined + by CCB header. +24 8 Data Access Control + Bits Field Description + [63:62] Flow Control + Value Description + 0b'00 Disable flow control + 0b'01 Enable flow control (only valid with "ORCL,sun4v- + dax-fc" compatible virtual device variants) + 0b'10 Reserved + 0b'11 Reserved + [61:60] Reserved (API 1.0) + + + 517 + Coprocessor services + + +Offset Size Field Description + Bits Field Description + Pipeline target (API 2.0) + Value Description + 0b'00 Connect to primary input + 0b'01 Connect to secondary input + 0b'10 Reserved + 0b'11 Reserved + [59:40] Output buffer size given in units of 64 bytes, minus 1. Value of + 0 means 64 bytes, value of 1 means 128 bytes, etc. Buffer size is + only enforced if flow control is enabled in Flow Control field. + [39:32] Reserved + [31:30] Output Data Cache Allocation + Value Description + 0b'00 Do not allocate cache lines for output data stream. + 0b'01 Force cache lines for output data stream to be + allocated in the cache that is local to the submitting + virtual cpu. + 0b'10 Allocate cache lines for output data stream, but allow + existing cache lines associated with the data to remain + in their current cache instance. Any memory not + already in cache will be allocated in the cache local + to the submitting virtual cpu. + 0b'11 Reserved + [29:26] Reserved + [25:24] Primary Input Length Format + Value Description + 0b'00 Number of primary symbols + 0b'01 Number of primary bytes + 0b'10 Number of primary bits + 0b'11 Reserved + [23:0] Primary Input Length + Format Field Value + # of primary symbols Number of input elements to process, + minus 1. Command execution stops + once count is reached. + # of primary bytes Number of input bytes to process, + minus 1. Command execution stops + once count is reached. The count is + done before any decompression or + decoding. + # of primary bits Number of input bits to process, + minus 1. Command execution stops + + + + 518 + Coprocessor services + + + Offset Size Field Description + Bits Field Description + Format Field Value + once count is reached. The count is + done before any decompression or + decoding, and does not include any + bits skipped by the Primary Input + Offset field value of the command + control word. + 32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary + Input. + 40 8 Reserved + 48 8 Output (same fields as Primary Input) + 56 8 Symbol Table (if used by Primary Input) + Bits Field Description + [63:60] ADI version (see Section 36.2.1.1.7, “Application Data + Integrity (ADI)”) + [59:56] If using real address, these bits should be filled in with the + page size code for the page boundary checking the guest wants + the virtual machine to use when accessing this data stream + (checking is only guaranteed to be performed when using API + version 1.1 and later). If using a virtual address, this field will + be used as as symbol table address bits [59:56]. + [55:4] Symbol table address bits [55:4]. Address type is determined + by CCB header. + [3:0] Symbol table version + Value Description + 0 Huffman encoding. Must use 64 byte aligned table + address. (Only available when using version 0 CCBs) + 1 OZIP encoding. Must use 16 byte aligned table + address. (Only available when using version 1 CCBs) + + +36.2.1.3. Scan commands + + The scan commands search a stream of input data elements for values which match the selection criteria. + All the input format types are supported. There are multiple formats for the scan commands, allowing the + scan to search for exact matches to one value, exact matches to either of two values, or any value within + a specified range. The specific type of scan is indicated by the command code in the CCB header. For the + scan range commands, the boundary conditions can be specified as greater-than-or-equal-to a value, less- + than-or-equal-to a value, or both by using two boundary values. + + There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8, + 0xD, and 0xE). For the standard scan command using the bit vector output, for each input element there + exists one bit in the vector that is set if the input element matched the scan criteria, or clear if not. The + inverted scan command inverts the polarity of the bits in the output. The most significant bit of the first + byte of the output stream corresponds to the first element in the input stream. The standard index array + output format contains one array entry for each input element that matched the scan criteria. Each array + + + + 519 + Coprocessor services + + +entry is the index of an input element that matched the scan criteria. An inverted scan command produces +a similar array, but of all the input elements which did NOT match the scan criteria. + +The return value of the CCB completion area contains the number of input elements found which match +the scan criteria (or number that did not match for the inverted scans). The “number of elements processed” +field in the CCB completion area will be valid, indicating the number of input elements processed. + +These commands are 128-byte “long format” CCBs. + +The scan CCB command format can be specified by the following packed C structure for a big-endian +machine: + + + struct scan_ccb { + uint32_t header; + uint32_t control; + uint64_t completion; + uint64_t primary_input; + uint64_t data_access_control; + uint64_t secondary_input; + uint64_t match_criteria0; + uint64_t output; + uint64_t table; + uint64_t match_criteria1; + uint64_t match_criteria2; + uint64_t match_criteria3; + uint64_t reserved[5]; + }; + + +The exact field offsets, sizes, and composition are as follows: + +Offset Size Field Description +0 4 CCB header (Table 36.1, “CCB Header Format”) +4 4 Command control + Bits Field Description + [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input + Format”) + [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary + Input Element Size”) + [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary + Input Format”) + [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [15:14] Secondary Input Element Size (see Section 36.2.1.1.4, + “Secondary Input Element Size” + [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”) + [9:5] Operand size for first scan criteria value. In a scan value + operation, this is one of two potential exact match values. + In a scan range operation, this is the size of the upper range + + + 520 + Coprocessor services + + +Offset Size Field Description + Bits Field Description + boundary. The value of this field is the number of bytes in the + operand, minus 1. Values 0xF-0x1E are reserved. A value of + 0x1F indicates this operand is not in use for this scan operation. + [4:0] Operand size for second scan criteria value. In a scan value + operation, this is one of two potential exact match values. + In a scan range operation, this is the size of the lower range + boundary. The value of this field is the number of bytes in the + operand, minus 1. Values 0xF-0x1E are reserved. A value of + 0x1F indicates this operand is not in use for this scan operation. +8 8 Completion (same fields as Section 36.2.1.2, “Extract command”) +16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”) +24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”) +32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary + Input. +40 4 Most significant 4 bytes of first scan criteria operand. If first operand is less + than 4 bytes, the value is left-aligned to the lowest address bytes. +44 4 Most significant 4 bytes of second scan criteria operand. If second operand + is less than 4 bytes, the value is left-aligned to the lowest address bytes. +48 8 Output (same fields as Primary Input) +56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2, + “Extract command” +64 4 Next 4 most significant bytes of first scan criteria operand occurring after the + bytes specified at offset 40, if needed by the operand size. If first operand + is less than 8 bytes, the valid bytes are left-aligned to the lowest address. +68 4 Next 4 most significant bytes of second scan criteria operand occurring after + the bytes specified at offset 44, if needed by the operand size. If second + operand is less than 8 bytes, the valid bytes are left-aligned to the lowest + address. +72 4 Next 4 most significant bytes of first scan criteria operand occurring after the + bytes specified at offset 64, if needed by the operand size. If first operand + is less than 12 bytes, the valid bytes are left-aligned to the lowest address. +76 4 Next 4 most significant bytes of second scan criteria operand occurring after + the bytes specified at offset 68, if needed by the operand size. If second + operand is less than 12 bytes, the valid bytes are left-aligned to the lowest + address. +80 4 Next 4 most significant bytes of first scan criteria operand occurring after the + bytes specified at offset 72, if needed by the operand size. If first operand + is less than 16 bytes, the valid bytes are left-aligned to the lowest address. +84 4 Next 4 most significant bytes of second scan criteria operand occurring after + the bytes specified at offset 76, if needed by the operand size. If second + operand is less than 16 bytes, the valid bytes are left-aligned to the lowest + address. + + + + + 521 + Coprocessor services + + +36.2.1.4. Translate commands + + The translate commands takes an input array of indices, and a table of single bit values indexed by those + indices, and outputs a bit vector or index array created by reading the tables bit value at each index in + the input array. The output should therefore contain exactly one bit per index in the input data stream, + when outputting as a bit vector. When outputting as an index array, the number of elements depends on the + values read in the bit table, but will always be less than, or equal to, the number of input elements. Only + a restricted subset of the possible input format types are supported. No variable width or Huffman/OZIP + encoded input streams are allowed. The primary input data element size must be 3 bytes or less. + + The maximum table index size allowed is 15 bits, however, larger input elements may be used to provide + additional processing of the output values. If 2 or 3 byte values are used, the least significant 15 bits are + used as an index into the bit table. The most significant 9 bits (when using 3-byte input elements) or single + bit (when using 2-byte input elements) are compared against a fixed 9-bit test value provided in the CCB. + If the values match, the value from the bit table is used as the output element value. If the values do not + match, the output data element value is forced to 0. + + In the inverted translate operation, the bit value read from bit table is inverted prior to its use. The additional + additional processing based on any additional non-index bits remains unchanged, and still forces the output + element value to 0 on a mismatch. The specific type of translate command is indicated by the command + code in the CCB header. + + There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8, + 0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the + output format was a bit array. + + The return value of the CCB completion area contains the number of bits set in the output bit vector, + or number of elements in the output index array. The “number of elements processed” field in the CCB + completion area will be valid, indicating the number of input elements processed. + + These commands are 64-byte “short format” CCBs. + + The translate CCB command format can be specified by the following packed C structure for a big-endian + machine: + + + struct translate_ccb { + uint32_t header; + uint32_t control; + uint64_t completion; + uint64_t primary_input; + uint64_t data_access_control; + uint64_t secondary_input; + uint64_t reserved; + uint64_t output; + uint64_t table; + }; + + + The exact field offsets, sizes, and composition are as follows: + + + Offset Size Field Description + 0 4 CCB header (Table 36.1, “CCB Header Format”) + + + 522 + Coprocessor services + + +Offset Size Field Description +4 4 Command control + Bits Field Description + [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input + Format”) + [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary + Input Element Size”) + [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary + Input Format”) + [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [15:14] Secondary Input Element Size (see Section 36.2.1.1.4, + “Secondary Input Element Size” + [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”) + [9] Reserved + [8:0] Test value used for comparison against the most significant bits + in the input values, when using 2 or 3 byte input elements. +8 8 Completion (same fields as Section 36.2.1.2, “Extract command” +16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command” +24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”, + except Primary Input Length Format may not use the 0x0 value) +32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary + Input. +40 8 Reserved +48 8 Output (same fields as Primary Input) +56 8 Bit Table + Bits Field Description + [63:60] ADI version (see Section 36.2.1.1.7, “Application Data + Integrity (ADI)”) + [59:56] If using real address, these bits should be filled in with the + page size code for the page boundary checking the guest wants + the virtual machine to use when accessing this data stream + (checking is only guaranteed to be performed when using API + version 1.1 and later). If using a virtual address, this field will + be used as as bit table address bits [59:56] + [55:4] Bit table address bits [55:4]. Address type is determined by + CCB header. Address must be 64-byte aligned (CCB version + 0) or 16-byte aligned (CCB version 1). + [3:0] Bit table version + Value Description + 0 4KB table size + 1 8KB table size + + + + 523 + Coprocessor services + + +36.2.1.5. Select command + The select command filters the primary input data stream by using a secondary input bit vector to determine + which input elements to include in the output. For each bit set at a given index N within the bit vector, + the Nth input element is included in the output. If the bit is not set, the element is not included. Only a + restricted subset of the possible input format types are supported. No variable width or run length encoded + input streams are allowed, since the secondary input stream is used for the filtering bit vector. + + The only supported output format is a padded, byte-aligned output stream. The stream follows the same + rules and restrictions as padded output stream described in Section 36.2.1.2, “Extract command”. + + The return value of the CCB completion area contains the number of bits set in the input bit vector. The + "number of elements processed" field in the CCB completion area will be valid, indicating the number + of input elements processed. + + The select CCB is a 64-byte “short format” CCB. + + The select CCB command format can be specified by the following packed C structure for a big-endian + machine: + + + struct select_ccb { + uint32_t header; + uint32_t control; + uint64_t completion; + uint64_t primary_input; + uint64_t data_access_control; + uint64_t secondary_input; + uint64_t reserved; + uint64_t output; + uint64_t table; + }; + + + The exact field offsets, sizes, and composition are as follows: + + Offset Size Field Description + 0 4 CCB header (Table 36.1, “CCB Header Format”) + 4 4 Command control + Bits Field Description + [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input + Format”) + [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary + Input Element Size”) + [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary + Input Format”) + [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input + Element Offsets”) + [15:14] Secondary Input Element Size (see Section 36.2.1.1.4, + “Secondary Input Element Size” + + + 524 + Coprocessor services + + + Offset Size Field Description + Bits Field Description + [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”) + [9] Padding Direction selector: A value of 1 causes padding bytes + to be added to the left side of output elements. A value of 0 + causes padding bytes to be added to the right side of output + elements. + [8:0] Reserved + 8 8 Completion (same fields as Section 36.2.1.2, “Extract command” + 16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command” + 24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”) + 32 8 Secondary Bit Vector Input. Same fields as Primary Input. + 40 8 Reserved + 48 8 Output (same fields as Primary Input) + 56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2, + “Extract command” + +36.2.1.6. No-op and Sync commands + The no-op (no operation) command is a CCB which has no processing effect. The CCB, when processed + by the virtual machine, simply updates the completion area with its execution status. The CCB may have + the serial-conditional flags set in order to restrict when it executes. + + The sync command is a variant of the no-op command which with restricted execution timing. A sync + command CCB will only execute when all previous commands submitted in the same request have + completed. This is stronger than the conditional flag sequencing, which is only dependent on a single + previous serial CCB. While the relative ordering is guaranteed, virtual machine implementations with + shared hardware resources may cause the sync command to wait for longer than the minimum required + time. + + The return value of the CCB completion area is invalid for these CCBs. The “number of elements + processed” field is also invalid for these CCBs. + + These commands are 64-byte “short format” CCBs. + + The no-op CCB command format can be specified by the following packed C structure for a big-endian + machine: + + + struct nop_ccb { + uint32_t header; + uint32_t control; + uint64_t completion; + uint64_t reserved[6]; + }; + + + The exact field offsets, sizes, and composition are as follows: + + Offset Size Field Description + 0 4 CCB header (Table 36.1, “CCB Header Format”) + + + 525 + Coprocessor services + + + Offset Size Field Description + 4 4 Command control + Bits Field Description + [31] If set, this CCB functions as a Sync command. If clear, this + CCB functions as a No-op command. + [30:0] Reserved + 8 8 Completion (same fields as Section 36.2.1.2, “Extract command” + 16 46 Reserved + +36.2.2. CCB Completion Area + All CCB commands use a common 128-byte Completion Area format, which can be specified by the + following packed C structure for a big-endian machine: + + + struct completion_area { + uint8_t status_flag; + uint8_t error_note; + uint8_t rsvd0[2]; + uint32_t error_values; + uint32_t output_size; + uint32_t rsvd1; + uint64_t run_time; + uint64_t run_stats; + uint32_t elements; + uint8_t rsvd2[20]; + uint64_t return_value; + uint64_t extra_return_value[8]; + }; + + + The Completion Area must be a 128-byte aligned memory location. The exact layout can be described + using byte offsets and sizes relative to the memory base: + + Offset Size Field Description + 0 1 CCB execution status + 0x0 Command not yet completed + 0x1 Command ran and succeeded + 0x2 Command ran and failed (partial results may be been + produced) + 0x3 Command ran and was killed (partial execution may + have occurred) + 0x4 Command was not run + 0x5-0xF Reserved + 1 1 Error reason code + 0x0 Reserved + 0x1 Buffer overflow + + + 526 + Coprocessor services + + +Offset Size Field Description + 0x2 CCB decoding error + 0x3 Page overflow + 0x4-0x6 Reserved + 0x7 Command was killed + 0x8 Command execution timeout + 0x9 ADI miscompare error + 0xA Data format error + 0xB-0xD Reserved + 0xE Unexpected hardware error (Do not retry) + 0xF Unexpected hardware error (Retry is ok) + 0x10-0x7F Reserved + 0x80 Partial Symbol Warning + 0x81-0xFF Reserved +2 2 Reserved +4 4 If a partial symbol warning was generated, this field contains the number + of remaining bits which were not decoded. +8 4 Number of bytes of output produced +12 4 Reserved +16 8 Runtime of command (unspecified time units) +24 8 Reserved +32 4 Number of elements processed +36 20 Reserved +56 8 Return value +64 64 Extended return value + +The CCB completion area should be treated as read-only by guest software. The CCB execution status +byte will be cleared by the Hypervisor to reflect the pending execution status when the CCB is submitted +successfully. All other fields are considered invalid upon CCB submission until the CCB execution status +byte becomes non-zero. + +CCBs which complete with status 0x2 or 0x3 may produce partial results and/or side effects due to partial +execution of the CCB command. Some valid data may be accessible depending on the fault type, however, +it is recommended that guest software treat the destination buffer as being in an unknown state. If a CCB +completes with a status byte of 0x2, the error reason code byte can be read to determine what corrective +action should be taken. + +A buffer overflow indicates that the results of the operation exceeded the size of the output buffer indicated +in the CCB. The operation can be retried by resubmitting the CCB with a larger output buffer. + +A CCB decoding error indicates that the CCB contained some invalid field values. It may be also be +triggered if the CCB output is directed at a non-existent secondary input and the pipelining hint is followed. + +A page overflow error indicates that the operation required accessing a memory location beyond the page +size associated with a given address. No data will have been read or written past the page boundary, but +partial results may have been written to the destination buffer. The CCB can be resubmitted with a larger +page size memory allocation to complete the operation. + + + 527 + Coprocessor services + + + In the case of pipelined CCBs, a page overflow error will be triggered if the output from the pipeline source + CCB ends before the input of the pipeline target CCB. Page boundaries are ignored when the pipeline + hint is followed. + + Command kill indicates that the CCB execution was halted or prevented by use of the ccb_kill API call. + + Command timeout indicates that the CCB execution began, but did not complete within a pre-determined + limit set by the virtual machine. The command may have produced some or no output. The CCB may be + resubmitted with no alterations. + + ADI miscompare indicates that the memory buffer version specified in the CCB did not match the value + in memory when accessed by the virtual machine. Guest software should not attempt to resubmit the CCB + without determining the cause of the version mismatch. + + A data format error indicates that the input data stream did not follow the specified data input formatting + selected in the CCB. + + Some CCBs which encounter hardware errors may be resubmitted without change. Persistent hardware + errors may result in multiple failures until RAS software can identify and isolate the faulty component. + + The output size field indicates the number of bytes of valid output in the destination buffer. This field is + not valid for all possible CCB commands. + + The runtime field indicates the execution time of the CCB command once it leaves the internal virtual + machine queue. The time units are fixed, but unspecified, allowing only relative timing comparisons + by guest software. The time units may also vary by hardware platform, and should not be construed to + represent any absolute time value. + + Some data query commands process data in units of elements. If applicable to the command, the number of + elements processed is indicated in the listed field. This field is not valid for all possible CCB commands. + + The return value and extended return value fields are output locations for commands which do not use + a destination output buffer, or have secondary return results. The field is not valid for all possible CCB + commands. + +36.3. Hypervisor API Functions +36.3.1. ccb_submit + trap# FAST_TRAP + function# CCB_SUBMIT + arg0 address + arg1 length + arg2 flags + arg3 reserved + ret0 status + ret1 length + ret2 status data + ret3 reserved + + Submit one or more coprocessor control blocks (CCBs) for evaluation and processing by the virtual + machine. The CCBs are passed in a linear array indicated by address. length indicates the size of + the array in bytes. + + + 528 + Coprocessor services + + +The address should be aligned to the size indicated by length, rounded up to the nearest power of +two. Virtual machines implementations may reject submissions which do not adhere to that alignment. +length must be a multiple of 64 bytes. If length is zero, the maximum supported array length will be +returned as length in ret1. In all other cases, the length value in ret1 will reflect the number of bytes +successfully consumed from the input CCB array. + + Implementation note + Virtual machines should never reject submissions based on the alignment of address if the + entire array is contained within a single memory page of the smallest page size supported by the + virtual machine. + +A guest may choose to submit addresses used in this API function, including the CCB array address, +as either a real or virtual addresses, with the type of each address indicated in flags. Virtual addresses +must be present in either the TLB or an active TSB to be processed. The translation context for virtual +addresses is determined by a combination of CCB contents and the flags argument. + +The flags argument is divided into multiple fields defined as follows: + + +Bits Field Description +[63:16] Reserved +[15] Disable ADI for VA reads (in API 2.0) + Reserved (in API 1.0) +[14] Virtual addresses within CCBs are translated in privileged context +[13:12] Alternate translation context for virtual addresses within CCBs: + 0b'00 CCBs requesting alternate context are rejected + 0b'01 Reserved + 0b'10 CCBs requesting alternate context use secondary context + 0b'11 CCBs requesting alternate context use nucleus context +[11:9] Reserved +[8] Queue info flag +[7] All-or-nothing flag +[6] If address is a virtual address, treat its translation context as privileged +[5:4] Address type of address: + 0b'00 Real address + 0b'01 Virtual address in primary context + 0b'10 Virtual address in secondary context + 0b'11 Virtual address in nucleus context +[3:2] Reserved +[1:0] CCB command type: + 0b'00 Reserved + 0b'01 Reserved + 0b'10 Query command + 0b'11 Reserved + + + + 529 + Coprocessor services + + + The CCB submission type and address type for the CCB array must be provided in the flags argument. + All other fields are optional values which change the default behavior of the CCB processing. + + When set to one, the "Disable ADI for VA reads" bit will turn off ADI checking when using a virtual + address to load data. ADI checking will still be done when loading real-addressed memory. This bit is only + available when using major version 2 of the coprocessor API group; at major version 1 it is reserved. For + more information about using ADI and DAX, see Section 36.2.1.1.7, “Application Data Integrity (ADI)”. + + By default, all virtual addresses are treated as user addresses. If the virtual address translations are + privileged, they must be marked as such in the appropriate flags field. The virtual addresses used within + the submitted CCBs must all be translated with the same privilege level. + + By default, all virtual addresses used within the submitted CCBs are translated using the primary context + active at the time of the submission. The address type field within a CCB allows each address to request + translation in an alternate address context. The address context used when the alternate address context is + requested is selected in the flags argument. + + The all-or-nothing flag specifies whether the virtual machine should allow partial submissions of the + input CCB array. When using CCBs with serial-conditional flags, it is strongly recommended to use + the all-or-nothing flag to avoid broken conditional chains. Using long CCB chains on a machine under + high coprocessor load may make this impractical, however, and require submitting without the flag. + When submitting serial-conditional CCBs without the all-or-nothing flag, guest software must manually + implement the serial-conditional behavior at any point where the chain was not submitted in a single API + call, and resubmission of the remaining CCBs should clear any conditional flag that might be set in the + first remaining CCB. Failure to do so will produce indeterminate CCB execution status and ordering. + + When the all-or-nothing flag is not specified, callers should check the value of length in ret1 to determine + how many CCBs from the array were successfully submitted. Any remaining CCBs can be resubmitted + without modifications. + + The value of length in ret1 is also valid when the API call returns an error, and callers should always + check its value to determine which CCBs in the array were already processed. This will additionally + identify which CCB encountered the processing error, and was not submitted successfully. + + If the queue info flag is used during submission, and at least one CCB was successfully submitted, the + length value in ret1 will be a multi-field value defined as follows: + Bits Field Description + [63:48] DAX unit instance identifier + [47:32] DAX queue instance identifier + [31:16] Reserved + [15:0] Number of CCB bytes successfully submitted + + The value of status data depends on the status value. See error status code descriptions for details. + The value is undefined for status values that do not specifically list a value for the status data. + + The API has a reserved input and output register which will be used in subsequent minor versions of this + API function. Guest software implementations should treat that register as voltile across the function call + in order to maintain forward compatibility. + +36.3.1.1. Errors + EOK One or more CCBs have been accepted and enqueued in the virtual machine + and no errors were been encountered during submission. Some submitted + CCBs may not have been enqueued due to internal virtual machine limitations, + and may be resubmitted without changes. + + + 530 + Coprocessor services + + +EWOULDBLOCK An internal resource conflict within the virtual machine has prevented it from + being able to complete the CCB submissions sufficiently quickly, requiring + it to abandon processing before it was complete. Some CCBs may have been + successfully enqueued prior to the block, and all remaining CCBs may be + resubmitted without changes. +EBADALIGN CCB array is not on a 64-byte boundary, or the array length is not a multiple + of 64 bytes. +ENORADDR A real address used either for the CCB array, or within one of the submitted + CCBs, is not valid for the guest. Some CCBs may have been enqueued prior + to the error being detected. +ENOMAP A virtual address used either for the CCB array, or within one of the submitted + CCBs, could not be translated by the virtual machine using either the TLB + or TSB contents. The submission may be retried after adding the required + mapping, or by converting the virtual address into a real address. Due to the + shared nature of address translation resources, there is no theoretical limit on + the number of times the translation may fail, and it is recommended all guests + implement some real address based backup. The virtual address which failed + translation is returned as status data in ret2. Some CCBs may have been + enqueued prior to the error being detected. +EINVAL The virtual machine detected an invalid CCB during submission, or invalid + input arguments, such as bad flag values. Note that not all invalid CCB values + will be detected during submission, and some may be reported as errors in the + completion area instead. Some CCBs may have been enqueued prior to the + error being detected. This error may be returned if the CCB version is invalid. +ETOOMANY The request was submitted with the all-or-nothing flag set, and the array size is + greater than the virtual machine can support in a single request. The maximum + supported size for the current virtual machine can be queried by submitting a + request with a zero length array, as described above. +ENOACCESS The guest does not have permission to submit CCBs, or an address used in a + CCBs lacks sufficient permissions to perform the required operation (no write + permission on the destination buffer address, for example). A virtual address + which fails permission checking is returned as status data in ret2. Some + CCBs may have been enqueued prior to the error being detected. +EUNAVAILABLE The requested CCB operation could not be performed at this time. The + restricted operation availability may apply only to the first unsuccessfully + submitted CCB, or may apply to a larger scope. The status should not be + interpreted as permanent, and the guest should attempt to submit CCBs in + the future which had previously been unable to be performed. The status + data provides additional information about scope of the restricted availability + as follows: + Value Description + 0 Processing for the exact CCB instance submitted was unavailable, + and it is recommended the guest emulate the operation. The + guest should continue to submit all other CCBs, and assume no + restrictions beyond this exact CCB instance. + 1 Processing is unavailable for all CCBs using the requested opcode, + and it is recommended the guest emulate the operation. The + guest should continue to submit all other CCBs that use different + opcodes, but can expect continued rejections of CCBs using the + same opcode in the near future. + + + 531 + Coprocessor services + + + Value Description + 2 Processing is unavailable for all CCBs using the requested CCB + version, and it is recommended the guest emulate the operation. + The guest should continue to submit all other CCBs that use + different CCB versions, but can expect continued rejections of + CCBs using the same CCB version in the near future. + 3 Processing is unavailable for all CCBs on the submitting vcpu, + and it is recommended the guest emulate the operation or resubmit + the CCB on a different vcpu. The guest should continue to submit + CCBs on all other vcpus but can expect continued rejections of all + CCBs on this vcpu in the near future. + 4 Processing is unavailable for all CCBs, and it is recommended + the guest emulate the operation. The guest should expect all CCB + submissions to be similarly rejected in the near future. + + +36.3.2. ccb_info + + trap# FAST_TRAP + function# CCB_INFO + arg0 address + ret0 status + ret1 CCB state + ret2 position + ret3 dax + ret4 queue + + Requests status information on a previously submitted CCB. The previously submitted CCB is identified + by the 64-byte aligned real address of the CCBs completion area. + + A CCB can be in one of 4 states: + + + State Value Description + COMPLETED 0 The CCB has been fetched and executed, and is no longer active in + the virtual machine. + ENQUEUED 1 The requested CCB is current in a queue awaiting execution. + INPROGRESS 2 The CCB has been fetched and is currently being executed. It may still + be possible to stop the execution using the ccb_kill hypercall. + NOTFOUND 3 The CCB could not be located in the virtual machine, and does not + appear to have been executed. This may occur if the CCB was lost + due to a hardware error, or the CCB may not have been successfully + submitted to the virtual machine in the first place. + + Implementation note + Some platforms may not be able to report CCBs that are currently being processed, and therefore + guest software should invoke the ccb_kill hypercall prior to assuming the request CCB will never + be executed because it was in the NOTFOUND state. + + + 532 + Coprocessor services + + + The position return value is only valid when the state is ENQUEUED. The value returned is the number + of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute. + + The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit + instance identifier for the DAX unit processing the queue where the requested CCB is located. The value + matches the value that would have been, or was, returned by ccb_submit using the queue info flag. + + The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX + queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The + value matches the value that would have been, or was, returned by ccb_submit using the queue info flag. + +36.3.2.1. Errors + + EOK The request was processed and the CCB state is valid. + EBADALIGN address is not on a 64-byte aligned. + ENORADDR The real address provided for address is not valid. + EINVAL The CCB completion area contents are not valid. + EWOULDBLOCK Internal resource constraints prevented the CCB state from being queried at this + time. The guest should retry the request. + ENOACCESS The guest does not have permission to access the coprocessor virtual device + functionality. + +36.3.3. ccb_kill + + trap# FAST_TRAP + function# CCB_KILL + arg0 address + ret0 status + ret1 result + + Request to stop execution of a previously submitted CCB. The previously submitted CCB is identified by + the 64-byte aligned real address of the CCBs completion area. + + The kill attempt can produce one of several values in the result return value, reflecting the CCB state + and actions taken by the Hypervisor: + + Result Value Description + COMPLETED 0 The CCB has been fetched and executed, and is no longer active in + the virtual machine. It could not be killed and no action was taken. + DEQUEUED 1 The requested CCB was still enqueued when the kill request was + submitted, and has been removed from the queue. Since the CCB + never began execution, no memory modifications were produced by + it, and the completion area will never be updated. The same CCB may + be submitted again, if desired, with no modifications required. + KILLED 2 The CCB had been fetched and was being executed when the kill + request was submitted. The CCB execution was stopped, and the CCB + is no longer active in the virtual machine. The CCB completion area + will reflect the killed status, with the subsequent implications that + partial results may have been produced. Partial results may include full + + + 533 + Coprocessor services + + + Result Value Description + command execution if the command was stopped just prior to writing + to the completion area. + NOTFOUND 3 The CCB could not be located in the virtual machine, and does not + appear to have been executed. This may occur if the CCB was lost + due to a hardware error, or the CCB may not have been successfully + submitted to the virtual machine in the first place. CCBs in the state + are guaranteed to never execute in the future unless resubmitted. + +36.3.3.1. Interactions with Pipelined CCBs + + If the pipeline target CCB is killed but the pipeline source CCB was skipped, the completion area of the + target CCB may contain status (4,0) "Command was skipped" instead of (3,7) "Command was killed". + + If the pipeline source CCB is killed, the pipeline target CCB's completion status may read (1,0) "Success". + This does not mean the target CCB was processed; since the source CCB was killed, there was no + meaningful output on which the target CCB could operate. + +36.3.3.2. Errors + + EOK The request was processed and the result is valid. + EBADALIGN address is not on a 64-byte aligned. + ENORADDR The real address provided for address is not valid. + EINVAL The CCB completion area contents are not valid. + EWOULDBLOCK Internal resource constraints prevented the CCB from being killed at this time. + The guest should retry the request. + ENOACCESS The guest does not have permission to access the coprocessor virtual device + functionality. + +36.3.4. dax_info + trap# FAST_TRAP + function# DAX_INFO + ret0 status + ret1 Number of enabled DAX units + ret2 Number of disabled DAX units + + Returns the number of DAX units that are enabled for the calling guest to submit CCBs. The number of + DAX units that are disabled for the calling guest are also returned. A disabled DAX unit would have been + available for CCB submission to the calling guest had it not been offlined. + +36.3.4.1. Errors + + EOK The request was processed and the number of enabled/disabled DAX units + are valid. + + + + + 534 + diff --git a/Documentation/arch/sparc/oradax/oracle-dax.rst b/Documentation/arch/sparc/oradax/oracle-dax.rst new file mode 100644 index 000000000000..d1e14d572918 --- /dev/null +++ b/Documentation/arch/sparc/oradax/oracle-dax.rst @@ -0,0 +1,445 @@ +======================================= +Oracle Data Analytics Accelerator (DAX) +======================================= + +DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 +(DAX2) processor chips, and has direct access to the CPU's L3 caches +as well as physical memory. It can perform several operations on data +streams with various input and output formats. A driver provides a +transport mechanism and has limited knowledge of the various opcodes +and data formats. A user space library provides high level services +and translates these into low level commands which are then passed +into the driver and subsequently the Hypervisor and the coprocessor. +The library is the recommended way for applications to use the +coprocessor, and the driver interface is not intended for general use. +This document describes the general flow of the driver, its +structures, and its programmatic interface. It also provides example +code sufficient to write user or kernel applications that use DAX +functionality. + +The user library is open source and available at: + + https://oss.oracle.com/git/gitweb.cgi?p=libdax.git + +The Hypervisor interface to the coprocessor is described in detail in +the accompanying document, dax-hv-api.txt, which is a plain text +excerpt of the (Oracle internal) "UltraSPARC Virtual Machine +Specification" version 3.0.20+15, dated 2017-09-25. + + +High Level Overview +=================== + +A coprocessor request is described by a Command Control Block +(CCB). The CCB contains an opcode and various parameters. The opcode +specifies what operation is to be done, and the parameters specify +options, flags, sizes, and addresses. The CCB (or an array of CCBs) +is passed to the Hypervisor, which handles queueing and scheduling of +requests to the available coprocessor execution units. A status code +returned indicates if the request was submitted successfully or if +there was an error. One of the addresses given in each CCB is a +pointer to a "completion area", which is a 128 byte memory block that +is written by the coprocessor to provide execution status. No +interrupt is generated upon completion; the completion area must be +polled by software to find out when a transaction has finished, but +the M7 and later processors provide a mechanism to pause the virtual +processor until the completion status has been updated by the +coprocessor. This is done using the monitored load and mwait +instructions, which are described in more detail later. The DAX +coprocessor was designed so that after a request is submitted, the +kernel is no longer involved in the processing of it. The polling is +done at the user level, which results in almost zero latency between +completion of a request and resumption of execution of the requesting +thread. + + +Addressing Memory +================= + +The kernel does not have access to physical memory in the Sun4v +architecture, as there is an additional level of memory virtualization +present. This intermediate level is called "real" memory, and the +kernel treats this as if it were physical. The Hypervisor handles the +translations between real memory and physical so that each logical +domain (LDOM) can have a partition of physical memory that is isolated +from that of other LDOMs. When the kernel sets up a virtual mapping, +it specifies a virtual address and the real address to which it should +be mapped. + +The DAX coprocessor can only operate on physical memory, so before a +request can be fed to the coprocessor, all the addresses in a CCB must +be converted into physical addresses. The kernel cannot do this since +it has no visibility into physical addresses. So a CCB may contain +either the virtual or real addresses of the buffers or a combination +of them. An "address type" field is available for each address that +may be given in the CCB. In all cases, the Hypervisor will translate +all the addresses to physical before dispatching to hardware. Address +translations are performed using the context of the process initiating +the request. + + +The Driver API +============== + +An application makes requests to the driver via the write() system +call, and gets results (if any) via read(). The completion areas are +made accessible via mmap(), and are read-only for the application. + +The request may either be an immediate command or an array of CCBs to +be submitted to the hardware. + +Each open instance of the device is exclusive to the thread that +opened it, and must be used by that thread for all subsequent +operations. The driver open function creates a new context for the +thread and initializes it for use. This context contains pointers and +values used internally by the driver to keep track of submitted +requests. The completion area buffer is also allocated, and this is +large enough to contain the completion areas for many concurrent +requests. When the device is closed, any outstanding transactions are +flushed and the context is cleaned up. + +On a DAX1 system (M7), the device will be called "oradax1", while on a +DAX2 system (M8) it will be "oradax2". If an application requires one +or the other, it should simply attempt to open the appropriate +device. Only one of the devices will exist on any given system, so the +name can be used to determine what the platform supports. + +The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For +all of these, success is indicated by a return value from write() +equal to the number of bytes given in the call. Otherwise -1 is +returned and errno is set. + +CCB_DEQUEUE +----------- + +Tells the driver to clean up resources associated with past +requests. Since no interrupt is generated upon the completion of a +request, the driver must be told when it may reclaim resources. No +further status information is returned, so the user should not +subsequently call read(). + +CCB_KILL +-------- + +Kills a CCB during execution. The CCB is guaranteed to not continue +executing once this call returns successfully. On success, read() must +be called to retrieve the result of the action. + +CCB_INFO +-------- + +Retrieves information about a currently executing CCB. Note that some +Hypervisors might return 'notfound' when the CCB is in 'inprogress' +state. To ensure a CCB in the 'notfound' state will never be executed, +CCB_KILL must be invoked on that CCB. Upon success, read() must be +called to retrieve the details of the action. + +Submission of an array of CCBs for execution +--------------------------------------------- + +A write() whose length is a multiple of the CCB size is treated as a +submit operation. The file offset is treated as the index of the +completion area to use, and may be set via lseek() or using the +pwrite() system call. If -1 is returned then errno is set to indicate +the error. Otherwise, the return value is the length of the array that +was actually accepted by the coprocessor. If the accepted length is +equal to the requested length, then the submission was completely +successful and there is no further status needed; hence, the user +should not subsequently call read(). Partial acceptance of the CCB +array is indicated by a return value less than the requested length, +and read() must be called to retrieve further status information. The +status will reflect the error caused by the first CCB that was not +accepted, and status_data will provide additional data in some cases. + +MMAP +---- + +The mmap() function provides access to the completion area allocated +in the driver. Note that the completion area is not writeable by the +user process, and the mmap call must not specify PROT_WRITE. + + +Completion of a Request +======================= + +The first byte in each completion area is the command status which is +updated by the coprocessor hardware. Software may take advantage of +new M7/M8 processor capabilities to efficiently poll this status byte. +First, a "monitored load" is achieved via a Load from Alternate Space +(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a +"monitored wait" is achieved via the mwait instruction (a write to +%asr28). This instruction is like pause in that it suspends execution +of the virtual processor for the given number of nanoseconds, but in +addition will terminate early when one of several events occur. If the +block of data containing the monitored location is modified, then the +mwait terminates. This causes software to resume execution immediately +(without a context switch or kernel to user transition) after a +transaction completes. Thus the latency between transaction completion +and resumption of execution may be just a few nanoseconds. + + +Application Life Cycle of a DAX Submission +========================================== + + - open dax device + - call mmap() to get the completion area address + - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc. + - submit CCB via write() or pwrite() + - go into a loop executing monitored load + monitored wait and + terminate when the command status indicates the request is complete + (CCB_KILL or CCB_INFO may be used any time as necessary) + - perform a CCB_DEQUEUE + - call munmap() for completion area + - close the dax device + + +Memory Constraints +================== + +The DAX hardware operates only on physical addresses. Therefore, it is +not aware of virtual memory mappings and the discontiguities that may +exist in the physical memory that a virtual buffer maps to. There is +no I/O TLB or any scatter/gather mechanism. All buffers, whether input +or output, must reside in a physically contiguous region of memory. + +The Hypervisor translates all addresses within a CCB to physical +before handing off the CCB to DAX. The Hypervisor determines the +virtual page size for each virtual address given, and uses this to +program a size limit for each address. This prevents the coprocessor +from reading or writing beyond the bound of the virtual page, even +though it is accessing physical memory directly. A simpler way of +saying this is that a DAX operation will never "cross" a virtual page +boundary. If an 8k virtual page is used, then the data is strictly +limited to 8k. If a user's buffer is larger than 8k, then a larger +page size must be used, or the transaction size will be truncated to +8k. + +Huge pages. A user may allocate huge pages using standard interfaces. +Memory buffers residing on huge pages may be used to achieve much +larger DAX transaction sizes, but the rules must still be followed, +and no transaction will cross a page boundary, even a huge page. A +major caveat is that Linux on Sparc presents 8Mb as one of the huge +page sizes. Sparc does not actually provide a 8Mb hardware page size, +and this size is synthesized by pasting together two 4Mb pages. The +reasons for this are historical, and it creates an issue because only +half of this 8Mb page can actually be used for any given buffer in a +DAX request, and it must be either the first half or the second half; +it cannot be a 4Mb chunk in the middle, since that crosses a +(hardware) page boundary. Note that this entire issue may be hidden by +higher level libraries. + + +CCB Structure +------------- +A CCB is an array of 8 64-bit words. Several of these words provide +command opcodes, parameters, flags, etc., and the rest are addresses +for the completion area, output buffer, and various inputs:: + + struct ccb { + u64 control; + u64 completion; + u64 input0; + u64 access; + u64 input1; + u64 op_data; + u64 output; + u64 table; + }; + +See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of +each of these fields, and see dax-hv-api.txt for a complete description +of the Hypervisor API available to the guest OS (ie, Linux kernel). + +The first word (control) is examined by the driver for the following: + - CCB version, which must be consistent with hardware version + - Opcode, which must be one of the documented allowable commands + - Address types, which must be set to "virtual" for all the addresses + given by the user, thereby ensuring that the application can + only access memory that it owns + + +Example Code +============ + +The DAX is accessible to both user and kernel code. The kernel code +can make hypercalls directly while the user code must use wrappers +provided by the driver. The setup of the CCB is nearly identical for +both; the only difference is in preparation of the completion area. An +example of user code is given now, with kernel code afterwards. + +In order to program using the driver API, the file +arch/sparc/include/uapi/asm/oradax.h must be included. + +First, the proper device must be opened. For M7 it will be +/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest +procedure is to attempt to open both, as only one will succeed:: + + fd = open("/dev/oradax1", O_RDWR); + if (fd < 0) + fd = open("/dev/oradax2", O_RDWR); + if (fd < 0) + /* No DAX found */ + +Next, the completion area must be mapped:: + + completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0); + +All input and output buffers must be fully contained in one hardware +page, since as explained above, the DAX is strictly constrained by +virtual page boundaries. In addition, the output buffer must be +64-byte aligned and its size must be a multiple of 64 bytes because +the coprocessor writes in units of cache lines. + +This example demonstrates the DAX Scan command, which takes as input a +vector and a match value, and produces a bitmap as the output. For +each input element that matches the value, the corresponding bit is +set in the output. + +In this example, the input vector consists of a series of single bits, +and the match value is 0. So each 0 bit in the input will produce a 1 +in the output, and vice versa, which produces an output bitmap which +is the input bitmap inverted. + +For details of all the parameters and bits used in this CCB, please +refer to section 36.2.1.3 of the DAX Hypervisor API document, which +describes the Scan command in detail:: + + ccb->control = /* Table 36.1, CCB Header Format */ + (2L << 48) /* command = Scan Value */ + | (3L << 40) /* output address type = primary virtual */ + | (3L << 34) /* primary input address type = primary virtual */ + /* Section 36.2.1, Query CCB Command Formats */ + | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */ + | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */ + | (8 << 10) /* 36.2.1.1.6 output format = bit vector */ + | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */ + | (31 << 0); /* 36.2.1.3 Disable second scan criteria */ + + ccb->completion = 0; /* Completion area address, to be filled in by driver */ + + ccb->input0 = (unsigned long) input; /* primary input address */ + + ccb->access = /* Section 36.2.1.2, Data Access Control */ + (2 << 24) /* Primary input length format = bits */ + | (nbits - 1); /* number of bits in primary input stream, minus 1 */ + + ccb->input1 = 0; /* secondary input address, unused */ + + ccb->op_data = 0; /* scan criteria (value to be matched) */ + + ccb->output = (unsigned long) output; /* output address */ + + ccb->table = 0; /* table address, unused */ + +The CCB submission is a write() or pwrite() system call to the +driver. If the call fails, then a read() must be used to retrieve the +status:: + + if (pwrite(fd, ccb, 64, 0) != 64) { + struct ccb_exec_result status; + read(fd, &status, sizeof(status)); + /* bail out */ + } + +After a successful submission of the CCB, the completion area may be +polled to determine when the DAX is finished. Detailed information on +the contents of the completion area can be found in section 36.2.2 of +the DAX HV API document:: + + while (1) { + /* Monitored Load */ + __asm__ __volatile__("lduba [%1] 0x84, %0\n" + : "=r" (status) + : "r" (completion_area)); + + if (status) /* 0 indicates command in progress */ + break; + + /* MWAIT */ + __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ + } + +A completion area status of 1 indicates successful completion of the +CCB and validity of the output bitmap, which may be used immediately. +All other non-zero values indicate error conditions which are +described in section 36.2.2:: + + if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ + /* completion_area[0] contains the completion status */ + /* completion_area[1] contains an error code, see 36.2.2 */ + } + +After the completion area has been processed, the driver must be +notified that it can release any resources associated with the +request. This is done via the dequeue operation:: + + struct dax_command cmd; + cmd.command = CCB_DEQUEUE; + if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) { + /* bail out */ + } + +Finally, normal program cleanup should be done, i.e., unmapping +completion area, closing the dax device, freeing memory etc. + +Kernel example +-------------- + +The only difference in using the DAX in kernel code is the treatment +of the completion area. Unlike user applications which mmap the +completion area allocated by the driver, kernel code must allocate its +own memory to use for the completion area, and this address and its +type must be given in the CCB:: + + ccb->control |= /* Table 36.1, CCB Header Format */ + (3L << 32); /* completion area address type = primary virtual */ + + ccb->completion = (unsigned long) completion_area; /* Completion area address */ + +The dax submit hypercall is made directly. The flags used in the +ccb_submit call are documented in the DAX HV API in section 36.3.1/ + +:: + + #include <asm/hypervisor.h> + + hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64, + HV_CCB_QUERY_CMD | + HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY | + HV_CCB_VA_PRIVILEGED, + 0, &bytes_accepted, &status_data); + + if (hv_rv != HV_EOK) { + /* hv_rv is an error code, status_data contains */ + /* potential additional status, see 36.3.1.1 */ + } + +After the submission, the completion area polling code is identical to +that in user land:: + + while (1) { + /* Monitored Load */ + __asm__ __volatile__("lduba [%1] 0x84, %0\n" + : "=r" (status) + : "r" (completion_area)); + + if (status) /* 0 indicates command in progress */ + break; + + /* MWAIT */ + __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ + } + + if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ + /* completion_area[0] contains the completion status */ + /* completion_area[1] contains an error code, see 36.2.2 */ + } + +The output bitmap is ready for consumption immediately after the +completion status indicates success. + +Excer[t from UltraSPARC Virtual Machine Specification +===================================================== + + .. include:: dax-hv-api.txt + :literal: diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst new file mode 100644 index 000000000000..934310ce7258 --- /dev/null +++ b/Documentation/arch/x86/amd-memory-encryption.rst @@ -0,0 +1,133 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +AMD Memory Encryption +===================== + +Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are +features found on AMD processors. + +SME provides the ability to mark individual pages of memory as encrypted using +the standard x86 page tables. A page that is marked encrypted will be +automatically decrypted when read from DRAM and encrypted when written to +DRAM. SME can therefore be used to protect the contents of DRAM from physical +attacks on the system. + +SEV enables running encrypted virtual machines (VMs) in which the code and data +of the guest VM are secured so that a decrypted version is available only +within the VM itself. SEV guest VMs have the concept of private and shared +memory. Private memory is encrypted with the guest-specific key, while shared +memory may be encrypted with hypervisor key. When SME is enabled, the hypervisor +key is the same key which is used in SME. + +A page is encrypted when a page table entry has the encryption bit set (see +below on how to determine its position). The encryption bit can also be +specified in the cr3 register, allowing the PGD table to be encrypted. Each +successive level of page tables can also be encrypted by setting the encryption +bit in the page table entry that points to the next table. This allows the full +page table hierarchy to be encrypted. Note, this means that just because the +encryption bit is set in cr3, doesn't imply the full hierarchy is encrypted. +Each page table entry in the hierarchy needs to have the encryption bit set to +achieve that. So, theoretically, you could have the encryption bit set in cr3 +so that the PGD is encrypted, but not set the encryption bit in the PGD entry +for a PUD which results in the PUD pointed to by that entry to not be +encrypted. + +When SEV is enabled, instruction pages and guest page tables are always treated +as private. All the DMA operations inside the guest must be performed on shared +memory. Since the memory encryption bit is controlled by the guest OS when it +is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware +forces the memory encryption bit to 1. + +Support for SME and SEV can be determined through the CPUID instruction. The +CPUID function 0x8000001f reports information related to SME:: + + 0x8000001f[eax]: + Bit[0] indicates support for SME + Bit[1] indicates support for SEV + 0x8000001f[ebx]: + Bits[5:0] pagetable bit number used to activate memory + encryption + Bits[11:6] reduction in physical address space, in bits, when + memory encryption is enabled (this only affects + system physical addresses, not guest physical + addresses) + +If support for SME is present, MSR 0xc00100010 (MSR_AMD64_SYSCFG) can be used to +determine if SME is enabled and/or to enable memory encryption:: + + 0xc0010010: + Bit[23] 0 = memory encryption features are disabled + 1 = memory encryption features are enabled + +If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if +SEV is active:: + + 0xc0010131: + Bit[0] 0 = memory encryption is not active + 1 = memory encryption is active + +Linux relies on BIOS to set this bit if BIOS has determined that the reduction +in the physical address space as a result of enabling memory encryption (see +CPUID information above) will not conflict with the address space resource +requirements for the system. If this bit is not set upon Linux startup then +Linux itself will not set it and memory encryption will not be possible. + +The state of SME in the Linux kernel can be documented as follows: + + - Supported: + The CPU supports SME (determined through CPUID instruction). + + - Enabled: + Supported and bit 23 of MSR_AMD64_SYSCFG is set. + + - Active: + Supported, Enabled and the Linux kernel is actively applying + the encryption bit to page table entries (the SME mask in the + kernel is non-zero). + +SME can also be enabled and activated in the BIOS. If SME is enabled and +activated in the BIOS, then all memory accesses will be encrypted and it will +not be necessary to activate the Linux memory encryption support. If the BIOS +merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate +memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or +by supplying mem_encrypt=on on the kernel command line. However, if BIOS does +not enable SME, then Linux will not be able to activate memory encryption, even +if configured to do so by default or the mem_encrypt=on command line parameter +is specified. + +Secure Nested Paging (SNP) +========================== + +SEV-SNP introduces new features (SEV_FEATURES[1:63]) which can be enabled +by the hypervisor for security enhancements. Some of these features need +guest side implementation to function correctly. The below table lists the +expected guest behavior with various possible scenarios of guest/hypervisor +SNP feature support. + ++-----------------+---------------+---------------+------------------+ +| Feature Enabled | Guest needs | Guest has | Guest boot | +| by the HV | implementation| implementation| behaviour | ++=================+===============+===============+==================+ +| No | No | No | Boot | +| | | | | ++-----------------+---------------+---------------+------------------+ +| No | Yes | No | Boot | +| | | | | ++-----------------+---------------+---------------+------------------+ +| No | Yes | Yes | Boot | +| | | | | ++-----------------+---------------+---------------+------------------+ +| Yes | No | No | Boot with | +| | | | feature enabled | ++-----------------+---------------+---------------+------------------+ +| Yes | Yes | No | Graceful boot | +| | | | failure | ++-----------------+---------------+---------------+------------------+ +| Yes | Yes | Yes | Boot with | +| | | | feature enabled | ++-----------------+---------------+---------------+------------------+ + +More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR + +[1] https://www.amd.com/system/files/TechDocs/40332.pdf diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst new file mode 100644 index 000000000000..440e4b645a1c --- /dev/null +++ b/Documentation/arch/x86/amd_hsmp.rst @@ -0,0 +1,86 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ +AMD HSMP interface +============================================ + +Newer Fam19h EPYC server line of processors from AMD support system +management functionality via HSMP (Host System Management Port). + +The Host System Management Port (HSMP) is an interface to provide +OS-level software with access to system management functions via a +set of mailbox registers. + +More details on the interface can be found in chapter +"7 Host System Management Port (HSMP)" of the family/model PPR +Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip + +HSMP interface is supported on EPYC server CPU models only. + + +HSMP device +============================================ + +amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice +/dev/hsmp to let user space programs run hsmp mailbox commands. + +$ ls -al /dev/hsmp +crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp + +Characteristics of the dev node: + * Write mode is used for running set/configure commands + * Read mode is used for running get/status monitor commands + +Access restrictions: + * Only root user is allowed to open the file in write mode. + * The file can be opened in read mode by all the users. + +In-kernel integration: + * Other subsystems in the kernel can use the exported transport + function hsmp_send_message(). + * Locking across callers is taken care by the driver. + + +An example +========== + +To access hsmp device from a C program. +First, you need to include the headers:: + + #include <linux/amd_hsmp.h> + +Which defines the supported messages/message IDs. + +Next thing, open the device file, as follows:: + + int file; + + file = open("/dev/hsmp", O_RDWR); + if (file < 0) { + /* ERROR HANDLING; you can check errno to see what went wrong */ + exit(1); + } + +The following IOCTL is defined: + +``ioctl(file, HSMP_IOCTL_CMD, struct hsmp_message *msg)`` + The argument is a pointer to a:: + + struct hsmp_message { + __u32 msg_id; /* Message ID */ + __u16 num_args; /* Number of input argument words in message */ + __u16 response_sz; /* Number of expected output/response words */ + __u32 args[HSMP_MAX_MSG_LEN]; /* argument/response buffer */ + __u16 sock_ind; /* socket number */ + }; + +The ioctl would return a non-zero on failure; you can read errno to see +what happened. The transaction returns 0 on success. + +More details on the interface and message definitions can be found in chapter +"7 Host System Management Port (HSMP)" of the respective family/model PPR +eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip + +User space C-APIs are made available by linking against the esmi library, +which is provided by the E-SMS project https://developer.amd.com/e-sms/. +See: https://github.com/amd/esmi_ib_library diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst new file mode 100644 index 000000000000..33520ecdb37a --- /dev/null +++ b/Documentation/arch/x86/boot.rst @@ -0,0 +1,1443 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +The Linux/x86 Boot Protocol +=========================== + +On the x86 platform, the Linux kernel uses a rather complicated boot +convention. This has evolved partially due to historical aspects, as +well as the desire in the early days to have the kernel itself be a +bootable image, the complicated PC memory model and due to changed +expectations in the PC industry caused by the effective demise of +real-mode DOS as a mainstream operating system. + +Currently, the following versions of the Linux/x86 boot protocol exist. + +============= ============================================================ +Old kernels zImage/Image support only. Some very early kernels + may not even support a command line. + +Protocol 2.00 (Kernel 1.3.73) Added bzImage and initrd support, as + well as a formalized way to communicate between the + boot loader and the kernel. setup.S made relocatable, + although the traditional setup area still assumed + writable. + +Protocol 2.01 (Kernel 1.3.76) Added a heap overrun warning. + +Protocol 2.02 (Kernel 2.4.0-test3-pre3) New command line protocol. + Lower the conventional memory ceiling. No overwrite + of the traditional setup area, thus making booting + safe for systems which use the EBDA from SMM or 32-bit + BIOS entry points. zImage deprecated but still + supported. + +Protocol 2.03 (Kernel 2.4.18-pre1) Explicitly makes the highest possible + initrd address available to the bootloader. + +Protocol 2.04 (Kernel 2.6.14) Extend the syssize field to four bytes. + +Protocol 2.05 (Kernel 2.6.20) Make protected mode kernel relocatable. + Introduce relocatable_kernel and kernel_alignment fields. + +Protocol 2.06 (Kernel 2.6.22) Added a field that contains the size of + the boot command line. + +Protocol 2.07 (Kernel 2.6.24) Added paravirtualised boot protocol. + Introduced hardware_subarch and hardware_subarch_data + and KEEP_SEGMENTS flag in load_flags. + +Protocol 2.08 (Kernel 2.6.26) Added crc32 checksum and ELF format + payload. Introduced payload_offset and payload_length + fields to aid in locating the payload. + +Protocol 2.09 (Kernel 2.6.26) Added a field of 64-bit physical + pointer to single linked list of struct setup_data. + +Protocol 2.10 (Kernel 2.6.31) Added a protocol for relaxed alignment + beyond the kernel_alignment added, new init_size and + pref_address fields. Added extended boot loader IDs. + +Protocol 2.11 (Kernel 3.6) Added a field for offset of EFI handover + protocol entry point. + +Protocol 2.12 (Kernel 3.8) Added the xloadflags field and extension fields + to struct boot_params for loading bzImage and ramdisk + above 4G in 64bit. + +Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in + xloadflags to support booting a 64-bit kernel from 32-bit + EFI + +Protocol 2.14 BURNT BY INCORRECT COMMIT + ae7e1238e68f2a472a125673ab506d49158c1889 + (x86/boot: Add ACPI RSDP address to setup_header) + DO NOT USE!!! ASSUME SAME AS 2.13. + +Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max. +============= ============================================================ + +.. note:: + The protocol version number should be changed only if the setup header + is changed. There is no need to update the version number if boot_params + or kernel_info are changed. Additionally, it is recommended to use + xloadflags (in this case the protocol version number should not be + updated either) or kernel_info to communicate supported Linux kernel + features to the boot loader. Due to very limited space available in + the original setup header every update to it should be considered + with great care. Starting from the protocol 2.15 the primary way to + communicate things to the boot loader is the kernel_info. + + +Memory Layout +============= + +The traditional memory map for the kernel loader, used for Image or +zImage kernels, typically looks like:: + + | | + 0A0000 +------------------------+ + | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. + 09A000 +------------------------+ + | Command line | + | Stack/heap | For use by the kernel real-mode code. + 098000 +------------------------+ + | Kernel setup | The kernel real-mode code. + 090200 +------------------------+ + | Kernel boot sector | The kernel legacy boot sector. + 090000 +------------------------+ + | Protected-mode kernel | The bulk of the kernel image. + 010000 +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ + +When using bzImage, the protected-mode kernel was relocated to +0x100000 ("high memory"), and the kernel real-mode block (boot sector, +setup, and stack/heap) was made relocatable to any address between +0x10000 and end of low memory. Unfortunately, in protocols 2.00 and +2.01 the 0x90000+ memory range is still used internally by the kernel; +the 2.02 protocol resolves that problem. + +It is desirable to keep the "memory ceiling" -- the highest point in +low memory touched by the boot loader -- as low as possible, since +some newer BIOSes have begun to allocate some rather large amounts of +memory, called the Extended BIOS Data Area, near the top of low +memory. The boot loader should use the "INT 12h" BIOS call to verify +how much low memory is available. + +Unfortunately, if INT 12h reports that the amount of memory is too +low, there is usually nothing the boot loader can do but to report an +error to the user. The boot loader should therefore be designed to +take up as little space in low memory as it reasonably can. For +zImage or old bzImage kernels, which need data written into the +0x90000 segment, the boot loader should make sure not to use memory +above the 0x9A000 point; too many BIOSes will break above that point. + +For a modern bzImage kernel with boot protocol version >= 2.02, a +memory layout like the following is suggested:: + + ~ ~ + | Protected-mode kernel | + 100000 +------------------------+ + | I/O memory hole | + 0A0000 +------------------------+ + | Reserved for BIOS | Leave as much as possible unused + ~ ~ + | Command line | (Can also be below the X+10000 mark) + X+10000 +------------------------+ + | Stack/heap | For use by the kernel real-mode code. + X+08000 +------------------------+ + | Kernel setup | The kernel real-mode code. + | Kernel boot sector | The kernel legacy boot sector. + X +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ + + ... where the address X is as low as the design of the boot loader permits. + + +The Real-Mode Kernel Header +=========================== + +In the following text, and anywhere in the kernel boot sequence, "a +sector" refers to 512 bytes. It is independent of the actual sector +size of the underlying medium. + +The first step in loading a Linux kernel should be to load the +real-mode code (boot sector and setup code) and then examine the +following header at offset 0x01f1. The real-mode code can total up to +32K, although the boot loader may choose to load only the first two +sectors (1K) and then examine the bootup sector size. + +The header looks like: + +=========== ======== ===================== ============================================ +Offset/Size Proto Name Meaning +=========== ======== ===================== ============================================ +01F1/1 ALL(1) setup_sects The size of the setup in sectors +01F2/2 ALL root_flags If set, the root is mounted readonly +01F4/4 2.04+(2) syssize The size of the 32-bit code in 16-byte paras +01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only +01FA/2 ALL vid_mode Video mode control +01FC/2 ALL root_dev Default root device number +01FE/2 ALL boot_flag 0xAA55 magic number +0200/2 2.00+ jump Jump instruction +0202/4 2.00+ header Magic signature "HdrS" +0206/2 2.00+ version Boot protocol version supported +0208/4 2.00+ realmode_swtch Boot loader hook (see below) +020C/2 2.00+ start_sys_seg The load-low segment (0x1000) (obsolete) +020E/2 2.00+ kernel_version Pointer to kernel version string +0210/1 2.00+ type_of_loader Boot loader identifier +0211/1 2.00+ loadflags Boot protocol option flags +0212/2 2.00+ setup_move_size Move to high memory size (used with hooks) +0214/4 2.00+ code32_start Boot loader hook (see below) +0218/4 2.00+ ramdisk_image initrd load address (set by boot loader) +021C/4 2.00+ ramdisk_size initrd size (set by boot loader) +0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only +0224/2 2.01+ heap_end_ptr Free memory after setup end +0226/1 2.02+(3) ext_loader_ver Extended boot loader version +0227/1 2.02+(3) ext_loader_type Extended boot loader ID +0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line +022C/4 2.03+ initrd_addr_max Highest legal initrd address +0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel +0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not +0235/1 2.10+ min_alignment Minimum alignment, as a power of two +0236/2 2.12+ xloadflags Boot protocol option flags +0238/4 2.06+ cmdline_size Maximum size of the kernel command line +023C/4 2.07+ hardware_subarch Hardware subarchitecture +0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data +0248/4 2.08+ payload_offset Offset of kernel payload +024C/4 2.08+ payload_length Length of kernel payload +0250/8 2.09+ setup_data 64-bit physical pointer to linked list + of struct setup_data +0258/8 2.10+ pref_address Preferred loading address +0260/4 2.10+ init_size Linear memory required during initialization +0264/4 2.11+ handover_offset Offset of handover entry point +0268/4 2.15+ kernel_info_offset Offset of the kernel_info +=========== ======== ===================== ============================================ + +.. note:: + (1) For backwards compatibility, if the setup_sects field contains 0, the + real value is 4. + + (2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. + + (3) Ignored, but safe to set, for boot protocols 2.02-2.09. + +If the "HdrS" (0x53726448) magic number is not found at offset 0x202, +the boot protocol version is "old". Loading an old kernel, the +following parameters should be assumed:: + + Image type = zImage + initrd not supported + Real-mode kernel must be located at 0x90000. + +Otherwise, the "version" field contains the protocol version, +e.g. protocol version 2.01 will contain 0x0201 in this field. When +setting fields in the header, you must make sure only to set fields +supported by the protocol version in use. + + +Details of Header Fields +======================== + +For each field, some are information from the kernel to the bootloader +("read"), some are expected to be filled out by the bootloader +("write"), and some are expected to be read and modified by the +bootloader ("modify"). + +All general purpose boot loaders should write the fields marked +(obligatory). Boot loaders who want to load the kernel at a +nonstandard address should fill in the fields marked (reloc); other +boot loaders can ignore those fields. + +The byte order of all fields is littleendian (this is x86, after all.) + +============ =========== +Field name: setup_sects +Type: read +Offset/size: 0x1f1/1 +Protocol: ALL +============ =========== + + The size of the setup code in 512-byte sectors. If this field is + 0, the real value is 4. The real-mode code consists of the boot + sector (always one 512-byte sector) plus the setup code. + +============ ================= +Field name: root_flags +Type: modify (optional) +Offset/size: 0x1f2/2 +Protocol: ALL +============ ================= + + If this field is nonzero, the root defaults to readonly. The use of + this field is deprecated; use the "ro" or "rw" options on the + command line instead. + +============ =============================================== +Field name: syssize +Type: read +Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL) +Protocol: 2.04+ +============ =============================================== + + The size of the protected-mode code in units of 16-byte paragraphs. + For protocol versions older than 2.04 this field is only two bytes + wide, and therefore cannot be trusted for the size of a kernel if + the LOAD_HIGH flag is set. + +============ =============== +Field name: ram_size +Type: kernel internal +Offset/size: 0x1f8/2 +Protocol: ALL +============ =============== + + This field is obsolete. + +============ =================== +Field name: vid_mode +Type: modify (obligatory) +Offset/size: 0x1fa/2 +============ =================== + + Please see the section on SPECIAL COMMAND LINE OPTIONS. + +============ ================= +Field name: root_dev +Type: modify (optional) +Offset/size: 0x1fc/2 +Protocol: ALL +============ ================= + + The default root device device number. The use of this field is + deprecated, use the "root=" option on the command line instead. + +============ ========= +Field name: boot_flag +Type: read +Offset/size: 0x1fe/2 +Protocol: ALL +============ ========= + + Contains 0xAA55. This is the closest thing old Linux kernels have + to a magic number. + +============ ======= +Field name: jump +Type: read +Offset/size: 0x200/2 +Protocol: 2.00+ +============ ======= + + Contains an x86 jump instruction, 0xEB followed by a signed offset + relative to byte 0x202. This can be used to determine the size of + the header. + +============ ======= +Field name: header +Type: read +Offset/size: 0x202/4 +Protocol: 2.00+ +============ ======= + + Contains the magic number "HdrS" (0x53726448). + +============ ======= +Field name: version +Type: read +Offset/size: 0x206/2 +Protocol: 2.00+ +============ ======= + + Contains the boot protocol version, in (major << 8)+minor format, + e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version + 10.17. + +============ ================= +Field name: realmode_swtch +Type: modify (optional) +Offset/size: 0x208/4 +Protocol: 2.00+ +============ ================= + + Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.) + +============ ============= +Field name: start_sys_seg +Type: read +Offset/size: 0x20c/2 +Protocol: 2.00+ +============ ============= + + The load low segment (0x1000). Obsolete. + +============ ============== +Field name: kernel_version +Type: read +Offset/size: 0x20e/2 +Protocol: 2.00+ +============ ============== + + If set to a nonzero value, contains a pointer to a NUL-terminated + human-readable kernel version number string, less 0x200. This can + be used to display the kernel version to the user. This value + should be less than (0x200*setup_sects). + + For example, if this value is set to 0x1c00, the kernel version + number string can be found at offset 0x1e00 in the kernel file. + This is a valid value if and only if the "setup_sects" field + contains the value 15 or higher, as:: + + 0x1c00 < 15*0x200 (= 0x1e00) but + 0x1c00 >= 14*0x200 (= 0x1c00) + + 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. + +============ ================== +Field name: type_of_loader +Type: write (obligatory) +Offset/size: 0x210/1 +Protocol: 2.00+ +============ ================== + + If your boot loader has an assigned id (see table below), enter + 0xTV here, where T is an identifier for the boot loader and V is + a version number. Otherwise, enter 0xFF here. + + For boot loader IDs above T = 0xD, write T = 0xE to this field and + write the extended ID minus 0x10 to the ext_loader_type field. + Similarly, the ext_loader_ver field can be used to provide more than + four bits for the bootloader version. + + For example, for T = 0x15, V = 0x234, write:: + + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 + + Assigned boot loader ids (hexadecimal): + + == ======================================= + 0 LILO + (0x00 reserved for pre-2.00 bootloader) + 1 Loadlin + 2 bootsect-loader + (0x20, all other values reserved) + 3 Syslinux + 4 Etherboot/gPXE/iPXE + 5 ELILO + 7 GRUB + 8 U-Boot + 9 Xen + A Gujin + B Qemu + C Arcturus Networks uCbootloader + D kexec-tools + E Extended (see ext_loader_type) + F Special (0xFF = undefined) + 10 Reserved + 11 Minimal Linux Bootloader + <http://sebastian-plotz.blogspot.de> + 12 OVMF UEFI virtualization stack + 13 barebox + == ======================================= + + Please contact <hpa@zytor.com> if you need a bootloader ID value assigned. + +============ =================== +Field name: loadflags +Type: modify (obligatory) +Offset/size: 0x211/1 +Protocol: 2.00+ +============ =================== + + This field is a bitmask. + + Bit 0 (read): LOADED_HIGH + + - If 0, the protected-mode code is loaded at 0x10000. + - If 1, the protected-mode code is loaded at 0x100000. + + Bit 1 (kernel internal): KASLR_FLAG + + - Used internally by the compressed kernel to communicate + KASLR status to kernel proper. + + - If 1, KASLR enabled. + - If 0, KASLR disabled. + + Bit 5 (write): QUIET_FLAG + + - If 0, print early messages. + - If 1, suppress early messages. + + This requests to the kernel (decompressor and early + kernel) to not write early messages that require + accessing the display hardware directly. + + Bit 6 (obsolete): KEEP_SEGMENTS + + Protocol: 2.07+ + + - This flag is obsolete. + + Bit 7 (write): CAN_USE_HEAP + + Set this bit to 1 to indicate that the value entered in the + heap_end_ptr is valid. If this field is clear, some setup code + functionality will be disabled. + + +============ =================== +Field name: setup_move_size +Type: modify (obligatory) +Offset/size: 0x212/2 +Protocol: 2.00-2.01 +============ =================== + + When using protocol 2.00 or 2.01, if the real mode kernel is not + loaded at 0x90000, it gets moved there later in the loading + sequence. Fill in this field if you want additional data (such as + the kernel command line) moved in addition to the real-mode kernel + itself. + + The unit is bytes starting with the beginning of the boot sector. + + This field is can be ignored when the protocol is 2.02 or higher, or + if the real-mode code is loaded at 0x90000. + +============ ======================== +Field name: code32_start +Type: modify (optional, reloc) +Offset/size: 0x214/4 +Protocol: 2.00+ +============ ======================== + + The address to jump to in protected mode. This defaults to the load + address of the kernel, and can be used by the boot loader to + determine the proper load address. + + This field can be modified for two purposes: + + 1. as a boot loader hook (see Advanced Boot Loader Hooks below.) + + 2. if a bootloader which does not install a hook loads a + relocatable kernel at a nonstandard address it will have to modify + this field to point to the load address. + +============ ================== +Field name: ramdisk_image +Type: write (obligatory) +Offset/size: 0x218/4 +Protocol: 2.00+ +============ ================== + + The 32-bit linear address of the initial ramdisk or ramfs. Leave at + zero if there is no initial ramdisk/ramfs. + +============ ================== +Field name: ramdisk_size +Type: write (obligatory) +Offset/size: 0x21c/4 +Protocol: 2.00+ +============ ================== + + Size of the initial ramdisk or ramfs. Leave at zero if there is no + initial ramdisk/ramfs. + +============ =============== +Field name: bootsect_kludge +Type: kernel internal +Offset/size: 0x220/4 +Protocol: 2.00+ +============ =============== + + This field is obsolete. + +============ ================== +Field name: heap_end_ptr +Type: write (obligatory) +Offset/size: 0x224/2 +Protocol: 2.01+ +============ ================== + + Set this field to the offset (from the beginning of the real-mode + code) of the end of the setup stack/heap, minus 0x0200. + +============ ================ +Field name: ext_loader_ver +Type: write (optional) +Offset/size: 0x226/1 +Protocol: 2.02+ +============ ================ + + This field is used as an extension of the version number in the + type_of_loader field. The total version number is considered to be + (type_of_loader & 0x0f) + (ext_loader_ver << 4). + + The use of this field is boot loader specific. If not written, it + is zero. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + +============ ===================================================== +Field name: ext_loader_type +Type: write (obligatory if (type_of_loader & 0xf0) == 0xe0) +Offset/size: 0x227/1 +Protocol: 2.02+ +============ ===================================================== + + This field is used as an extension of the type number in + type_of_loader field. If the type in type_of_loader is 0xE, then + the actual type is (ext_loader_type + 0x10). + + This field is ignored if the type in type_of_loader is not 0xE. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + +============ ================== +Field name: cmd_line_ptr +Type: write (obligatory) +Offset/size: 0x228/4 +Protocol: 2.02+ +============ ================== + + Set this field to the linear address of the kernel command line. + The kernel command line can be located anywhere between the end of + the setup heap and 0xA0000; it does not have to be located in the + same 64K segment as the real-mode code itself. + + Fill in this field even if your boot loader does not support a + command line, in which case you can point this to an empty string + (or better yet, to the string "auto".) If this field is left at + zero, the kernel will assume that your boot loader does not support + the 2.02+ protocol. + +============ =============== +Field name: initrd_addr_max +Type: read +Offset/size: 0x22c/4 +Protocol: 2.03+ +============ =============== + + The maximum address that may be occupied by the initial + ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this + field is not present, and the maximum address is 0x37FFFFFF. (This + address is defined as the address of the highest safe byte, so if + your ramdisk is exactly 131072 bytes long and this field is + 0x37FFFFFF, you can start your ramdisk at 0x37FE0000.) + +============ ============================ +Field name: kernel_alignment +Type: read/modify (reloc) +Offset/size: 0x230/4 +Protocol: 2.05+ (read), 2.10+ (modify) +============ ============================ + + Alignment unit required by the kernel (if relocatable_kernel is + true.) A relocatable kernel that is loaded at an alignment + incompatible with the value in this field will be realigned during + kernel initialization. + + Starting with protocol version 2.10, this reflects the kernel + alignment preferred for optimal performance; it is possible for the + loader to modify this field to permit a lesser alignment. See the + min_alignment and pref_address field below. + +============ ================== +Field name: relocatable_kernel +Type: read (reloc) +Offset/size: 0x234/1 +Protocol: 2.05+ +============ ================== + + If this field is nonzero, the protected-mode part of the kernel can + be loaded at any address that satisfies the kernel_alignment field. + After loading, the boot loader must set the code32_start field to + point to the loaded code, or to a boot loader hook. + +============ ============= +Field name: min_alignment +Type: read (reloc) +Offset/size: 0x235/1 +Protocol: 2.10+ +============ ============= + + This field, if nonzero, indicates as a power of two the minimum + alignment required, as opposed to preferred, by the kernel to boot. + If a boot loader makes use of this field, it should update the + kernel_alignment field with the alignment unit desired; typically:: + + kernel_alignment = 1 << min_alignment + + There may be a considerable performance cost with an excessively + misaligned kernel. Therefore, a loader should typically try each + power-of-two alignment from kernel_alignment down to this alignment. + +============ ========== +Field name: xloadflags +Type: read +Offset/size: 0x236/2 +Protocol: 2.12+ +============ ========== + + This field is a bitmask. + + Bit 0 (read): XLF_KERNEL_64 + + - If 1, this kernel has the legacy 64-bit entry point at 0x200. + + Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G + + - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G. + + Bit 2 (read): XLF_EFI_HANDOVER_32 + + - If 1, the kernel supports the 32-bit EFI handoff entry point + given at handover_offset. + + Bit 3 (read): XLF_EFI_HANDOVER_64 + + - If 1, the kernel supports the 64-bit EFI handoff entry point + given at handover_offset + 0x200. + + Bit 4 (read): XLF_EFI_KEXEC + + - If 1, the kernel supports kexec EFI boot with EFI runtime support. + + +============ ============ +Field name: cmdline_size +Type: read +Offset/size: 0x238/4 +Protocol: 2.06+ +============ ============ + + The maximum size of the command line without the terminating + zero. This means that the command line can contain at most + cmdline_size characters. With protocol version 2.05 and earlier, the + maximum size was 255. + +============ ==================================== +Field name: hardware_subarch +Type: write (optional, defaults to x86/PC) +Offset/size: 0x23c/4 +Protocol: 2.07+ +============ ==================================== + + In a paravirtualized environment the hardware low level architectural + pieces such as interrupt handling, page table handling, and + accessing process control registers needs to be done differently. + + This field allows the bootloader to inform the kernel we are in one + one of those environments. + + ========== ============================== + 0x00000000 The default x86/PC environment + 0x00000001 lguest + 0x00000002 Xen + 0x00000003 Moorestown MID + 0x00000004 CE4100 TV Platform + ========== ============================== + +============ ========================= +Field name: hardware_subarch_data +Type: write (subarch-dependent) +Offset/size: 0x240/8 +Protocol: 2.07+ +============ ========================= + + A pointer to data that is specific to hardware subarch + This field is currently unused for the default x86/PC environment, + do not modify. + +============ ============== +Field name: payload_offset +Type: read +Offset/size: 0x248/4 +Protocol: 2.08+ +============ ============== + + If non-zero then this field contains the offset from the beginning + of the protected-mode code to the payload. + + The payload may be compressed. The format of both the compressed and + uncompressed data should be determined using the standard magic + numbers. The currently supported compression formats are gzip + (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA + (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number + 02 21) and ZSTD (magic number 28 B5). The uncompressed payload is + currently always ELF (magic number 7F 45 4C 46). + +============ ============== +Field name: payload_length +Type: read +Offset/size: 0x24c/4 +Protocol: 2.08+ +============ ============== + + The length of the payload. + +============ =============== +Field name: setup_data +Type: write (special) +Offset/size: 0x250/8 +Protocol: 2.09+ +============ =============== + + The 64-bit physical pointer to NULL terminated single linked list of + struct setup_data. This is used to define a more extensible boot + parameters passing mechanism. The definition of struct setup_data is + as follow:: + + struct setup_data { + u64 next; + u32 type; + u32 len; + u8 data[0]; + }; + + Where, the next is a 64-bit physical pointer to the next node of + linked list, the next field of the last node is 0; the type is used + to identify the contents of data; the len is the length of data + field; the data holds the real payload. + + This list may be modified at a number of points during the bootup + process. Therefore, when modifying this list one should always make + sure to consider the case where the linked list already contains + entries. + + The setup_data is a bit awkward to use for extremely large data objects, + both because the setup_data header has to be adjacent to the data object + and because it has a 32-bit length field. However, it is important that + intermediate stages of the boot process have a way to identify which + chunks of memory are occupied by kernel data. + + Thus setup_indirect struct and SETUP_INDIRECT type were introduced in + protocol 2.15:: + + struct setup_indirect { + __u32 type; + __u32 reserved; /* Reserved, must be set to zero. */ + __u64 len; + __u64 addr; + }; + + The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be + SETUP_INDIRECT itself since making the setup_indirect a tree structure + could require a lot of stack space in something that needs to parse it + and stack space can be limited in boot contexts. + + Let's give an example how to point to SETUP_E820_EXT data using setup_indirect. + In this case setup_data and setup_indirect will look like this:: + + struct setup_data { + __u64 next = 0 or <addr_of_next_setup_data_struct>; + __u32 type = SETUP_INDIRECT; + __u32 len = sizeof(setup_indirect); + __u8 data[sizeof(setup_indirect)] = struct setup_indirect { + __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; + __u32 reserved = 0; + __u64 len = <len_of_SETUP_E820_EXT_data>; + __u64 addr = <addr_of_SETUP_E820_EXT_data>; + } + } + +.. note:: + SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished + from SETUP_INDIRECT itself. So, this kind of objects cannot be provided + by the bootloaders. + +============ ============ +Field name: pref_address +Type: read (reloc) +Offset/size: 0x258/8 +Protocol: 2.10+ +============ ============ + + This field, if nonzero, represents a preferred load address for the + kernel. A relocating bootloader should attempt to load at this + address if possible. + + A non-relocatable kernel will unconditionally move itself and to run + at this address. + +============ ======= +Field name: init_size +Type: read +Offset/size: 0x260/4 +============ ======= + + This field indicates the amount of linear contiguous memory starting + at the kernel runtime start address that the kernel needs before it + is capable of examining its memory map. This is not the same thing + as the total amount of memory the kernel needs to boot, but it can + be used by a relocating boot loader to help select a safe load + address for the kernel. + + The kernel runtime start address is determined by the following algorithm:: + + if (relocatable_kernel) + runtime_start = align_up(load_address, kernel_alignment) + else + runtime_start = pref_address + +============ =============== +Field name: handover_offset +Type: read +Offset/size: 0x264/4 +============ =============== + + This field is the offset from the beginning of the kernel image to + the EFI handover protocol entry point. Boot loaders using the EFI + handover protocol to boot the kernel should jump to this offset. + + See EFI HANDOVER PROTOCOL below for more details. + +============ ================== +Field name: kernel_info_offset +Type: read +Offset/size: 0x268/4 +Protocol: 2.15+ +============ ================== + + This field is the offset from the beginning of the kernel image to the + kernel_info. The kernel_info structure is embedded in the Linux image + in the uncompressed protected mode region. + + +The kernel_info +=============== + +The relationships between the headers are analogous to the various data +sections: + + setup_header = .data + boot_params/setup_data = .bss + +What is missing from the above list? That's right: + + kernel_info = .rodata + +We have been (ab)using .data for things that could go into .rodata or .bss for +a long time, for lack of alternatives and -- especially early on -- inertia. +Also, the BIOS stub is responsible for creating boot_params, so it isn't +available to a BIOS-based loader (setup_data is, though). + +setup_header is permanently limited to 144 bytes due to the reach of the +2-byte jump field, which doubles as a length field for the structure, combined +with the size of the "hole" in struct boot_params that a protected-mode loader +or the BIOS stub has to copy it into. It is currently 119 bytes long, which +leaves us with 25 very precious bytes. This isn't something that can be fixed +without revising the boot protocol entirely, breaking backwards compatibility. + +boot_params proper is limited to 4096 bytes, but can be arbitrarily extended +by adding setup_data entries. It cannot be used to communicate properties of +the kernel image, because it is .bss and has no image-provided content. + +kernel_info solves this by providing an extensible place for information about +the kernel image. It is readonly, because the kernel cannot rely on a +bootloader copying its contents anywhere, but that is OK; if it becomes +necessary it can still contain data items that an enabled bootloader would be +expected to copy into a setup_data chunk. + +All kernel_info data should be part of this structure. Fixed size data have to +be put before kernel_info_var_len_data label. Variable size data have to be put +after kernel_info_var_len_data label. Each chunk of variable size data has to +be prefixed with header/magic and its size, e.g.:: + + kernel_info: + .ascii "LToP" /* Header, Linux top (structure). */ + .long kernel_info_var_len_data - kernel_info + .long kernel_info_end - kernel_info + .long 0x01234567 /* Some fixed size data for the bootloaders. */ + kernel_info_var_len_data: + example_struct: /* Some variable size data for the bootloaders. */ + .ascii "0123" /* Header/Magic. */ + .long example_struct_end - example_struct + .ascii "Struct" + .long 0x89012345 + example_struct_end: + example_strings: /* Some variable size data for the bootloaders. */ + .ascii "ABCD" /* Header/Magic. */ + .long example_strings_end - example_strings + .asciz "String_0" + .asciz "String_1" + example_strings_end: + kernel_info_end: + +This way the kernel_info is self-contained blob. + +.. note:: + Each variable size data header/magic can be any 4-character string, + without \0 at the end of the string, which does not collide with + existing variable length data headers/magics. + + +Details of the kernel_info Fields +================================= + +============ ======== +Field name: header +Offset/size: 0x0000/4 +============ ======== + + Contains the magic number "LToP" (0x506f544c). + +============ ======== +Field name: size +Offset/size: 0x0004/4 +============ ======== + + This field contains the size of the kernel_info including kernel_info.header. + It does not count kernel_info.kernel_info_var_len_data size. This field should be + used by the bootloaders to detect supported fixed size fields in the kernel_info + and beginning of kernel_info.kernel_info_var_len_data. + +============ ======== +Field name: size_total +Offset/size: 0x0008/4 +============ ======== + + This field contains the size of the kernel_info including kernel_info.header + and kernel_info.kernel_info_var_len_data. + +============ ============== +Field name: setup_type_max +Offset/size: 0x000c/4 +============ ============== + + This field contains maximal allowed type for setup_data and setup_indirect structs. + + +The Image Checksum +================== + +From boot protocol version 2.08 onwards the CRC-32 is calculated over +the entire file using the characteristic polynomial 0x04C11DB7 and an +initial remainder of 0xffffffff. The checksum is appended to the +file; therefore the CRC of the file up to the limit specified in the +syssize field of the header is always 0. + + +The Kernel Command Line +======================= + +The kernel command line has become an important way for the boot +loader to communicate with the kernel. Some of its options are also +relevant to the boot loader itself, see "special command line options" +below. + +The kernel command line is a null-terminated string. The maximum +length can be retrieved from the field cmdline_size. Before protocol +version 2.06, the maximum was 255 characters. A string that is too +long will be automatically truncated by the kernel. + +If the boot protocol version is 2.02 or later, the address of the +kernel command line is given by the header field cmd_line_ptr (see +above.) This address can be anywhere between the end of the setup +heap and 0xA0000. + +If the protocol version is *not* 2.02 or higher, the kernel +command line is entered using the following protocol: + + - At offset 0x0020 (word), "cmd_line_magic", enter the magic + number 0xA33F. + + - At offset 0x0022 (word), "cmd_line_offset", enter the offset + of the kernel command line (relative to the start of the + real-mode kernel). + + - The kernel command line *must* be within the memory region + covered by setup_move_size, so you may need to adjust this + field. + + +Memory Layout of The Real-Mode Code +=================================== + +The real-mode code requires a stack/heap to be set up, as well as +memory allocated for the kernel command line. This needs to be done +in the real-mode accessible memory in bottom megabyte. + +It should be noted that modern machines often have a sizable Extended +BIOS Data Area (EBDA). As a result, it is advisable to use as little +of the low megabyte as possible. + +Unfortunately, under the following circumstances the 0x90000 memory +segment has to be used: + + - When loading a zImage kernel ((loadflags & 0x01) == 0). + - When loading a 2.01 or earlier boot protocol kernel. + +.. note:: + For the 2.00 and 2.01 boot protocols, the real-mode code + can be loaded at another address, but it is internally + relocated to 0x90000. For the "old" protocol, the + real-mode code must be loaded at 0x90000. + +When loading at 0x90000, avoid using memory above 0x9a000. + +For boot protocol 2.02 or higher, the command line does not have to be +located in the same 64K segment as the real-mode setup code; it is +thus permitted to give the stack/heap the full 64K segment and locate +the command line above it. + +The kernel command line should not be located below the real-mode +code, nor should it be located in high memory. + + +Sample Boot Configuartion +========================= + +As a sample configuration, assume the following layout of the real +mode segment. + + When loading below 0x90000, use the entire segment: + + ============= =================== + 0x0000-0x7fff Real mode kernel + 0x8000-0xdfff Stack and heap + 0xe000-0xffff Kernel command line + ============= =================== + + When loading at 0x90000 OR the protocol version is 2.01 or earlier: + + ============= =================== + 0x0000-0x7fff Real mode kernel + 0x8000-0x97ff Stack and heap + 0x9800-0x9fff Kernel command line + ============= =================== + +Such a boot loader should enter the following fields in the header:: + + unsigned long base_ptr; /* base address for real-mode segment */ + + if ( setup_sects == 0 ) { + setup_sects = 4; + } + + if ( protocol >= 0x0200 ) { + type_of_loader = <type code>; + if ( loading_initrd ) { + ramdisk_image = <initrd_address>; + ramdisk_size = <initrd_size>; + } + + if ( protocol >= 0x0202 && loadflags & 0x01 ) + heap_end = 0xe000; + else + heap_end = 0x9800; + + if ( protocol >= 0x0201 ) { + heap_end_ptr = heap_end - 0x200; + loadflags |= 0x80; /* CAN_USE_HEAP */ + } + + if ( protocol >= 0x0202 ) { + cmd_line_ptr = base_ptr + heap_end; + strcpy(cmd_line_ptr, cmdline); + } else { + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + setup_move_size = heap_end + strlen(cmdline)+1; + strcpy(base_ptr+cmd_line_offset, cmdline); + } + } else { + /* Very old kernel */ + + heap_end = 0x9800; + + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + + /* A very old kernel MUST have its real-mode code + loaded at 0x90000 */ + + if ( base_ptr != 0x90000 ) { + /* Copy the real-mode kernel */ + memcpy(0x90000, base_ptr, (setup_sects+1)*512); + base_ptr = 0x90000; /* Relocated */ + } + + strcpy(0x90000+cmd_line_offset, cmdline); + + /* It is recommended to clear memory up to the 32K mark */ + memset(0x90000 + (setup_sects+1)*512, 0, + (64-(setup_sects+1))*512); + } + + +Loading The Rest of The Kernel +============================== + +The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +in the kernel file (again, if setup_sects == 0 the real value is 4.) +It should be loaded at address 0x10000 for Image/zImage kernels and +0x100000 for bzImage kernels. + +The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 +bit (LOAD_HIGH) in the loadflags field is set:: + + is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); + load_address = is_bzImage ? 0x100000 : 0x10000; + +Note that Image/zImage kernels can be up to 512K in size, and thus use +the entire 0x10000-0x90000 range of memory. This means it is pretty +much a requirement for these kernels to load the real-mode part at +0x90000. bzImage kernels allow much more flexibility. + +Special Command Line Options +============================ + +If the command line provided by the boot loader is entered by the +user, the user may expect the following command line options to work. +They should normally not be deleted from the kernel command line even +though not all of them are actually meaningful to the kernel. Boot +loader authors who need additional command line options for the boot +loader itself should get them registered in +Documentation/admin-guide/kernel-parameters.rst to make sure they will not +conflict with actual kernel options now or in the future. + + vga=<mode> + <mode> here is either an integer (in C notation, either + decimal, octal, or hexadecimal) or one of the strings + "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask" + (meaning 0xFFFD). This value should be entered into the + vid_mode field, as it is used by the kernel before the command + line is parsed. + + mem=<size> + <size> is an integer in C notation optionally followed by + (case insensitive) K, M, G, T, P or E (meaning << 10, << 20, + << 30, << 40, << 50 or << 60). This specifies the end of + memory to the kernel. This affects the possible placement of + an initrd, since an initrd should be placed near end of + memory. Note that this is an option to *both* the kernel and + the bootloader! + + initrd=<file> + An initrd should be loaded. The meaning of <file> is + obviously bootloader-dependent, and some boot loaders + (e.g. LILO) do not have such a command. + +In addition, some boot loaders add the following options to the +user-specified command line: + + BOOT_IMAGE=<file> + The boot image which was loaded. Again, the meaning of <file> + is obviously bootloader-dependent. + + auto + The kernel was booted without explicit user intervention. + +If these options are added by the boot loader, it is highly +recommended that they are located *first*, before the user-specified +or configuration-specified command line. Otherwise, "init=/bin/sh" +gets confused by the "auto" option. + + +Running the Kernel +================== + +The kernel is started by jumping to the kernel entry point, which is +located at *segment* offset 0x20 from the start of the real mode +kernel. This means that if you loaded your real-mode kernel code at +0x90000, the kernel entry point is 9020:0000. + +At entry, ds = es = ss should point to the start of the real-mode +kernel code (0x9000 if the code is loaded at 0x90000), sp should be +set up properly, normally pointing to the top of the heap, and +interrupts should be disabled. Furthermore, to guard against bugs in +the kernel, it is recommended that the boot loader sets fs = gs = ds = +es = ss. + +In our example from above, we would do:: + + /* Note: in the case of the "old" kernel protocol, base_ptr must + be == 0x90000 at this point; see the previous sample code */ + + seg = base_ptr >> 4; + + cli(); /* Enter with interrupts disabled! */ + + /* Set up the real-mode kernel stack */ + _SS = seg; + _SP = heap_end; + + _DS = _ES = _FS = _GS = seg; + jmp_far(seg+0x20, 0); /* Run the kernel */ + +If your boot sector accesses a floppy drive, it is recommended to +switch off the floppy motor before running the kernel, since the +kernel boot leaves interrupts off and thus the motor will not be +switched off, especially if the loaded kernel has the floppy driver as +a demand-loaded module! + + +Advanced Boot Loader Hooks +========================== + +If the boot loader runs in a particularly hostile environment (such as +LOADLIN, which runs under DOS) it may be impossible to follow the +standard memory location requirements. Such a boot loader may use the +following hooks that, if set, are invoked by the kernel at the +appropriate time. The use of these hooks should probably be +considered an absolutely last resort! + +IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and +%edi across invocation. + + realmode_swtch: + A 16-bit real mode far subroutine invoked immediately before + entering protected mode. The default routine disables NMI, so + your routine should probably do so, too. + + code32_start: + A 32-bit flat-mode routine *jumped* to immediately after the + transition to protected mode, but before the kernel is + uncompressed. No segments, except CS, are guaranteed to be + set up (current kernels do, but older ones do not); you should + set them up to BOOT_DS (0x18) yourself. + + After completing your hook, you should jump to the address + that was in this field before your boot loader overwrote it + (relocated, if appropriate.) + + +32-bit Boot Protocol +==================== + +For machine with some new BIOS other than legacy BIOS, such as EFI, +LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel +based on legacy BIOS can not be used, so a 32-bit boot protocol needs +to be defined. + +In 32-bit boot protocol, the first step in loading a Linux kernel +should be to setup the boot parameters (struct boot_params, +traditionally known as "zero page"). The memory for struct boot_params +should be allocated and initialized to all zero. Then the setup header +from offset 0x01f1 of kernel image on should be loaded into struct +boot_params and examined. The end of setup header can be calculated as +follow:: + + 0x0202 + byte value at offset 0x0201 + +In addition to read/modify/write the setup header of the struct +boot_params as that of 16-bit boot protocol, the boot loader should +also fill the additional fields of the struct boot_params as +described in chapter Documentation/arch/x86/zero-page.rst. + +After setting up the struct boot_params, the boot loader can load the +32/64-bit kernel in the same way as that of 16-bit boot protocol. + +In 32-bit boot protocol, the kernel is started by jumping to the +32-bit kernel entry point, which is the start address of loaded +32/64-bit kernel. + +At entry, the CPU must be in 32-bit protected mode with paging +disabled; a GDT must be loaded with the descriptors for selectors +__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat +segment; __BOOT_CS must have execute/read permission, and __BOOT_DS +must have read/write permission; CS must be __BOOT_CS and DS, ES, SS +must be __BOOT_DS; interrupt must be disabled; %esi must hold the base +address of the struct boot_params; %ebp, %edi and %ebx must be zero. + +64-bit Boot Protocol +==================== + +For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader +and we need a 64-bit boot protocol. + +In 64-bit boot protocol, the first step in loading a Linux kernel +should be to setup the boot parameters (struct boot_params, +traditionally known as "zero page"). The memory for struct boot_params +could be allocated anywhere (even above 4G) and initialized to all zero. +Then, the setup header at offset 0x01f1 of kernel image on should be +loaded into struct boot_params and examined. The end of setup header +can be calculated as follows:: + + 0x0202 + byte value at offset 0x0201 + +In addition to read/modify/write the setup header of the struct +boot_params as that of 16-bit boot protocol, the boot loader should +also fill the additional fields of the struct boot_params as described +in chapter Documentation/arch/x86/zero-page.rst. + +After setting up the struct boot_params, the boot loader can load +64-bit kernel in the same way as that of 16-bit boot protocol, but +kernel could be loaded above 4G. + +In 64-bit boot protocol, the kernel is started by jumping to the +64-bit kernel entry point, which is the start address of loaded +64-bit kernel plus 0x200. + +At entry, the CPU must be in 64-bit mode with paging enabled. +The range with setup_header.init_size from start address of loaded +kernel and zero page and command line buffer get ident mapping; +a GDT must be loaded with the descriptors for selectors +__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat +segment; __BOOT_CS must have execute/read permission, and __BOOT_DS +must have read/write permission; CS must be __BOOT_CS and DS, ES, SS +must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base +address of the struct boot_params. + +EFI Handover Protocol (deprecated) +================================== + +This protocol allows boot loaders to defer initialisation to the EFI +boot stub. The boot loader is required to load the kernel/initrd(s) +from the boot media and jump to the EFI handover protocol entry point +which is hdr->handover_offset bytes from the beginning of +startup_{32,64}. + +The boot loader MUST respect the kernel's PE/COFF metadata when it comes +to section alignment, the memory footprint of the executable image beyond +the size of the file itself, and any other aspect of the PE/COFF header +that may affect correct operation of the image as a PE/COFF binary in the +execution context provided by the EFI firmware. + +The function prototype for the handover entry point looks like this:: + + efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp) + +'handle' is the EFI image handle passed to the boot loader by the EFI +firmware, 'table' is the EFI system table - these are the first two +arguments of the "handoff state" as described in section 2.3 of the +UEFI specification. 'bp' is the boot loader-allocated boot params. + +The boot loader *must* fill out the following fields in bp:: + + - hdr.cmd_line_ptr + - hdr.ramdisk_image (if applicable) + - hdr.ramdisk_size (if applicable) + +All other fields should be zero. + +NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF + entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd + loading protocol (refer to [0] for an example of the bootloader side of + this), which removes the need for any knowledge on the part of the EFI + bootloader regarding the internal representation of boot_params or any + requirements/limitations regarding the placement of the command line + and ramdisk in memory, or the placement of the kernel image itself. + +[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0 diff --git a/Documentation/arch/x86/booting-dt.rst b/Documentation/arch/x86/booting-dt.rst new file mode 100644 index 000000000000..b089ffd56e6e --- /dev/null +++ b/Documentation/arch/x86/booting-dt.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +DeviceTree Booting +------------------ + + There is one single 32bit entry point to the kernel at code32_start, + the decompressor (the real mode entry point goes to the same 32bit + entry point once it switched into protected mode). That entry point + supports one calling convention which is documented in + Documentation/arch/x86/boot.rst + The physical pointer to the device-tree block is passed via setup_data + which requires at least boot protocol 2.09. + The type filed is defined as + + #define SETUP_DTB 2 + + This device-tree is used as an extension to the "boot page". As such it + does not parse / consider data which is already covered by the boot + page. This includes memory size, reserved ranges, command line arguments + or initrd address. It simply holds information which can not be retrieved + otherwise like interrupt routing or a list of devices behind an I2C bus. diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst new file mode 100644 index 000000000000..31ec0ef78086 --- /dev/null +++ b/Documentation/arch/x86/buslock.rst @@ -0,0 +1,132 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: <isonum.txt> + +=============================== +Bus lock detection and handling +=============================== + +:Copyright: |copy| 2021 Intel Corporation +:Authors: - Fenghua Yu <fenghua.yu@intel.com> + - Tony Luck <tony.luck@intel.com> + +Problem +======= + +A split lock is any atomic operation whose operand crosses two cache lines. +Since the operand spans two cache lines and the operation must be atomic, +the system locks the bus while the CPU accesses the two cache lines. + +A bus lock is acquired through either split locked access to writeback (WB) +memory or any locked access to non-WB memory. This is typically thousands of +cycles slower than an atomic operation within a cache line. It also disrupts +performance on other cores and brings the whole system to its knees. + +Detection +========= + +Intel processors may support either or both of the following hardware +mechanisms to detect split locks and bus locks. + +#AC exception for split lock detection +-------------------------------------- + +Beginning with the Tremont Atom CPU split lock operations may raise an +Alignment Check (#AC) exception when a split lock operation is attemped. + +#DB exception for bus lock detection +------------------------------------ + +Some CPUs have the ability to notify the kernel by an #DB trap after a user +instruction acquires a bus lock and is executed. This allows the kernel to +terminate the application or to enforce throttling. + +Software handling +================= + +The kernel #AC and #DB handlers handle bus lock based on the kernel +parameter "split_lock_detect". Here is a summary of different options: + ++------------------+----------------------------+-----------------------+ +|split_lock_detect=|#AC for split lock |#DB for bus lock | ++------------------+----------------------------+-----------------------+ +|off |Do nothing |Do nothing | ++------------------+----------------------------+-----------------------+ +|warn |Kernel OOPs |Warn once per task and | +|(default) |Warn once per task, add a |and continues to run. | +| |delay, add synchronization | | +| |to prevent more than one | | +| |core from executing a | | +| |split lock in parallel. | | +| |sysctl split_lock_mitigate | | +| |can be used to avoid the | | +| |delay and synchronization | | +| |When both features are | | +| |supported, warn in #AC | | ++------------------+----------------------------+-----------------------+ +|fatal |Kernel OOPs |Send SIGBUS to user. | +| |Send SIGBUS to user | | +| |When both features are | | +| |supported, fatal in #AC | | ++------------------+----------------------------+-----------------------+ +|ratelimit:N |Do nothing |Limit bus lock rate to | +|(0 < N <= 1000) | |N bus locks per second | +| | |system wide and warn on| +| | |bus locks. | ++------------------+----------------------------+-----------------------+ + +Usages +====== + +Detecting and handling bus lock may find usages in various areas: + +It is critical for real time system designers who build consolidated real +time systems. These systems run hard real time code on some cores and run +"untrusted" user processes on other cores. The hard real time cannot afford +to have any bus lock from the untrusted processes to hurt real time +performance. To date the designers have been unable to deploy these +solutions as they have no way to prevent the "untrusted" user code from +generating split lock and bus lock to block the hard real time code to +access memory during bus locking. + +It's also useful for general computing to prevent guests or user +applications from slowing down the overall system by executing instructions +with bus lock. + + +Guidance +======== +off +--- + +Disable checking for split lock and bus lock. This option can be useful if +there are legacy applications that trigger these events at a low rate so +that mitigation is not needed. + +warn +---- + +A warning is emitted when a bus lock is detected which allows to identify +the offending application. This is the default behavior. + +fatal +----- + +In this case, the bus lock is not tolerated and the process is killed. + +ratelimit +--------- + +A system wide bus lock rate limit N is specified where 0 < N <= 1000. This +allows a bus lock rate up to N bus locks per second. When the bus lock rate +is exceeded then any task which is caught via the buslock #DB exception is +throttled by enforced sleeps until the rate goes under the limit again. + +This is an effective mitigation in cases where a minimal impact can be +tolerated, but an eventual Denial of Service attack has to be prevented. It +allows to identify the offending processes and analyze whether they are +malicious or just badly written. + +Selecting a rate limit of 1000 allows the bus to be locked for up to about +seven million cycles each second (assuming 7000 cycles for each bus +lock). On a 2 GHz processor that would be about 0.35% system slowdown. diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst new file mode 100644 index 000000000000..08246e8ac835 --- /dev/null +++ b/Documentation/arch/x86/cpuinfo.rst @@ -0,0 +1,154 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +x86 Feature Flags +================= + +Introduction +============ + +On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition +in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature +or KVM want to expose the feature to a KVM guest, it can and should have +an X86_FEATURE_* defined. These flags represent hardware features as +well as software features. + +If users want to know if a feature is available on a given system, they +try to find the flag in /proc/cpuinfo. If a given flag is present, it +means that the kernel supports it and is currently making it available. +If such flag represents a hardware feature, it also means that the +hardware supports it. + +If the expected flag does not appear in /proc/cpuinfo, things are murkier. +Users need to find out the reason why the flag is missing and find the way +how to enable it, which is not always easy. There are several factors that +can explain missing flags: the expected feature failed to enable, the feature +is missing in hardware, platform firmware did not enable it, the feature is +disabled at build or run time, an old kernel is in use, or the kernel does +not support the feature and thus has not enabled it. In general, /proc/cpuinfo +shows features which the kernel supports. For a full list of CPUID flags +which the CPU supports, use tools/arch/x86/kcpuid. + +How are feature flags created? +============================== + +a: Feature flags can be derived from the contents of CPUID leaves. +------------------------------------------------------------------ +These feature definitions are organized mirroring the layout of CPUID +leaves and grouped in words with offsets as mapped in enum cpuid_leafs +in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details). +If a feature is defined with a X86_FEATURE_<name> definition in +cpufeatures.h, and if it is detected at run time, the flags will be +displayed accordingly in /proc/cpuinfo. For example, the flag "avx2" +comes from X86_FEATURE_AVX2 in cpufeatures.h. + +b: Flags can be from scattered CPUID-based features. +---------------------------------------------------- +Hardware features enumerated in sparsely populated CPUID leaves get +software-defined values. Still, CPUID needs to be queried to determine +if a given feature is present. This is done in init_scattered_cpuid_features(). +For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is +checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1]. + +The intent of scattering CPUID leaves is to not bloat struct +cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf +[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1] +has only one feature and would waste 31 bits of space in the x86_capability[] +array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted +memory is not trivial. + +c: Flags can be created synthetically under certain conditions for hardware features. +------------------------------------------------------------------------------------- +Examples of conditions include whether certain features are present in +MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed +conditions are met, the features are enabled by the set_cpu_cap or +setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, +the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and +"split_lock_detect" will be displayed. The flag "ring3mwait" will be +displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. + +d: Flags can represent purely software features. +------------------------------------------------ +These flags do not represent hardware features. Instead, they represent a +software feature implemented in the kernel. For example, Kernel Page Table +Isolation is purely software feature and its feature flag X86_FEATURE_PTI is +also defined in cpufeatures.h. + +Naming of Flags +=============== + +The script arch/x86/kernel/cpu/mkcapflags.sh processes the +#define X86_FEATURE_<name> from cpufeatures.h and generates the +x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the +resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming +of flags in the x86_cap/bug_flags[] are as follows: + +a: The name of the flag is from the string in X86_FEATURE_<name> by default. +---------------------------------------------------------------------------- +By default, the flag <name> in /proc/cpuinfo is extracted from the respective +X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from +X86_FEATURE_AVX2. + +b: The naming can be overridden. +-------------------------------- +If the comment on the line for the #define X86_FEATURE_* starts with a +double-quote character (""), the string inside the double-quote characters +will be the name of the flags. For example, the flag "sse4_1" comes from +the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition. + +There are situations in which overriding the displayed name of the flag is +needed. For instance, /proc/cpuinfo is a userspace interface and must remain +constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one +shall override the new naming with the name already used in /proc/cpuinfo. + +c: The naming override can be "", which means it will not appear in /proc/cpuinfo. +---------------------------------------------------------------------------------- +The feature shall be omitted from /proc/cpuinfo if it does not make sense for +the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is +defined in cpufeatures.h but that flag is an internal kernel feature used +in the alternative runtime patching functionality. So, its name is overridden +with "". Its flag will not appear in /proc/cpuinfo. + +Flags are missing when one or more of these happen +================================================== + +a: The hardware does not enumerate support for it. +-------------------------------------------------- +For example, when a new kernel is running on old hardware or the feature is +not enabled by boot firmware. Even if the hardware is new, there might be a +problem enabling the feature at run time, the flag will not be displayed. + +b: The kernel does not know about the flag. +------------------------------------------- +For example, when an old kernel is running on new hardware. + +c: The kernel disabled support for it at compile-time. +------------------------------------------------------ +For example, if 5-level-paging is not enabled when building (i.e., +CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_. +Even though the feature will still be detected via CPUID, the kernel disables +it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57). + +d: The feature is disabled at boot-time. +---------------------------------------- +A feature can be disabled either using a command-line parameter or because +it failed to be enabled. The command-line parameter clearcpuid= can be used +to disable features using the feature number as defined in +/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction +Protection can be disabled using clearcpuid=514. The number 514 is calculated +from #define X86_FEATURE_UMIP (16*32 + 2). + +In addition, there exists a variety of custom command-line parameters that +disable specific features. The list of parameters includes, but is not limited +to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using +"no5lvl". + +e: The feature was known to be non-functional. +---------------------------------------------- +The feature was known to be non-functional because a dependency was +missing at runtime. For example, AVX flags will not show up if XSAVE feature +is disabled since they depend on XSAVE feature. Another example would be broken +CPUs and them missing microcode patches. Due to that, the kernel decides not to +enable a feature. + +.. [#f1] 5-level paging uses linear address of 57 bits. diff --git a/Documentation/arch/x86/earlyprintk.rst b/Documentation/arch/x86/earlyprintk.rst new file mode 100644 index 000000000000..51ef11e8f725 --- /dev/null +++ b/Documentation/arch/x86/earlyprintk.rst @@ -0,0 +1,151 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +Early Printk +============ + +Mini-HOWTO for using the earlyprintk=dbgp boot option with a +USB2 Debug port key and a debug cable, on x86 systems. + +You need two computers, the 'USB debug key' special gadget and +two USB cables, connected like this:: + + [host/target] <-------> [USB debug key] <-------> [client/console] + +Hardware requirements +===================== + + a) Host/target system needs to have USB debug port capability. + + You can check this capability by looking at a 'Debug port' bit in + the lspci -vvv output:: + + # lspci -vvv + ... + 00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI]) + Subsystem: Lenovo ThinkPad T61 + Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- + Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- + Latency: 0 + Interrupt: pin D routed to IRQ 19 + Region 0: Memory at fe227000 (32-bit, non-prefetchable) [size=1K] + Capabilities: [50] Power Management version 2 + Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+) + Status: D0 PME-Enable- DSel=0 DScale=0 PME+ + Capabilities: [58] Debug port: BAR=1 offset=00a0 + ^^^^^^^^^^^ <==================== [ HERE ] + Kernel driver in use: ehci_hcd + Kernel modules: ehci-hcd + ... + + .. note:: + If your system does not list a debug port capability then you probably + won't be able to use the USB debug key. + + b) You also need a NetChip USB debug cable/key: + + http://www.plxtech.com/products/NET2000/NET20DC/default.asp + + This is a small blue plastic connector with two USB connections; + it draws power from its USB connections. + + c) You need a second client/console system with a high speed USB 2.0 port. + + d) The NetChip device must be plugged directly into the physical + debug port on the "host/target" system. You cannot use a USB hub in + between the physical debug port and the "host/target" system. + + The EHCI debug controller is bound to a specific physical USB + port and the NetChip device will only work as an early printk + device in this port. The EHCI host controllers are electrically + wired such that the EHCI debug controller is hooked up to the + first physical port and there is no way to change this via software. + You can find the physical port through experimentation by trying + each physical port on the system and rebooting. Or you can try + and use lsusb or look at the kernel info messages emitted by the + usb stack when you plug a usb device into various ports on the + "host/target" system. + + Some hardware vendors do not expose the usb debug port with a + physical connector and if you find such a device send a complaint + to the hardware vendor, because there is no reason not to wire + this port into one of the physically accessible ports. + + e) It is also important to note, that many versions of the NetChip + device require the "client/console" system to be plugged into the + right hand side of the device (with the product logo facing up and + readable left to right). The reason being is that the 5 volt + power supply is taken from only one side of the device and it + must be the side that does not get rebooted. + +Software requirements +===================== + + a) On the host/target system: + + You need to enable the following kernel config option:: + + CONFIG_EARLY_PRINTK_DBGP=y + + And you need to add the boot command line: "earlyprintk=dbgp". + + .. note:: + If you are using Grub, append it to the 'kernel' line in + /etc/grub.conf. If you are using Grub2 on a BIOS firmware system, + append it to the 'linux' line in /boot/grub2/grub.cfg. If you are + using Grub2 on an EFI firmware system, append it to the 'linux' + or 'linuxefi' line in /boot/grub2/grub.cfg or + /boot/efi/EFI/<distro>/grub.cfg. + + On systems with more than one EHCI debug controller you must + specify the correct EHCI debug controller number. The ordering + comes from the PCI bus enumeration of the EHCI controllers. The + default with no number argument is "0" or the first EHCI debug + controller. To use the second EHCI debug controller, you would + use the command line: "earlyprintk=dbgp1" + + .. note:: + normally earlyprintk console gets turned off once the + regular console is alive - use "earlyprintk=dbgp,keep" to keep + this channel open beyond early bootup. This can be useful for + debugging crashes under Xorg, etc. + + b) On the client/console system: + + You should enable the following kernel config option:: + + CONFIG_USB_SERIAL_DEBUG=y + + On the next bootup with the modified kernel you should + get a /dev/ttyUSBx device(s). + + Now this channel of kernel messages is ready to be used: start + your favorite terminal emulator (minicom, etc.) and set + it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to + see the raw output. + + c) On Nvidia Southbridge based systems: the kernel will try to probe + and find out which port has a debug device connected. + +Testing +======= + +You can test the output by using earlyprintk=dbgp,keep and provoking +kernel messages on the host/target system. You can provoke a harmless +kernel message by for example doing:: + + echo h > /proc/sysrq-trigger + +On the host/target system you should see this help line in "dmesg" output:: + + SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) + +On the client/console system do:: + + cat /dev/ttyUSB0 + +And you should see the help line above displayed shortly after you've +provoked it on the host system. + +If it does not work then please ask about it on the linux-kernel@vger.kernel.org +mailing list or contact the x86 maintainers. diff --git a/Documentation/arch/x86/elf_auxvec.rst b/Documentation/arch/x86/elf_auxvec.rst new file mode 100644 index 000000000000..18e4744717f9 --- /dev/null +++ b/Documentation/arch/x86/elf_auxvec.rst @@ -0,0 +1,53 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +x86-specific ELF Auxiliary Vectors +================================== + +This document describes the semantics of the x86 auxiliary vectors. + +Introduction +============ + +ELF Auxiliary vectors enable the kernel to efficiently provide +configuration-specific parameters to userspace. In this example, a program +allocates an alternate stack based on the kernel-provided size:: + + #include <sys/auxv.h> + #include <elf.h> + #include <signal.h> + #include <stdlib.h> + #include <assert.h> + #include <err.h> + + #ifndef AT_MINSIGSTKSZ + #define AT_MINSIGSTKSZ 51 + #endif + + .... + stack_t ss; + + ss.ss_sp = malloc(ss.ss_size); + assert(ss.ss_sp); + + ss.ss_size = getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ; + ss.ss_flags = 0; + + if (sigaltstack(&ss, NULL)) + err(1, "sigaltstack"); + + +The exposed auxiliary vectors +============================= + +AT_SYSINFO is used for locating the vsyscall entry point. It is not +exported on 64-bit mode. + +AT_SYSINFO_EHDR is the start address of the page containing the vDSO. + +AT_MINSIGSTKSZ denotes the minimum stack size required by the kernel to +deliver a signal to user-space. AT_MINSIGSTKSZ comprehends the space +consumed by the kernel to accommodate the user context for the current +hardware configuration. It does not comprehend subsequent user-space stack +consumption, which must be added by the user. (e.g. Above, user-space adds +SIGSTKSZ to AT_MINSIGSTKSZ.) diff --git a/Documentation/arch/x86/entry_64.rst b/Documentation/arch/x86/entry_64.rst new file mode 100644 index 000000000000..0afdce3c06f4 --- /dev/null +++ b/Documentation/arch/x86/entry_64.rst @@ -0,0 +1,110 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Kernel Entries +============== + +This file documents some of the kernel entries in +arch/x86/entry/entry_64.S. A lot of this explanation is adapted from +an email from Ingo Molnar: + +https://lore.kernel.org/r/20110529191055.GC9835%40elte.hu + +The x86 architecture has quite a few different ways to jump into +kernel code. Most of these entry points are registered in +arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S +for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally +arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility +syscall entry points and thus provides for 32-bit processes the +ability to execute syscalls when running on 64-bit kernels. + +The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h. + +Some of these entries are: + + - system_call: syscall instruction from 64-bit code. + + - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall + either way. + + - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit + code + + - interrupt: An array of entries. Every IDT vector that doesn't + explicitly point somewhere else gets set to the corresponding + value in interrupts. These point to a whole array of + magically-generated functions that make their way to common_interrupt() + with the interrupt number as a parameter. + + - APIC interrupts: Various special-purpose interrupts for things + like TLB shootdown. + + - Architecturally-defined exceptions like divide_error. + +There are a few complexities here. The different x86-64 entries +have different calling conventions. The syscall and sysenter +instructions have their own peculiar calling conventions. Some of +the IDT entries push an error code onto the stack; others don't. +IDT entries using the IST alternative stack mechanism need their own +magic to get the stack frames right. (You can find some +documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, +Volume 3, Chapter 6.) + +Dealing with the swapgs instruction is especially tricky. Swapgs +toggles whether gs is the kernel gs or the user gs. The swapgs +instruction is rather fragile: it must nest perfectly and only in +single depth, it should only be used if entering from user mode to +kernel mode and then when returning to user-space, and precisely +so. If we mess that up even slightly, we crash. + +So when we have a secondary entry, already in kernel mode, we *must +not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's +not switched/swapped yet. + +Now, there's a secondary complication: there's a cheap way to test +which mode the CPU is in and an expensive way. + +The cheap way is to pick this info off the entry frame on the kernel +stack, from the CS of the ptregs area of the kernel stack:: + + xorl %ebx,%ebx + testl $3,CS+8(%rsp) + je error_kernelspace + SWAPGS + +The expensive (paranoid) way is to read back the MSR_GS_BASE value +(which is what SWAPGS modifies):: + + movl $1,%ebx + movl $MSR_GS_BASE,%ecx + rdmsr + testl %edx,%edx + js 1f /* negative -> in kernel */ + SWAPGS + xorl %ebx,%ebx + 1: ret + +If we are at an interrupt or user-trap/gate-alike boundary then we can +use the faster check: the stack will be a reliable indicator of +whether SWAPGS was already done: if we see that we are a secondary +entry interrupting kernel mode execution, then we know that the GS +base has already been switched. If it says that we interrupted +user-space execution then we must do the SWAPGS. + +But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, +which might have triggered right after a normal entry wrote CS to the +stack but before we executed SWAPGS, then the only safe way to check +for GS is the slower method: the RDMSR. + +Therefore, super-atomic entries (except NMI, which is handled separately) +must use idtentry with paranoid=1 to handle gsbase correctly. This +triggers three main behavior changes: + + - Interrupt entry will use the slower gsbase check. + - Interrupt entry from user mode will switch off the IST stack. + - Interrupt exit to kernel mode will not attempt to reschedule. + +We try to only use IST entries and the paranoid entry code for vectors +that absolutely need the more expensive check for the GS base - and we +generate all 'normal' entry points with the regular (faster) paranoid=0 +variant. diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst new file mode 100644 index 000000000000..efde1fef4fbd --- /dev/null +++ b/Documentation/arch/x86/exception-tables.rst @@ -0,0 +1,357 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Kernel level exception handling +=============================== + +Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> + +When a process runs in kernel mode, it often has to access user +mode memory whose address has been passed by an untrusted program. +To protect itself the kernel has to verify this address. + +In older versions of Linux this was done with the +int verify_area(int type, const void * addr, unsigned long size) +function (which has since been replaced by access_ok()). + +This function verified that the memory area starting at address +'addr' and of size 'size' was accessible for the operation specified +in type (read or write). To do this, verify_read had to look up the +virtual memory area (vma) that contained the address addr. In the +normal case (correctly working program), this test was successful. +It only failed for a few buggy programs. In some kernel profiling +tests, this normally unneeded verification used up a considerable +amount of time. + +To overcome this situation, Linus decided to let the virtual memory +hardware present in every Linux-capable CPU handle this test. + +How does this work? + +Whenever the kernel tries to access an address that is currently not +accessible, the CPU generates a page fault exception and calls the +page fault handler:: + + void exc_page_fault(struct pt_regs *regs, unsigned long error_code) + +in arch/x86/mm/fault.c. The parameters on the stack are set up by +the low level assembly glue in arch/x86/entry/entry_32.S. The parameter +regs is a pointer to the saved registers on the stack, error_code +contains a reason code for the exception. + +exc_page_fault() first obtains the inaccessible address from the CPU +control register CR2. If the address is within the virtual address +space of the process, the fault probably occurred, because the page +was not swapped in, write protected or something similar. However, +we are interested in the other case: the address is not valid, there +is no vma that contains this address. In this case, the kernel jumps +to the bad_area label. + +There it uses the address of the instruction that caused the exception +(i.e. regs->eip) to find an address where the execution can continue +(fixup). If this search is successful, the fault handler modifies the +return address (again regs->eip) and returns. The execution will +continue at the address in fixup. + +Where does fixup point to? + +Since we jump to the contents of fixup, fixup obviously points +to executable code. This code is hidden inside the user access macros. +I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h +as an example. The definition is somewhat hard to follow, so let's peek at +the code generated by the preprocessor and the compiler. I selected +the get_user() call in drivers/char/sysrq.c for a detailed examination. + +The original code in sysrq.c line 587:: + + get_user(c, buf); + +The preprocessor output (edited to become somewhat readable):: + + ( + { + long __gu_err = - 14 , __gu_val = 0; + const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); + if (((((0 + current_set[0])->tss.segment) == 0x18 ) || + (((sizeof(*(buf))) <= 0xC0000000UL) && + ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) + do { + __gu_err = 0; + switch ((sizeof(*(buf)))) { + case 1: + __asm__ __volatile__( + "1: mov" "b" " %2,%" "b" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "b" " %" "b" "1,%" "b" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; + break; + case 2: + __asm__ __volatile__( + "1: mov" "w" " %2,%" "w" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "w" " %" "w" "1,%" "w" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); + break; + case 4: + __asm__ __volatile__( + "1: mov" "l" " %2,%" "" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "l" " %" "" "1,%" "" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); + break; + default: + (__gu_val) = __get_user_bad(); + } + } while (0) ; + ((c)) = (__typeof__(*((buf))))__gu_val; + __gu_err; + } + ); + +WOW! Black GCC/assembly magic. This is impossible to follow, so let's +see what code gcc generates:: + + > xorl %edx,%edx + > movl current_set,%eax + > cmpl $24,788(%eax) + > je .L1424 + > cmpl $-1073741825,64(%esp) + > ja .L1423 + > .L1424: + > movl %edx,%eax + > movl 64(%esp),%ebx + > #APP + > 1: movb (%ebx),%dl /* this is the actual user access */ + > 2: + > .section .fixup,"ax" + > 3: movl $-14,%eax + > xorb %dl,%dl + > jmp 2b + > .section __ex_table,"a" + > .align 4 + > .long 1b,3b + > .text + > #NO_APP + > .L1423: + > movzbl %dl,%esi + +The optimizer does a good job and gives us something we can actually +understand. Can we? The actual user access is quite obvious. Thanks +to the unified address space we can just access the address in user +memory. But what does the .section stuff do????? + +To understand this we have to look at the final kernel:: + + > objdump --section-headers vmlinux + > + > vmlinux: file format elf32-i386 + > + > Sections: + > Idx Name Size VMA LMA File off Algn + > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 + > CONTENTS, ALLOC, LOAD, READONLY, CODE + > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0 + > CONTENTS, ALLOC, LOAD, READONLY, CODE + > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2 + > CONTENTS, ALLOC, LOAD, READONLY, DATA + > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2 + > CONTENTS, ALLOC, LOAD, READONLY, DATA + > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4 + > CONTENTS, ALLOC, LOAD, DATA + > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2 + > ALLOC + > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0 + > CONTENTS, READONLY + > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0 + > CONTENTS, READONLY + +There are obviously 2 non standard ELF sections in the generated object +file. But first we want to find out what happened to our code in the +final kernel executable:: + + > objdump --disassemble --section=.text vmlinux + > + > c017e785 <do_con_write+c1> xorl %edx,%edx + > c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax + > c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax) + > c017e793 <do_con_write+cf> je c017e79f <do_con_write+db> + > c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1) + > c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3> + > c017e79f <do_con_write+db> movl %edx,%eax + > c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx + > c017e7a5 <do_con_write+e1> movb (%ebx),%dl + > c017e7a7 <do_con_write+e3> movzbl %dl,%esi + +The whole user memory access is reduced to 10 x86 machine instructions. +The instructions bracketed in the .section directives are no longer +in the normal execution path. They are located in a different section +of the executable file:: + + > objdump --disassemble --section=.fixup vmlinux + > + > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax + > c0199ffa <.fixup+10ba> xorb %dl,%dl + > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> + +And finally:: + + > objdump --full-contents --section=__ex_table vmlinux + > + > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ + > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ + > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ + +or in human readable byte order:: + + > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................ + > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ + ^^^^^^^^^^^^^^^^^ + this is the interesting part! + > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................ + +What happened? The assembly directives:: + + .section .fixup,"ax" + .section __ex_table,"a" + +told the assembler to move the following code to the specified +sections in the ELF object file. So the instructions:: + + 3: movl $-14,%eax + xorb %dl,%dl + jmp 2b + +ended up in the .fixup section of the object file and the addresses:: + + .long 1b,3b + +ended up in the __ex_table section of the object file. 1b and 3b +are local labels. The local label 1b (1b stands for next label 1 +backward) is the address of the instruction that might fault, i.e. +in our case the address of the label 1 is c017e7a5: +the original assembly code: > 1: movb (%ebx),%dl +and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl + +The local label 3 (backwards again) is the address of the code to handle +the fault, in our case the actual value is c0199ff5: +the original assembly code: > 3: movl $-14,%eax +and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax + +If the fixup was able to handle the exception, control flow may be returned +to the instruction after the one that triggered the fault, ie. local label 2b. + +The assembly code:: + + > .section __ex_table,"a" + > .align 4 + > .long 1b,3b + +becomes the value pair:: + + > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ + ^this is ^this is + 1b 3b + +c017e7a5,c0199ff5 in the exception table of the kernel. + +So, what actually happens if a fault from kernel mode with no suitable +vma occurs? + +#. access to invalid address:: + + > c017e7a5 <do_con_write+e1> movb (%ebx),%dl +#. MMU generates exception +#. CPU calls exc_page_fault() +#. exc_page_fault() calls do_user_addr_fault() +#. do_user_addr_fault() calls kernelmode_fixup_or_oops() +#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5); +#. fixup_exception() calls search_exception_tables() +#. search_exception_tables() looks up the address c017e7a5 in the + exception table (i.e. the contents of the ELF section __ex_table) + and returns the address of the associated fault handle code c0199ff5. +#. fixup_exception() modifies its own return address to point to the fault + handle code and returns. +#. execution continues in the fault handling code. +#. a) EAX becomes -EFAULT (== -14) + b) DL becomes zero (the value we "read" from user space) + c) execution continues at local label 2 (address of the + instruction immediately after the faulting user access). + +The steps 8a to 8c in a certain way emulate the faulting instruction. + +That's it, mostly. If you look at our example, you might ask why +we set EAX to -EFAULT in the exception handler code. Well, the +get_user() macro actually returns a value: 0, if the user access was +successful, -EFAULT on failure. Our original code did not test this +return value, however the inline assembly code in get_user() tries to +return -EFAULT. GCC selected EAX to return this value. + +NOTE: +Due to the way that the exception table is built and needs to be ordered, +only use exceptions for code in the .text section. Any other section +will cause the exception table to not be sorted correctly, and the +exceptions will fail. + +Things changed when 64-bit support was added to x86 Linux. Rather than +double the size of the exception table by expanding the two entries +from 32-bits to 64 bits, a clever trick was used to store addresses +as relative offsets from the table itself. The assembly code changed +from:: + + .long 1b,3b + to: + .long (from) - . + .long (to) - . + +and the C-code that uses these values converts back to absolute addresses +like this:: + + ex_insn_addr(const struct exception_table_entry *x) + { + return (unsigned long)&x->insn + x->insn; + } + +In v4.6 the exception table entry was expanded with a new field "handler". +This is also 32-bits wide and contains a third relative function +pointer which points to one of: + +1) ``int ex_handler_default(const struct exception_table_entry *fixup)`` + This is legacy case that just jumps to the fixup code + +2) ``int ex_handler_fault(const struct exception_table_entry *fixup)`` + This case provides the fault number of the trap that occurred at + entry->insn. It is used to distinguish page faults from machine + check. + +More functions can easily be added. + +CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post +link of the kernel image, via a host utility scripts/sorttable. It will set the +symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section +at boot time. With the exception table sorted, at runtime when an exception +occurs we can quickly lookup the __ex_table entry via binary search. + +This is not just a boot time optimization, some architectures require this +table to be sorted in order to handle exceptions relatively early in the boot +process. For example, i386 makes use of this form of exception handling before +paging support is even enabled! diff --git a/Documentation/arch/x86/features.rst b/Documentation/arch/x86/features.rst new file mode 100644 index 000000000000..b663f15053ce --- /dev/null +++ b/Documentation/arch/x86/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features x86 diff --git a/Documentation/arch/x86/i386/IO-APIC.rst b/Documentation/arch/x86/i386/IO-APIC.rst new file mode 100644 index 000000000000..ce4d8df15e7c --- /dev/null +++ b/Documentation/arch/x86/i386/IO-APIC.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +IO-APIC +======= + +:Author: Ingo Molnar <mingo@kernel.org> + +Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC', +which is an enhanced interrupt controller. It enables us to route +hardware interrupts to multiple CPUs, or to CPU groups. Without an +IO-APIC, interrupts from hardware will be delivered only to the +CPU which boots the operating system (usually CPU#0). + +Linux supports all variants of compliant SMP boards, including ones with +multiple IO-APICs. Multiple IO-APICs are used in high-end servers to +distribute IRQ load further. + +There are (a few) known breakages in certain older boards, such bugs are +usually worked around by the kernel. If your MP-compliant SMP board does +not boot Linux, then consult the linux-smp mailing list archives first. + +If your box boots fine with enabled IO-APIC IRQs, then your +/proc/interrupts will look like this one:: + + hell:~> cat /proc/interrupts + CPU0 + 0: 1360293 IO-APIC-edge timer + 1: 4 IO-APIC-edge keyboard + 2: 0 XT-PIC cascade + 13: 1 XT-PIC fpu + 14: 1448 IO-APIC-edge ide0 + 16: 28232 IO-APIC-level Intel EtherExpress Pro 10/100 Ethernet + 17: 51304 IO-APIC-level eth0 + NMI: 0 + ERR: 0 + hell:~> + +Some interrupts are still listed as 'XT PIC', but this is not a problem; +none of those IRQ sources is performance-critical. + + +In the unlikely case that your board does not create a working mp-table, +you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This +is non-trivial though and cannot be automated. One sample /etc/lilo.conf +entry:: + + append="pirq=15,11,10" + +The actual numbers depend on your system, on your PCI cards and on their +PCI slot position. Usually PCI slots are 'daisy chained' before they are +connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4 +lines):: + + ,-. ,-. ,-. ,-. ,-. + PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| | + |S| \ / |S| \ / |S| \ / |S| |S| + PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l| + |o| \/ |o| \/ |o| \/ |o| |o| + PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t| + |1| /\ |2| /\ |3| /\ |4| |5| + PIRQ1 ----| |- `----| |- `----| |- `----| |--------| | + `-' `-' `-' `-' `-' + +Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:: + + ,-. + INTD--| | + |S| + INTC--|l| + |o| + INTB--|t| + |x| + INTA--| | + `-' + +These INTA-D PCI IRQs are always 'local to the card', their real meaning +depends on which slot they are in. If you look at the daisy chaining diagram, +a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of +the PCI chipset. Most cards issue INTA, this creates optimal distribution +between the PIRQ lines. (distributing IRQ sources properly is not a +necessity, PCI IRQs can be shared at will, but it's a good for performance +to have non shared interrupts). Slot5 should be used for videocards, they +do not use interrupts normally, thus they are not daisy chained either. + +so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in +Slot2, then you'll have to specify this pirq= line:: + + append="pirq=11,9" + +the following script tries to figure out such a default pirq= line from +your PCI configuration:: + + echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g' + +note that this script won't work if you have skipped a few slots or if your +board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins +connected in some strange way). E.g. if in the above case you have your SCSI +card (IRQ11) in Slot3, and have Slot1 empty:: + + append="pirq=0,9,11" + +[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting) +slots.] + +Generally, it's always possible to find out the correct pirq= settings, just +permute all IRQ numbers properly ... it will take some time though. An +'incorrect' pirq line will cause the booting process to hang, or a device +won't function properly (e.g. if it's inserted as a module). + +If you have 2 PCI buses, then you can use up to 8 pirq values, although such +boards tend to have a good configuration. + +Be prepared that it might happen that you need some strange pirq line:: + + append="pirq=0,0,0,0,0,0,9,11" + +Use smart trial-and-error techniques to find out the correct pirq line ... + +Good luck and mail to linux-smp@vger.kernel.org or +linux-kernel@vger.kernel.org if you have any problems that are not covered +by this document. + diff --git a/Documentation/arch/x86/i386/index.rst b/Documentation/arch/x86/i386/index.rst new file mode 100644 index 000000000000..8747cf5bbd49 --- /dev/null +++ b/Documentation/arch/x86/i386/index.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +i386 Support +============ + +.. toctree:: + :maxdepth: 2 + + IO-APIC diff --git a/Documentation/arch/x86/ifs.rst b/Documentation/arch/x86/ifs.rst new file mode 100644 index 000000000000..97abb696a680 --- /dev/null +++ b/Documentation/arch/x86/ifs.rst @@ -0,0 +1,2 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. kernel-doc:: drivers/platform/x86/intel/ifs/ifs.h diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst new file mode 100644 index 000000000000..c73d133fd37c --- /dev/null +++ b/Documentation/arch/x86/index.rst @@ -0,0 +1,44 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +x86-specific Documentation +========================== + +.. toctree:: + :maxdepth: 2 + :numbered: + + boot + booting-dt + cpuinfo + topology + exception-tables + kernel-stacks + entry_64 + earlyprintk + orc-unwinder + zero-page + tlb + mtrr + pat + intel-hfi + iommu + intel_txt + amd-memory-encryption + amd_hsmp + tdx + pti + mds + microcode + resctrl + tsx_async_abort + buslock + usb-legacy-support + i386/index + x86_64/index + ifs + sva + sgx + features + elf_auxvec + xstate diff --git a/Documentation/arch/x86/intel-hfi.rst b/Documentation/arch/x86/intel-hfi.rst new file mode 100644 index 000000000000..49dea58ea4fb --- /dev/null +++ b/Documentation/arch/x86/intel-hfi.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Hardware-Feedback Interface for scheduling on Intel Hardware +============================================================ + +Overview +-------- + +Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and +IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section +14.6 [1]_. + +The HFI gives the operating system a performance and energy efficiency +capability data for each CPU in the system. Linux can use the information from +the HFI to influence task placement decisions. + +The Hardware Feedback Interface +------------------------------- + +The Hardware Feedback Interface provides to the operating system information +about the performance and energy efficiency of each CPU in the system. Each +capability is given as a unit-less quantity in the range [0-255]. Higher values +indicate higher capability. Energy efficiency and performance are reported in +separate capabilities. Even though on some systems these two metrics may be +related, they are specified as independent capabilities in the Intel SDM. + +These capabilities may change at runtime as a result of changes in the +operating conditions of the system or the action of external factors. The rate +at which these capabilities are updated is specific to each processor model. On +some models, capabilities are set at boot time and never change. On others, +capabilities may change every tens of milliseconds. For instance, a remote +mechanism may be used to lower Thermal Design Power. Such change can be +reflected in the HFI. Likewise, if the system needs to be throttled due to +excessive heat, the HFI may reflect reduced performance on specific CPUs. + +The kernel or a userspace policy daemon can use these capabilities to modify +task placement decisions. For instance, if either the performance or energy +capabilities of a given logical processor becomes zero, it is an indication that +the hardware recommends to the operating system to not schedule any tasks on +that processor for performance or energy efficiency reasons, respectively. + +Implementation details for Linux +-------------------------------- + +The infrastructure to handle thermal event interrupts has two parts. In the +Local Vector Table of a CPU's local APIC, there exists a register for the +Thermal Monitor Register. This register controls how interrupts are delivered +to a CPU when the thermal monitor generates and interrupt. Further details +can be found in the Intel SDM Vol. 3 Section 10.5 [1]_. + +The thermal monitor may generate interrupts per CPU or per package. The HFI +generates package-level interrupts. This monitor is configured and initialized +via a set of machine-specific registers. Specifically, the HFI interrupt and +status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT +and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI +table per package. Further details can be found in the Intel SDM Vol. 3 +Section 14.9 [1]_. + +The hardware issues an HFI interrupt after updating the HFI table and is ready +for the operating system to consume it. CPUs receive such interrupt via the +thermal entry in the Local APIC's Local Vector Table. + +When servicing such interrupt, the HFI driver parses the updated table and +relays the update to userspace using the thermal notification framework. Given +that there may be many HFI updates every second, the updates relayed to +userspace are throttled at a rate of CONFIG_HZ jiffies. + +References +---------- + +.. [1] https://www.intel.com/sdm diff --git a/Documentation/arch/x86/intel_txt.rst b/Documentation/arch/x86/intel_txt.rst new file mode 100644 index 000000000000..d83c1a2122c9 --- /dev/null +++ b/Documentation/arch/x86/intel_txt.rst @@ -0,0 +1,227 @@ +===================== +Intel(R) TXT Overview +===================== + +Intel's technology for safer computing, Intel(R) Trusted Execution +Technology (Intel(R) TXT), defines platform-level enhancements that +provide the building blocks for creating trusted platforms. + +Intel TXT was formerly known by the code name LaGrande Technology (LT). + +Intel TXT in Brief: + +- Provides dynamic root of trust for measurement (DRTM) +- Data protection in case of improper shutdown +- Measurement and verification of launched environment + +Intel TXT is part of the vPro(TM) brand and is also available some +non-vPro systems. It is currently available on desktop systems +based on the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell +Optiplex 755, HP dc7800, etc.) and mobile systems based on the GM45, +PM45, and GS45 Express chipsets. + +For more information, see http://www.intel.com/technology/security/. +This site also has a link to the Intel TXT MLE Developers Manual, +which has been updated for the new released platforms. + +Intel TXT has been presented at various events over the past few +years, some of which are: + + - LinuxTAG 2008: + http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html + + - TRUST2008: + http://www.trust-conference.eu/downloads/Keynote-Speakers/ + 3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf + + - IDF, Shanghai: + http://www.prcidf.com.cn/index_en.html + + - IDFs 2006, 2007 + (I'm not sure if/where they are online) + +Trusted Boot Project Overview +============================= + +Trusted Boot (tboot) is an open source, pre-kernel/VMM module that +uses Intel TXT to perform a measured and verified launch of an OS +kernel/VMM. + +It is hosted on SourceForge at http://sourceforge.net/projects/tboot. +The mercurial source repo is available at http://www.bughost.org/ +repos.hg/tboot.hg. + +Tboot currently supports launching Xen (open source VMM/hypervisor +w/ TXT support since v3.2), and now Linux kernels. + + +Value Proposition for Linux or "Why should you care?" +===================================================== + +While there are many products and technologies that attempt to +measure or protect the integrity of a running kernel, they all +assume the kernel is "good" to begin with. The Integrity +Measurement Architecture (IMA) and Linux Integrity Module interface +are examples of such solutions. + +To get trust in the initial kernel without using Intel TXT, a +static root of trust must be used. This bases trust in BIOS +starting at system reset and requires measurement of all code +executed between system reset through the completion of the kernel +boot as well as data objects used by that code. In the case of a +Linux kernel, this means all of BIOS, any option ROMs, the +bootloader and the boot config. In practice, this is a lot of +code/data, much of which is subject to change from boot to boot +(e.g. changing NICs may change option ROMs). Without reference +hashes, these measurement changes are difficult to assess or +confirm as benign. This process also does not provide DMA +protection, memory configuration/alias checks and locks, crash +protection, or policy support. + +By using the hardware-based root of trust that Intel TXT provides, +many of these issues can be mitigated. Specifically: many +pre-launch components can be removed from the trust chain, DMA +protection is provided to all launched components, a large number +of platform configuration checks are performed and values locked, +protection is provided for any data in the event of an improper +shutdown, and there is support for policy-based execution/verification. +This provides a more stable measurement and a higher assurance of +system configuration and initial state than would be otherwise +possible. Since the tboot project is open source, source code for +almost all parts of the trust chain is available (excepting SMM and +Intel-provided firmware). + +How Does it Work? +================= + +- Tboot is an executable that is launched by the bootloader as + the "kernel" (the binary the bootloader executes). +- It performs all of the work necessary to determine if the + platform supports Intel TXT and, if so, executes the GETSEC[SENTER] + processor instruction that initiates the dynamic root of trust. + + - If tboot determines that the system does not support Intel TXT + or is not configured correctly (e.g. the SINIT AC Module was + incorrect), it will directly launch the kernel with no changes + to any state. + - Tboot will output various information about its progress to the + terminal, serial port, and/or an in-memory log; the output + locations can be configured with a command line switch. + +- The GETSEC[SENTER] instruction will return control to tboot and + tboot then verifies certain aspects of the environment (e.g. TPM NV + lock, e820 table does not have invalid entries, etc.). +- It will wake the APs from the special sleep state the GETSEC[SENTER] + instruction had put them in and place them into a wait-for-SIPI + state. + + - Because the processors will not respond to an INIT or SIPI when + in the TXT environment, it is necessary to create a small VT-x + guest for the APs. When they run in this guest, they will + simply wait for the INIT-SIPI-SIPI sequence, which will cause + VMEXITs, and then disable VT and jump to the SIPI vector. This + approach seemed like a better choice than having to insert + special code into the kernel's MP wakeup sequence. + +- Tboot then applies an (optional) user-defined launch policy to + verify the kernel and initrd. + + - This policy is rooted in TPM NV and is described in the tboot + project. The tboot project also contains code for tools to + create and provision the policy. + - Policies are completely under user control and if not present + then any kernel will be launched. + - Policy action is flexible and can include halting on failures + or simply logging them and continuing. + +- Tboot adjusts the e820 table provided by the bootloader to reserve + its own location in memory as well as to reserve certain other + TXT-related regions. +- As part of its launch, tboot DMA protects all of RAM (using the + VT-d PMRs). Thus, the kernel must be booted with 'intel_iommu=on' + in order to remove this blanket protection and use VT-d's + page-level protection. +- Tboot will populate a shared page with some data about itself and + pass this to the Linux kernel as it transfers control. + + - The location of the shared page is passed via the boot_params + struct as a physical address. + +- The kernel will look for the tboot shared page address and, if it + exists, map it. +- As one of the checks/protections provided by TXT, it makes a copy + of the VT-d DMARs in a DMA-protected region of memory and verifies + them for correctness. The VT-d code will detect if the kernel was + launched with tboot and use this copy instead of the one in the + ACPI table. +- At this point, tboot and TXT are out of the picture until a + shutdown (S<n>) +- In order to put a system into any of the sleep states after a TXT + launch, TXT must first be exited. This is to prevent attacks that + attempt to crash the system to gain control on reboot and steal + data left in memory. + + - The kernel will perform all of its sleep preparation and + populate the shared page with the ACPI data needed to put the + platform in the desired sleep state. + - Then the kernel jumps into tboot via the vector specified in the + shared page. + - Tboot will clean up the environment and disable TXT, then use the + kernel-provided ACPI information to actually place the platform + into the desired sleep state. + - In the case of S3, tboot will also register itself as the resume + vector. This is necessary because it must re-establish the + measured environment upon resume. Once the TXT environment + has been restored, it will restore the TPM PCRs and then + transfer control back to the kernel's S3 resume vector. + In order to preserve system integrity across S3, the kernel + provides tboot with a set of memory ranges (RAM and RESERVED_KERN + in the e820 table, but not any memory that BIOS might alter over + the S3 transition) that tboot will calculate a MAC (message + authentication code) over and then seal with the TPM. On resume + and once the measured environment has been re-established, tboot + will re-calculate the MAC and verify it against the sealed value. + Tboot's policy determines what happens if the verification fails. + Note that the c/s 194 of tboot which has the new MAC code supports + this. + +That's pretty much it for TXT support. + + +Configuring the System +====================== + +This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels. + +In BIOS, the user must enable: TPM, TXT, VT-x, VT-d. Not all BIOSes +allow these to be individually enabled/disabled and the screens in +which to find them are BIOS-specific. + +grub.conf needs to be modified as follows:: + + title Linux 2.6.29-tip w/ tboot + root (hd0,0) + kernel /tboot.gz logging=serial,vga,memory + module /vmlinuz-2.6.29-tip intel_iommu=on ro + root=LABEL=/ rhgb console=ttyS0,115200 3 + module /initrd-2.6.29-tip.img + module /Q35_SINIT_17.BIN + +The kernel option for enabling Intel TXT support is found under the +Security top-level menu and is called "Enable Intel(R) Trusted +Execution Technology (TXT)". It is considered EXPERIMENTAL and +depends on the generic x86 support (to allow maximum flexibility in +kernel build options), since the tboot code will detect whether the +platform actually supports Intel TXT and thus whether any of the +kernel code is executed. + +The Q35_SINIT_17.BIN file is what Intel TXT refers to as an +Authenticated Code Module. It is specific to the chipset in the +system and can also be found on the Trusted Boot site. It is an +(unencrypted) module signed by Intel that is used as part of the +DRTM process to verify and configure the system. It is signed +because it operates at a higher privilege level in the system than +any other macrocode and its correct operation is critical to the +establishment of the DRTM. The process for determining the correct +SINIT ACM for a system is documented in the SINIT-guide.txt file +that is on the tboot SourceForge site under the SINIT ACM downloads. diff --git a/Documentation/arch/x86/iommu.rst b/Documentation/arch/x86/iommu.rst new file mode 100644 index 000000000000..42c7a6faa39a --- /dev/null +++ b/Documentation/arch/x86/iommu.rst @@ -0,0 +1,151 @@ +================= +x86 IOMMU Support +================= + +The architecture specs can be obtained from the below locations. + +- Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf +- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf + +This guide gives a quick cheat sheet for some basic understanding. + +Basic stuff +----------- + +ACPI enumerates and lists the different IOMMUs on the platform, and +device scope relationships between devices and which IOMMU controls +them. + +Some ACPI Keywords: + +- DMAR - Intel DMA Remapping table +- DRHD - Intel DMA Remapping Hardware Unit Definition +- RMRR - Intel Reserved Memory Region Reporting Structure +- IVRS - AMD I/O Virtualization Reporting Structure +- IVDB - AMD I/O Virtualization Definition Block +- IVHD - AMD I/O Virtualization Hardware Definition + +What is Intel RMRR? +^^^^^^^^^^^^^^^^^^^ + +There are some devices the BIOS controls, for e.g USB devices to perform +PS2 emulation. The regions of memory used for these devices are marked +reserved in the e820 map. When we turn on DMA translation, DMA to those +regions will fail. Hence BIOS uses RMRR to specify these regions along with +devices that need to access these regions. OS is expected to setup +unity mappings for these regions for these devices to access these regions. + +What is AMD IVRS? +^^^^^^^^^^^^^^^^^ + +The architecture defines an ACPI-compatible data structure called an I/O +Virtualization Reporting Structure (IVRS) that is used to convey information +related to I/O virtualization to system software. The IVRS describes the +configuration and capabilities of the IOMMUs contained in the platform as +well as information about the devices that each IOMMU virtualizes. + +The IVRS provides information about the following: + +- IOMMUs present in the platform including their capabilities and proper configuration +- System I/O topology relevant to each IOMMU +- Peripheral devices that cannot be otherwise enumerated +- Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software. + +How is an I/O Virtual Address (IOVA) generated? +----------------------------------------------- + +Well behaved drivers call dma_map_*() calls before sending command to device +that needs to perform DMA. Once DMA is completed and mapping is no longer +required, driver performs dma_unmap_*() calls to unmap the region. + +Intel Specific Notes +-------------------- + +Graphics Problems? +^^^^^^^^^^^^^^^^^^ + +If you encounter issues with graphics devices, you can try adding +option intel_iommu=igfx_off to turn off the integrated graphics engine. +If this fixes anything, please ensure you file a bug reporting the problem. + +Some exceptions to IOVA +^^^^^^^^^^^^^^^^^^^^^^^ + +Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). +The same is true for peer to peer transactions. Hence we reserve the +address from PCI MMIO ranges so they are not allocated for IOVA addresses. + +AMD Specific Notes +------------------ + +Graphics Problems? +^^^^^^^^^^^^^^^^^^ + +If you encounter issues with integrated graphics devices, you can try adding +option iommu=pt to the kernel command line use a 1:1 mapping for the IOMMU. If +this fixes anything, please ensure you file a bug reporting the problem. + +Fault reporting +--------------- +When errors are reported, the IOMMU signals via an interrupt. The fault +reason and device that caused it is printed on the console. + + +Kernel Log Samples +------------------ + +Intel Boot Messages +^^^^^^^^^^^^^^^^^^^ + +Something like this gets printed indicating presence of DMAR tables +in ACPI: + +:: + + ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 + +When DMAR is being processed and initialized by ACPI, prints DMAR locations +and any RMRR's processed: + +:: + + ACPI DMAR:Host address width 36 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 + ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 + ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff + ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff + +When DMAR is enabled for use, you will notice: + +:: + + PCI-DMA: Using DMAR IOMMU + +Intel Fault reporting +^^^^^^^^^^^^^^^^^^^^^ + +:: + + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set + +AMD Boot Messages +^^^^^^^^^^^^^^^^^ + +Something like this gets printed indicating presence of the IOMMU: + +:: + + iommu: Default domain type: Translated + iommu: DMA domain TLB invalidation policy: lazy mode + +AMD Fault reporting +^^^^^^^^^^^^^^^^^^^ + +:: + + AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000] + AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000] diff --git a/Documentation/arch/x86/kernel-stacks.rst b/Documentation/arch/x86/kernel-stacks.rst new file mode 100644 index 000000000000..6b0bcf027ff1 --- /dev/null +++ b/Documentation/arch/x86/kernel-stacks.rst @@ -0,0 +1,152 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Kernel Stacks +============= + +Kernel stacks on x86-64 bit +=========================== + +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each CPU. These stacks are only used while the kernel +is in control on that CPU; when a CPU returns to user space the +specialized stacks contain no useful data. The main CPU stacks are: + +* Interrupt stack. IRQ_STACK_SIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386, this gives more room + for kernel interrupt processing without having to increase the size + of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per CPU. The IST code is an +index into the Task State Segment (TSS). The IST entries in the TSS +point to dedicated stacks; each stack can be a different size. + +An IST is selected by a non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +the interrupt came from user mode, then the interrupt handler prologue +will switch back to the per-thread stack. If software wants to allow +nested IST interrupts then the handler must adjust the IST values on +entry to and exit from the interrupt handler. (This is occasionally +done, e.g. for debug exceptions.) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are: + +* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling one exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt). + Using a separate stack allows the kernel to recover from it well enough + in many cases to still output an oops. + +* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + + To handle nested #DB correctly there exist two instances of DB stacks. On + #DB entry the IST stackpointer for #DB is switched to the second instance + so a nested #DB starts from a clean stack. The nested #DB switches + the IST stackpointer to a guard hole to catch triple nesting. + +* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals. + + +Printing backtraces on x86 +========================== + +The question about the '?' preceding function names in an x86 stacktrace +keeps popping up, here's an indepth explanation. It helps if the reader +stares at print_context_stack() and the whole machinery in and around +arch/x86/kernel/dumpstack.c. + +Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: + +We always scan the full kernel stack for return addresses stored on +the kernel stack(s) [1]_, from stack top to stack bottom, and print out +anything that 'looks like' a kernel text address. + +If it fits into the frame pointer chain, we print it without a question +mark, knowing that it's part of the real backtrace. + +If the address does not fit into our expected frame pointer chain we +still print it, but we print a '?'. It can mean two things: + + - either the address is not part of the call chain: it's just stale + values on the kernel stack, from earlier function calls. This is + the common case. + + - or it is part of the call chain, but the frame pointer was not set + up properly within the function, so we don't recognize it. + +This way we will always print out the real call chain (plus a few more +entries), regardless of whether the frame pointer was set up correctly +or not - but in most cases we'll get the call chain right as well. The +entries printed are strictly in stack order, so you can deduce more +information from that as well. + +The most important property of this method is that we _never_ lose +information: we always strive to print _all_ addresses on the stack(s) +that look like kernel text addresses, so if debug information is wrong, +we still print out the real call chain as well - just with more question +marks than ideal. + +.. [1] For things like IRQ and IST stacks, we also scan those stacks, in + the right order, and try to cross from one stack into another + reconstructing the call chain. This works most of the time. diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst new file mode 100644 index 000000000000..5d4330be200f --- /dev/null +++ b/Documentation/arch/x86/mds.rst @@ -0,0 +1,193 @@ +Microarchitectural Data Sampling (MDS) mitigation +================================================= + +.. _mds: + +Overview +-------- + +Microarchitectural Data Sampling (MDS) is a family of side channel attacks +on internal buffers in Intel CPUs. The variants are: + + - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126) + - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130) + - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127) + - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091) + +MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a +dependent load (store-to-load forwarding) as an optimization. The forward +can also happen to a faulting or assisting load operation for a different +memory address, which can be exploited under certain conditions. Store +buffers are partitioned between Hyper-Threads so cross thread forwarding is +not possible. But if a thread enters or exits a sleep state the store +buffer is repartitioned which can expose data from one thread to the other. + +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage +L1 miss situations and to hold data which is returned or sent in response +to a memory or I/O operation. Fill buffers can forward data to a load +operation and also write data to the cache. When the fill buffer is +deallocated it can retain the stale data of the preceding operations which +can then be forwarded to a faulting or assisting load operation, which can +be exploited under certain conditions. Fill buffers are shared between +Hyper-Threads so cross thread leakage is possible. + +MLPDS leaks Load Port Data. Load ports are used to perform load operations +from memory or I/O. The received data is then forwarded to the register +file or a subsequent operation. In some implementations the Load Port can +contain stale data from a previous operation which can be forwarded to +faulting or assisting loads under certain conditions, which again can be +exploited eventually. Load ports are shared between Hyper-Threads so cross +thread leakage is possible. + +MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from +memory that takes a fault or assist can leave data in a microarchitectural +structure that may later be observed using one of the same methods used by +MSBDS, MFBDS or MLPDS. + +Exposure assumptions +-------------------- + +It is assumed that attack code resides in user space or in a guest with one +exception. The rationale behind this assumption is that the code construct +needed for exploiting MDS requires: + + - to control the load to trigger a fault or assist + + - to have a disclosure gadget which exposes the speculatively accessed + data for consumption through a side channel. + + - to control the pointer through which the disclosure gadget exposes the + data + +The existence of such a construct in the kernel cannot be excluded with +100% certainty, but the complexity involved makes it extremly unlikely. + +There is one exception, which is untrusted BPF. The functionality of +untrusted BPF is limited, but it needs to be thoroughly investigated +whether it can be used to create such a construct. + + +Mitigation strategy +------------------- + +All variants have the same mitigation strategy at least for the single CPU +thread case (SMT off): Force the CPU to clear the affected buffers. + +This is achieved by using the otherwise unused and obsolete VERW +instruction in combination with a microcode update. The microcode clears +the affected CPU buffers when the VERW instruction is executed. + +For virtualization there are two ways to achieve CPU buffer +clearing. Either the modified VERW instruction or via the L1D Flush +command. The latter is issued when L1TF mitigation is enabled so the extra +VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to +be issued. + +If the VERW instruction with the supplied segment selector argument is +executed on a CPU without the microcode update there is no side effect +other than a small number of pointlessly wasted CPU cycles. + +This does not protect against cross Hyper-Thread attacks except for MSBDS +which is only exploitable cross Hyper-thread when one of the Hyper-Threads +enters a C-state. + +The kernel provides a function to invoke the buffer clearing: + + mds_clear_cpu_buffers() + +The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state +(idle) transitions. + +As a special quirk to address virtualization scenarios where the host has +the microcode updated, but the hypervisor does not (yet) expose the +MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the +hope that it might actually clear the buffers. The state is reflected +accordingly. + +According to current knowledge additional mitigations inside the kernel +itself are not required because the necessary gadgets to expose the leaked +data cannot be controlled in a way which allows exploitation from malicious +user space or VM guests. + +Kernel internal mitigation modes +-------------------------------- + + ======= ============================================================ + off Mitigation is disabled. Either the CPU is not affected or + mds=off is supplied on the kernel command line + + full Mitigation is enabled. CPU is affected and MD_CLEAR is + advertised in CPUID. + + vmwerv Mitigation is enabled. CPU is affected and MD_CLEAR is not + advertised in CPUID. That is mainly for virtualization + scenarios where the host has the updated microcode but the + hypervisor does not expose MD_CLEAR in CPUID. It's a best + effort approach without guarantee. + ======= ============================================================ + +If the CPU is affected and mds=off is not supplied on the kernel command +line then the kernel selects the appropriate mitigation mode depending on +the availability of the MD_CLEAR CPUID bit. + +Mitigation points +----------------- + +1. Return to user space +^^^^^^^^^^^^^^^^^^^^^^^ + + When transitioning from kernel to user space the CPU buffers are flushed + on affected CPUs when the mitigation is not disabled on the kernel + command line. The migitation is enabled through the static key + mds_user_clear. + + The mitigation is invoked in prepare_exit_to_usermode() which covers + all but one of the kernel to user space transitions. The exception + is when we return from a Non Maskable Interrupt (NMI), which is + handled directly in do_nmi(). + + (The reason that NMI is special is that prepare_exit_to_usermode() can + enable IRQs. In NMI context, NMIs are blocked, and we don't want to + enable IRQs with NMIs blocked.) + + +2. C-State transition +^^^^^^^^^^^^^^^^^^^^^ + + When a CPU goes idle and enters a C-State the CPU buffers need to be + cleared on affected CPUs when SMT is active. This addresses the + repartitioning of the store buffer when one of the Hyper-Threads enters + a C-State. + + When SMT is inactive, i.e. either the CPU does not support it or all + sibling threads are offline CPU buffer clearing is not required. + + The idle clearing is enabled on CPUs which are only affected by MSBDS + and not by any other MDS variant. The other MDS variants cannot be + protected against cross Hyper-Thread attacks because the Fill Buffer and + the Load Ports are shared. So on CPUs affected by other variants, the + idle clearing would be a window dressing exercise and is therefore not + activated. + + The invocation is controlled by the static key mds_idle_clear which is + switched depending on the chosen mitigation mode and the SMT state of + the system. + + The buffer clear is only invoked before entering the C-State to prevent + that stale data from the idling CPU from spilling to the Hyper-Thread + sibling after the store buffer got repartitioned and all entries are + available to the non idle sibling. + + When coming out of idle the store buffer is partitioned again so each + sibling has half of it available. The back from idle CPU could be then + speculatively exposed to contents of the sibling. The buffers are + flushed either on exit to user space or on VMENTER so malicious code + in user space or the guest cannot speculatively access them. + + The mitigation is hooked into all variants of halt()/mwait(), but does + not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver + has been superseded by the intel_idle driver around 2010 and is + preferred on all affected CPUs which are expected to gain the MD_CLEAR + functionality in microcode. Aside of that the IO-Port mechanism is a + legacy interface which is only used on older systems which are either + not affected or do not receive microcode updates anymore. diff --git a/Documentation/arch/x86/microcode.rst b/Documentation/arch/x86/microcode.rst new file mode 100644 index 000000000000..b627c6f36bcf --- /dev/null +++ b/Documentation/arch/x86/microcode.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +The Linux Microcode Loader +========================== + +:Authors: - Fenghua Yu <fenghua.yu@intel.com> + - Borislav Petkov <bp@suse.de> + - Ashok Raj <ashok.raj@intel.com> + +The kernel has a x86 microcode loading facility which is supposed to +provide microcode loading methods in the OS. Potential use cases are +updating the microcode on platforms beyond the OEM End-Of-Life support, +and updating the microcode on long-running systems without rebooting. + +The loader supports three loading methods: + +Early load microcode +==================== + +The kernel can update microcode very early during boot. Loading +microcode early can fix CPU issues before they are observed during +kernel boot time. + +The microcode is stored in an initrd file. During boot, it is read from +it and loaded into the CPU cores. + +The format of the combined initrd image is microcode in (uncompressed) +cpio format followed by the (possibly compressed) initrd image. The +loader parses the combined initrd image during boot. + +The microcode files in cpio name space are: + +on Intel: + kernel/x86/microcode/GenuineIntel.bin +on AMD : + kernel/x86/microcode/AuthenticAMD.bin + +During BSP (BootStrapping Processor) boot (pre-SMP), the kernel +scans the microcode file in the initrd. If microcode matching the +CPU is found, it will be applied in the BSP and later on in all APs +(Application Processors). + +The loader also saves the matching microcode for the CPU in memory. +Thus, the cached microcode patch is applied when CPUs resume from a +sleep state. + +Here's a crude example how to prepare an initrd with microcode (this is +normally done automatically by the distribution, when recreating the +initrd, so you don't really have to do it yourself. It is documented +here for future reference only). +:: + + #!/bin/bash + + if [ -z "$1" ]; then + echo "You need to supply an initrd file" + exit 1 + fi + + INITRD="$1" + + DSTDIR=kernel/x86/microcode + TMPDIR=/tmp/initrd + + rm -rf $TMPDIR + + mkdir $TMPDIR + cd $TMPDIR + mkdir -p $DSTDIR + + if [ -d /lib/firmware/amd-ucode ]; then + cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin + fi + + if [ -d /lib/firmware/intel-ucode ]; then + cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin + fi + + find . | cpio -o -H newc >../ucode.cpio + cd .. + mv $INITRD $INITRD.orig + cat ucode.cpio $INITRD.orig > $INITRD + + rm -rf $TMPDIR + + +The system needs to have the microcode packages installed into +/lib/firmware or you need to fixup the paths above if yours are +somewhere else and/or you've downloaded them directly from the processor +vendor's site. + +Late loading +============ + +You simply install the microcode packages your distro supplies and +run:: + + # echo 1 > /sys/devices/system/cpu/microcode/reload + +as root. + +The loading mechanism looks for microcode blobs in +/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation +packages already put them there. + +Since kernel 5.19, late loading is not enabled by default. + +The /dev/cpu/microcode method has been removed in 5.19. + +Why is late loading dangerous? +============================== + +Synchronizing all CPUs +---------------------- + +The microcode engine which receives the microcode update is shared +between the two logical threads in a SMT system. Therefore, when +the update is executed on one SMT thread of the core, the sibling +"automatically" gets the update. + +Since the microcode can "simulate" MSRs too, while the microcode update +is in progress, those simulated MSRs transiently cease to exist. This +can result in unpredictable results if the SMT sibling thread happens to +be in the middle of an access to such an MSR. The usual observation is +that such MSR accesses cause #GPs to be raised to signal that former are +not present. + +The disappearing MSRs are just one common issue which is being observed. +Any other instruction that's being patched and gets concurrently +executed by the other SMT sibling, can also result in similar, +unpredictable behavior. + +To eliminate this case, a stop_machine()-based CPU synchronization was +introduced as a way to guarantee that all logical CPUs will not execute +any code but just wait in a spin loop, polling an atomic variable. + +While this took care of device or external interrupts, IPIs including +LVT ones, such as CMCI etc, it cannot address other special interrupts +that can't be shut off. Those are Machine Check (#MC), System Management +(#SMI) and Non-Maskable interrupts (#NMI). + +Machine Checks +-------------- + +Machine Checks (#MC) are non-maskable. There are two kinds of MCEs. +Fatal un-recoverable MCEs and recoverable MCEs. While un-recoverable +errors are fatal, recoverable errors can also happen in kernel context +are also treated as fatal by the kernel. + +On certain Intel machines, MCEs are also broadcast to all threads in a +system. If one thread is in the middle of executing WRMSR, a MCE will be +taken at the end of the flow. Either way, they will wait for the thread +performing the wrmsr(0x79) to rendezvous in the MCE handler and shutdown +eventually if any of the threads in the system fail to check in to the +MCE rendezvous. + +To be paranoid and get predictable behavior, the OS can choose to set +MCG_STATUS.MCIP. Since MCEs can be at most one in a system, if an +MCE was signaled, the above condition will promote to a system reset +automatically. OS can turn off MCIP at the end of the update for that +core. + +System Management Interrupt +--------------------------- + +SMIs are also broadcast to all CPUs in the platform. Microcode update +requests exclusive access to the core before writing to MSR 0x79. So if +it does happen such that, one thread is in WRMSR flow, and the 2nd got +an SMI, that thread will be stopped in the first instruction in the SMI +handler. + +Since the secondary thread is stopped in the first instruction in SMI, +there is very little chance that it would be in the middle of executing +an instruction being patched. Plus OS has no way to stop SMIs from +happening. + +Non-Maskable Interrupts +----------------------- + +When thread0 of a core is doing the microcode update, if thread1 is +pulled into NMI, that can cause unpredictable behavior due to the +reasons above. + +OS can choose a variety of methods to avoid running into this situation. + + +Is the microcode suitable for late loading? +------------------------------------------- + +Late loading is done when the system is fully operational and running +real workloads. Late loading behavior depends on what the base patch on +the CPU is before upgrading to the new patch. + +This is true for Intel CPUs. + +Consider, for example, a CPU has patch level 1 and the update is to +patch level 3. + +Between patch1 and patch3, patch2 might have deprecated a software-visible +feature. + +This is unacceptable if software is even potentially using that feature. +For instance, say MSR_X is no longer available after an update, +accessing that MSR will cause a #GP fault. + +Basically there is no way to declare a new microcode update suitable +for late-loading. This is another one of the problems that caused late +loading to be not enabled by default. + +Builtin microcode +================= + +The loader supports also loading of a builtin microcode supplied through +the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is +currently supported. + +Here's an example:: + + CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" + CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" + +This basically means, you have the following tree structure locally:: + + /lib/firmware/ + |-- amd-ucode + ... + | |-- microcode_amd_fam15h.bin + ... + |-- intel-ucode + ... + | |-- 06-3a-09 + ... + +so that the build system can find those files and integrate them into +the final kernel image. The early loader finds them and applies them. + +Needless to say, this method is not the most flexible one because it +requires rebuilding the kernel each time updated microcode from the CPU +vendor is available. diff --git a/Documentation/arch/x86/mtrr.rst b/Documentation/arch/x86/mtrr.rst new file mode 100644 index 000000000000..f65ef034da7a --- /dev/null +++ b/Documentation/arch/x86/mtrr.rst @@ -0,0 +1,354 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +MTRR (Memory Type Range Register) control +========================================= + +:Authors: - Richard Gooch <rgooch@atnf.csiro.au> - 3 Jun 1999 + - Luis R. Rodriguez <mcgrof@do-not-panic.com> - April 9, 2015 + + +Phasing out MTRR use +==================== + +MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by +drivers on Linux is now completely phased out, device drivers should use +arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on +non-PAT systems while a no-op but equally effective on PAT enabled systems. + +Even if Linux does not use MTRRs directly, some x86 platform firmware may still +set up MTRRs early before booting the OS. They do this as some platform +firmware may still have implemented access to MTRRs which would be controlled +and handled by the platform firmware directly. An example of platform use of +MTRRs is through the use of SMI handlers, one case could be for fan control, +the platform code would need uncachable access to some of its fan control +registers. Such platform access does not need any Operating System MTRR code in +place other than mtrr_type_lookup() to ensure any OS specific mapping requests +are aligned with platform MTRR setup. If MTRRs are only set up by the platform +firmware code though and the OS does not make any specific MTRR mapping +requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID. + +For details refer to Documentation/arch/x86/pat.rst. + +.. tip:: + On Intel P6 family processors (Pentium Pro, Pentium II and later) + the Memory Type Range Registers (MTRRs) may be used to control + processor access to memory ranges. This is most useful when you have + a video (VGA) card on a PCI or AGP bus. Enabling write-combining + allows bus write transfers to be combined into a larger transfer + before bursting over the PCI/AGP bus. This can increase performance + of image write operations 2.5 times or more. + + The Cyrix 6x86, 6x86MX and M II processors have Address Range + Registers (ARRs) which provide a similar functionality to MTRRs. For + these, the ARRs are used to emulate the MTRRs. + + The AMD K6-2 (stepping 8 and above) and K6-3 processors have two + MTRRs. These are supported. The AMD Athlon family provide 8 Intel + style MTRRs. + + The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These + are supported. + + The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs. + + The CONFIG_MTRR option creates a /proc/mtrr file which may be used + to manipulate your MTRRs. Typically the X server should use + this. This should have a reasonably generic interface so that + similar control registers on other processors can be easily + supported. + +There are two interfaces to /proc/mtrr: one is an ASCII interface +which allows you to read and write. The other is an ioctl() +interface. The ASCII interface is meant for administration. The +ioctl() interface is meant for C programs (i.e. the X server). The +interfaces are described below, with sample commands and C code. + + +Reading MTRRs from the shell +============================ +:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 + reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 + +Creating MTRRs from the C-shell:: + + # echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr + +or if you use bash:: + + # echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr + +And the result thereof:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 + reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 + reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1 + +This is for video RAM at base address 0xf8000000 and size 4 megabytes. To +find out your base address, you need to look at the output of your X +server, which tells you where the linear framebuffer address is. A +typical line that you may get is:: + + (--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000 + +Note that you should only use the value from the X server, as it may +move the framebuffer base address, so the only value you can trust is +that reported by the X server. + +To find out the size of your framebuffer (what, you don't actually +know?), the following line will tell you:: + + (--) S3: videoram: 4096k + +That's 4 megabytes, which is 0x400000 bytes (in hexadecimal). +A patch is being written for XFree86 which will make this automatic: +in other words the X server will manipulate /proc/mtrr using the +ioctl() interface, so users won't have to do anything. If you use a +commercial X server, lobby your vendor to add support for MTRRs. + + +Creating overlapping MTRRs +========================== +:: + + %echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr + %echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr + +And the results:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1 + reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1 + reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1 + +Some cards (especially Voodoo Graphics boards) need this 4 kB area +excluded from the beginning of the region because it is used for +registers. + +NOTE: You can only create type=uncachable region, if the first +region that you created is type=write-combining. + + +Removing MTRRs from the C-shel +============================== +:: + + % echo "disable=2" >! /proc/mtrr + +or using bash:: + + % echo "disable=2" >| /proc/mtrr + + +Reading MTRRs from a C program using ioctl()'s +============================================== +:: + + /* mtrr-show.c + + Source file for mtrr-show (example program to show MTRRs using ioctl()'s) + + Copyright (C) 1997-1998 Richard Gooch + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + Richard Gooch may be reached by email at rgooch@atnf.csiro.au + The postal address is: + Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. + */ + + /* + This program will use an ioctl() on /proc/mtrr to show the current MTRR + settings. This is an alternative to reading /proc/mtrr. + + + Written by Richard Gooch 17-DEC-1997 + + Last updated by Richard Gooch 2-MAY-1998 + + + */ + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <sys/types.h> + #include <sys/stat.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <errno.h> + #include <asm/mtrr.h> + + #define TRUE 1 + #define FALSE 0 + #define ERRSTRING strerror (errno) + + static char *mtrr_strings[MTRR_NUM_TYPES] = + { + "uncachable", /* 0 */ + "write-combining", /* 1 */ + "?", /* 2 */ + "?", /* 3 */ + "write-through", /* 4 */ + "write-protect", /* 5 */ + "write-back", /* 6 */ + }; + + int main () + { + int fd; + struct mtrr_gentry gentry; + + if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 ) + { + if (errno == ENOENT) + { + fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", + stderr); + exit (1); + } + fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); + exit (2); + } + for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0; + ++gentry.regnum) + { + if (gentry.size < 1) + { + fprintf (stderr, "Register: %u disabled\n", gentry.regnum); + continue; + } + fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n", + gentry.regnum, gentry.base, gentry.size, + mtrr_strings[gentry.type]); + } + if (errno == EINVAL) exit (0); + fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); + exit (3); + } /* End Function main */ + + +Creating MTRRs from a C programme using ioctl()'s +================================================= +:: + + /* mtrr-add.c + + Source file for mtrr-add (example programme to add an MTRRs using ioctl()) + + Copyright (C) 1997-1998 Richard Gooch + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + Richard Gooch may be reached by email at rgooch@atnf.csiro.au + The postal address is: + Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. + */ + + /* + This programme will use an ioctl() on /proc/mtrr to add an entry. The first + available mtrr is used. This is an alternative to writing /proc/mtrr. + + + Written by Richard Gooch 17-DEC-1997 + + Last updated by Richard Gooch 2-MAY-1998 + + + */ + #include <stdio.h> + #include <string.h> + #include <stdlib.h> + #include <unistd.h> + #include <sys/types.h> + #include <sys/stat.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <errno.h> + #include <asm/mtrr.h> + + #define TRUE 1 + #define FALSE 0 + #define ERRSTRING strerror (errno) + + static char *mtrr_strings[MTRR_NUM_TYPES] = + { + "uncachable", /* 0 */ + "write-combining", /* 1 */ + "?", /* 2 */ + "?", /* 3 */ + "write-through", /* 4 */ + "write-protect", /* 5 */ + "write-back", /* 6 */ + }; + + int main (int argc, char **argv) + { + int fd; + struct mtrr_sentry sentry; + + if (argc != 4) + { + fprintf (stderr, "Usage:\tmtrr-add base size type\n"); + exit (1); + } + sentry.base = strtoul (argv[1], NULL, 0); + sentry.size = strtoul (argv[2], NULL, 0); + for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type) + { + if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break; + } + if (sentry.type >= MTRR_NUM_TYPES) + { + fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]); + exit (2); + } + if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 ) + { + if (errno == ENOENT) + { + fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", + stderr); + exit (3); + } + fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); + exit (4); + } + if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1) + { + fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); + exit (5); + } + fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n"); + sleep (5); + close (fd); + fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n", + stderr); + } /* End Function main */ diff --git a/Documentation/arch/x86/orc-unwinder.rst b/Documentation/arch/x86/orc-unwinder.rst new file mode 100644 index 000000000000..cdb257015bd9 --- /dev/null +++ b/Documentation/arch/x86/orc-unwinder.rst @@ -0,0 +1,182 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +ORC unwinder +============ + +Overview +======== + +The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is +similar in concept to a DWARF unwinder. The difference is that the +format of the ORC data is much simpler than DWARF, which in turn allows +the ORC unwinder to be much simpler and faster. + +The ORC data consists of unwind tables which are generated by objtool. +They contain out-of-band data which is used by the in-kernel ORC +unwinder. Objtool generates the ORC data by first doing compile-time +stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing +all the code paths of a .o file, it determines information about the +stack state at each instruction address in the file and outputs that +information to the .orc_unwind and .orc_unwind_ip sections. + +The per-object ORC sections are combined at link time and are sorted and +post-processed at boot time. The unwinder uses the resulting data to +correlate instruction addresses with their stack states at run time. + + +ORC vs frame pointers +===================== + +With frame pointers enabled, GCC adds instrumentation code to every +function in the kernel. The kernel's .text size increases by about +3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel +Gorman [1]_ have shown a slowdown of 5-10% for some workloads. + +In contrast, the ORC unwinder has no effect on text size or runtime +performance, because the debuginfo is out of band. So if you disable +frame pointers and enable the ORC unwinder, you get a nice performance +improvement across the board, and still have reliable stack traces. + +Ingo Molnar says: + + "Note that it's not just a performance improvement, but also an + instruction cache locality improvement: 3.2% .text savings almost + directly transform into a similarly sized reduction in cache + footprint. That can transform to even higher speedups for workloads + whose cache locality is borderline." + +Another benefit of ORC compared to frame pointers is that it can +reliably unwind across interrupts and exceptions. Frame pointer based +unwinds can sometimes skip the caller of the interrupted function, if it +was a leaf function or if the interrupt hit before the frame pointer was +saved. + +The main disadvantage of the ORC unwinder compared to frame pointers is +that it needs more memory to store the ORC unwind tables: roughly 2-4MB +depending on the kernel config. + + +ORC vs DWARF +============ + +ORC debuginfo's advantage over DWARF itself is that it's much simpler. +It gets rid of the complex DWARF CFI state machine and also gets rid of +the tracking of unnecessary registers. This allows the unwinder to be +much simpler, meaning fewer bugs, which is especially important for +mission critical oops code. + +The simpler debuginfo format also enables the unwinder to be much faster +than DWARF, which is important for perf and lockdep. In a basic +performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x +faster than an out-of-tree DWARF unwinder. (Note: That measurement was +taken before some performance tweaks were added, which doubled +performance, so the speedup over DWARF may be closer to 40x.) + +The ORC data format does have a few downsides compared to DWARF. ORC +unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel) +than DWARF-based eh_frame tables. + +Another potential downside is that, as GCC evolves, it's conceivable +that the ORC data may end up being *too* simple to describe the state of +the stack for certain optimizations. But IMO this is unlikely because +GCC saves the frame pointer for any unusual stack adjustments it does, +so I suspect we'll really only ever need to keep track of the stack +pointer and the frame pointer between call frames. But even if we do +end up having to track all the registers DWARF tracks, at least we will +still be able to control the format, e.g. no complex state machines. + + +ORC unwind table generation +=========================== + +The ORC data is generated by objtool. With the existing compile-time +stack metadata validation feature, objtool already follows all code +paths, and so it already has all the information it needs to be able to +generate ORC data from scratch. So it's an easy step to go from stack +validation to ORC data generation. + +It should be possible to instead generate the ORC data with a simple +tool which converts DWARF to ORC data. However, such a solution would +be incomplete due to the kernel's extensive use of asm, inline asm, and +special sections like exception tables. + +That could be rectified by manually annotating those special code paths +using GNU assembler .cfi annotations in .S files, and homegrown +annotations for inline asm in .c files. But asm annotations were tried +in the past and were found to be unmaintainable. They were often +incorrect/incomplete and made the code harder to read and keep updated. +And based on looking at glibc code, annotating inline asm in .c files +might be even worse. + +Objtool still needs a few annotations, but only in code which does +unusual things to the stack like entry code. And even then, far fewer +annotations are needed than what DWARF would need, so they're much more +maintainable than DWARF CFI annotations. + +So the advantages of using objtool to generate ORC data are that it +gives more accurate debuginfo, with very few annotations. It also +insulates the kernel from toolchain bugs which can be very painful to +deal with in the kernel since we often have to workaround issues in +older versions of the toolchain for years. + +The downside is that the unwinder now becomes dependent on objtool's +ability to reverse engineer GCC code flow. If GCC optimizations become +too complicated for objtool to follow, the ORC data generation might +stop working or become incomplete. (It's worth noting that livepatch +already has such a dependency on objtool's ability to follow GCC code +flow.) + +If newer versions of GCC come up with some optimizations which break +objtool, we may need to revisit the current implementation. Some +possible solutions would be asking GCC to make the optimizations more +palatable, or having objtool use DWARF as an additional input, or +creating a GCC plugin to assist objtool with its analysis. But for now, +objtool follows GCC code quite well. + + +Unwinder implementation details +=============================== + +Objtool generates the ORC data by integrating with the compile-time +stack metadata validation feature, which is described in detail in +tools/objtool/Documentation/objtool.txt. After analyzing all +the code paths of a .o file, it creates an array of orc_entry structs, +and a parallel array of instruction addresses associated with those +structs, and writes them to the .orc_unwind and .orc_unwind_ip sections +respectively. + +The ORC data is split into the two arrays for performance reasons, to +make the searchable part of the data (.orc_unwind_ip) more compact. The +arrays are sorted in parallel at boot time. + +Performance is further improved by the use of a fast lookup table which +is created at runtime. The fast lookup table associates a given address +with a range of indices for the .orc_unwind table, so that only a small +subset of the table needs to be searched. + + +Etymology +========= + +Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural +enemies. Similarly, the ORC unwinder was created in opposition to the +complexity and slowness of DWARF. + +"Although Orcs rarely consider multiple solutions to a problem, they do +excel at getting things done because they are creatures of action, not +thought." [3]_ Similarly, unlike the esoteric DWARF unwinder, the +veracious ORC unwinder wastes no time or siloconic effort decoding +variable-length zero-extended unsigned-integer byte-coded +state-machine-based debug information entries. + +Similar to how Orcs frequently unravel the well-intentioned plans of +their adversaries, the ORC unwinder frequently unravels stacks with +brutal, unyielding efficiency. + +ORC stands for Oops Rewind Capability. + + +.. [1] https://lore.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de +.. [2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz +.. [3] http://dustin.wikidot.com/half-orcs-and-orcs diff --git a/Documentation/arch/x86/pat.rst b/Documentation/arch/x86/pat.rst new file mode 100644 index 000000000000..5d901771016d --- /dev/null +++ b/Documentation/arch/x86/pat.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +PAT (Page Attribute Table) +========================== + +x86 Page Attribute Table (PAT) allows for setting the memory attribute at the +page level granularity. PAT is complementary to the MTRR settings which allows +for setting of memory types over physical address ranges. However, PAT is +more flexible than MTRR due to its capability to set attributes at page level +and also due to the fact that there are no hardware limitations on number of +such attribute settings allowed. Added flexibility comes with guidelines for +not having memory type aliasing for the same physical memory with multiple +virtual addresses. + +PAT allows for different types of memory attributes. The most commonly used +ones that will be supported at this time are: + +=== ============== +WB Write-back +UC Uncached +WC Write-combined +WT Write-through +UC- Uncached Minus +=== ============== + + +PAT APIs +======== + +There are many different APIs in the kernel that allows setting of memory +attributes at the page level. In order to avoid aliasing, these interfaces +should be used thoughtfully. Below is a table of interfaces available, +their intended usage and their memory attribute relationships. Internally, +these APIs use a reserve_memtype()/free_memtype() interface on the physical +address range to avoid any aliasing. + ++------------------------+----------+--------------+------------------+ +| API | RAM | ACPI,... | Reserved/Holes | ++------------------------+----------+--------------+------------------+ +| ioremap | -- | UC- | UC- | ++------------------------+----------+--------------+------------------+ +| ioremap_cache | -- | WB | WB | ++------------------------+----------+--------------+------------------+ +| ioremap_uc | -- | UC | UC | ++------------------------+----------+--------------+------------------+ +| ioremap_wc | -- | -- | WC | ++------------------------+----------+--------------+------------------+ +| ioremap_wt | -- | -- | WT | ++------------------------+----------+--------------+------------------+ +| set_memory_uc, | UC- | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| set_memory_wc, | WC | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| set_memory_wt, | WT | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| pci sysfs resource | -- | -- | UC- | ++------------------------+----------+--------------+------------------+ +| pci sysfs resource_wc | -- | -- | WC | +| is IORESOURCE_PREFETCH | | | | ++------------------------+----------+--------------+------------------+ +| pci proc | -- | -- | UC- | +| !PCIIOC_WRITE_COMBINE | | | | ++------------------------+----------+--------------+------------------+ +| pci proc | -- | -- | WC | +| PCIIOC_WRITE_COMBINE | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB/WC/UC- | WB/WC/UC- | +| read-write | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | UC- | UC- | +| mmap SYNC flag | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB/WC/UC- | WB/WC/UC- | +| mmap !SYNC flag | | | | +| and | |(from existing| (from existing | +| any alias to this area | |alias) | alias) | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB | WB | +| mmap !SYNC flag | | | | +| no alias to this area | | | | +| and | | | | +| MTRR says WB | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | -- | UC- | +| mmap !SYNC flag | | | | +| no alias to this area | | | | +| and | | | | +| MTRR says !WB | | | | ++------------------------+----------+--------------+------------------+ + + +Advanced APIs for drivers +========================= + +A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range, +vmf_insert_pfn. + +Drivers wanting to export some pages to userspace do it by using mmap +interface and a combination of: + + 1) pgprot_noncached() + 2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn() + +With PAT support, a new API pgprot_writecombine is being added. So, drivers can +continue to use the above sequence, with either pgprot_noncached() or +pgprot_writecombine() in step 1, followed by step 2. + +In addition, step 2 internally tracks the region as UC or WC in memtype +list in order to ensure no conflicting mapping. + +Note that this set of APIs only works with IO (non RAM) regions. If driver +wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() +as step 0 above and also track the usage of those pages and use set_memory_wb() +before the page is freed to free pool. + +MTRR effects on PAT / non-PAT systems +===================================== + +The following table provides the effects of using write-combining MTRRs when +using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally +mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will +be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add() +is made, should already have been ioremapped with WC attributes or PAT entries, +this can be done by using ioremap_wc() / set_memory_wc(). Devices which +combine areas of IO memory desired to remain uncacheable with areas where +write-combining is desirable should consider use of ioremap_uc() followed by +set_memory_wc() to white-list effective write-combined areas. Such use is +nevertheless discouraged as the effective memory type is considered +implementation defined, yet this strategy can be used as last resort on devices +with size-constrained regions where otherwise MTRR write-combining would +otherwise not be effective. +:: + + ==== ======= === ========================= ===================== + MTRR Non-PAT PAT Linux ioremap value Effective memory type + ==== ======= === ========================= ===================== + PAT Non-PAT | PAT + |PCD | + ||PWT | + ||| | + WC 000 WB _PAGE_CACHE_MODE_WB WC | WC + WC 001 WC _PAGE_CACHE_MODE_WC WC* | WC + WC 010 UC- _PAGE_CACHE_MODE_UC_MINUS WC* | UC + WC 011 UC _PAGE_CACHE_MODE_UC UC | UC + ==== ======= === ========================= ===================== + + (*) denotes implementation defined and is discouraged + +.. note:: -- in the above table mean "Not suggested usage for the API". Some + of the --'s are strictly enforced by the kernel. Some others are not really + enforced today, but may be enforced in future. + +For ioremap and pci access through /sys or /proc - The actual type returned +can be more restrictive, in case of any existing aliasing for that address. +For example: If there is an existing uncached mapping, a new ioremap_wc can +return uncached mapping in place of write-combine requested. + +set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver +will first make a region uc, wc or wt and switch it back to wb after use. + +Over time writes to /proc/mtrr will be deprecated in favor of using PAT based +interfaces. Users writing to /proc/mtrr are suggested to use above interfaces. + +Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access +types. + +Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges. + + +PAT debugging +============= + +With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by:: + + # mount -t debugfs debugfs /sys/kernel/debug + # cat /sys/kernel/debug/x86/pat_memtype_list + PAT memtype list: + uncached-minus @ 0x7fadf000-0x7fae0000 + uncached-minus @ 0x7fb19000-0x7fb1a000 + uncached-minus @ 0x7fb1a000-0x7fb1b000 + uncached-minus @ 0x7fb1b000-0x7fb1c000 + uncached-minus @ 0x7fb1c000-0x7fb1d000 + uncached-minus @ 0x7fb1d000-0x7fb1e000 + uncached-minus @ 0x7fb1e000-0x7fb25000 + uncached-minus @ 0x7fb25000-0x7fb26000 + uncached-minus @ 0x7fb26000-0x7fb27000 + uncached-minus @ 0x7fb27000-0x7fb28000 + uncached-minus @ 0x7fb28000-0x7fb2e000 + uncached-minus @ 0x7fb2e000-0x7fb2f000 + uncached-minus @ 0x7fb2f000-0x7fb30000 + uncached-minus @ 0x7fb31000-0x7fb32000 + uncached-minus @ 0x80000000-0x90000000 + +This list shows physical address ranges and various PAT settings used to +access those physical address ranges. + +Another, more verbose way of getting PAT related debug messages is with +"debugpat" boot parameter. With this parameter, various debug messages are +printed to dmesg log. + +PAT Initialization +================== + +The following table describes how PAT is initialized under various +configurations. The PAT MSR must be updated by Linux in order to support WC +and WT attributes. Otherwise, the PAT MSR has the value programmed in it +by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests. + + ==== ===== ========================== ========= ======= + MTRR PAT Call Sequence PAT State PAT MSR + ==== ===== ========================== ========= ======= + E E MTRR -> PAT init Enabled OS + E D MTRR -> PAT init Disabled - + D E MTRR -> PAT disable Disabled BIOS + D D MTRR -> PAT disable Disabled - + - np/E PAT -> PAT disable Disabled BIOS + - np/D PAT -> PAT disable Disabled - + E !P/E MTRR -> PAT init Disabled BIOS + D !P/E MTRR -> PAT disable Disabled BIOS + !M !P/E MTRR stub -> PAT disable Disabled BIOS + ==== ===== ========================== ========= ======= + + Legend + + ========= ======================================= + E Feature enabled in CPU + D Feature disabled/unsupported in CPU + np "nopat" boot option specified + !P CONFIG_X86_PAT option unset + !M CONFIG_MTRR option unset + Enabled PAT state set to enabled + Disabled PAT state set to disabled + OS PAT initializes PAT MSR with OS setting + BIOS PAT keeps PAT MSR with BIOS setting + ========= ======================================= + diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst new file mode 100644 index 000000000000..4b858a9bad8d --- /dev/null +++ b/Documentation/arch/x86/pti.rst @@ -0,0 +1,195 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +Page Table Isolation (PTI) +========================== + +Overview +======== + +Page Table Isolation (pti, previously known as KAISER [1]_) is a +countermeasure against attacks on the shared user/kernel address +space such as the "Meltdown" approach [2]_. + +To mitigate this class of attacks, we create an independent set of +page tables for use only when running userspace applications. When +the kernel is entered via syscalls, interrupts or exceptions, the +page tables are switched to the full "kernel" copy. When the system +switches back to user mode, the user copy is used again. + +The userspace page tables contain only a minimal amount of kernel +data: only what is needed to enter/exit the kernel such as the +entry/exit functions themselves and the interrupt descriptor table +(IDT). There are a few strictly unnecessary things that get mapped +such as the first C function when entering an interrupt (see +comments in pti.c). + +This approach helps to ensure that side-channel attacks leveraging +the paging structures do not function when PTI is enabled. It can be +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. +Once enabled at compile-time, it can be disabled at boot with the +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). + +Page Table Management +===================== + +When PTI is enabled, the kernel manages two sets of page tables. +The first set is very similar to the single set which is present in +kernels without PTI. This includes a complete mapping of userspace +that the kernel can use for things like copy_to_user(). + +Although _complete_, the user portion of the kernel page tables is +crippled by setting the NX bit in the top level. This ensures +that any missed kernel->user CR3 switch will immediately crash +userspace upon executing its first instruction. + +The userspace page tables map only the kernel data needed to enter +and exit the kernel. This data is entirely contained in the 'struct +cpu_entry_area' structure which is placed in the fixmap which gives +each CPU's copy of the area a compile-time-fixed virtual address. + +For new userspace mappings, the kernel makes the entries in its +page tables like normal. The only difference is when the kernel +makes entries in the top (PGD) level. In addition to setting the +entry in the main kernel PGD, a copy of the entry is made in the +userspace page tables' PGD. + +This sharing at the PGD level also inherently shares all the lower +layers of the page tables. This leaves a single, shared set of +userspace page tables to manage. One PTE to lock, one set of +accessed bits, dirty bits, etc... + +Overhead +======== + +Protection against side-channel attacks is important. But, +this protection comes at a cost: + +1. Increased Memory Use + + a. Each process now needs an order-1 PGD instead of order-0. + (Consumes an additional 4k per process). + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB + aligned so that it can be mapped by setting a single PMD + entry. This consumes nearly 2MB of RAM once the kernel + is decompressed, but no space in the kernel image itself. + +2. Runtime Cost + + a. CR3 manipulation to switch between the page table copies + must be done at interrupt, syscall, and exception entry + and exit (it can be skipped when the kernel is interrupted, + though.) Moves to CR3 are on the order of a hundred + cycles, and are required at every entry and exit. + b. A "trampoline" must be used for SYSCALL entry. This + trampoline depends on a smaller set of resources than the + non-PTI SYSCALL entry code, so requires mapping fewer + things into the userspace page tables. The downside is + that stacks must be switched at entry time. + c. Global pages are disabled for all kernel structures not + mapped into both kernel and userspace page tables. This + feature of the MMU allows different processes to share TLB + entries mapping the kernel. Losing the feature means more + TLB misses after a context switch. The actual loss of + performance is very small, however, never exceeding 1%. + d. Process Context IDentifiers (PCID) is a CPU feature that + allows us to skip flushing the entire TLB when switching page + tables by setting a special bit in CR3 when the page tables + are changed. This makes switching the page tables (at context + switch, or kernel entry/exit) cheaper. But, on systems with + PCID support, the context switch code must flush both the user + and kernel entries out of the TLB. The user PCID TLB flush is + deferred until the exit to userspace, minimizing the cost. + See intel.com/sdm for the gory PCID/INVPCID details. + e. The userspace page tables must be populated for each new + process. Even without PTI, the shared kernel mappings + are created by copying top-level (PGD) entries into each + new process. But, with PTI, there are now *two* kernel + mappings: one in the kernel page tables that maps everything + and one for the entry/exit structures. At fork(), we need to + copy both. + f. In addition to the fork()-time copying, there must also + be an update to the userspace PGD any time a set_pgd() is done + on a PGD used to map userspace. This ensures that the kernel + and userspace copies always map the same userspace + memory. + g. On systems without PCID support, each CR3 write flushes + the entire TLB. That means that each syscall, interrupt + or exception flushes the TLB. + h. INVPCID is a TLB-flushing instruction which allows flushing + of TLB entries for non-current PCIDs. Some systems support + PCIDs, but do not support INVPCID. On these systems, addresses + can only be flushed from the TLB for the current PCID. When + flushing a kernel address, we need to flush all PCIDs, so a + single kernel address flush will require a TLB-flushing CR3 + write upon the next use of every PCID. + +Possible Future Work +==================== +1. We can be more careful about not actually writing to CR3 + unless its value is actually changed. +2. Allow PTI to be enabled/disabled at runtime in addition to the + boot-time switching. + +Testing +======== + +To test stability of PTI, the following test procedure is recommended, +ideally doing all of these in parallel: + +1. Set CONFIG_DEBUG_ENTRY=y +2. Run several copies of all of the tools/testing/selftests/x86/ tests + (excluding MPX and protection_keys) in a loop on multiple CPUs for + several minutes. These tests frequently uncover corner cases in the + kernel entry code. In general, old kernels might cause these tests + themselves to crash, but they should never crash the kernel. +3. Run the 'perf' tool in a mode (top or record) that generates many + frequent performance monitoring non-maskable interrupts (see "NMI" + in /proc/interrupts). This exercises the NMI entry/exit code which + is known to trigger bugs in code paths that did not expect to be + interrupted, including nested NMIs. Using "-c" boosts the rate of + NMIs, and using two -c with separate counters encourages nested NMIs + and less deterministic behavior. + :: + + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done + +4. Launch a KVM virtual machine. +5. Run 32-bit binaries on systems supporting the SYSCALL instruction. + This has been a lightly-tested code path and needs extra scrutiny. + +Debugging +========= + +Bugs in PTI cause a few different signatures of crashes +that are worth noting here. + + * Failures of the selftests/x86 code. Usually a bug in one of the + more obscure corners of entry_64.S + * Crashes in early boot, especially around CPU bringup. Bugs + in the trampoline code or mappings cause these. + * Crashes at the first interrupt. Caused by bugs in entry_64.S, + like screwing up a page table switch. Also caused by + incorrectly mapping the IRQ handler entry code. + * Crashes at the first NMI. The NMI code is separate from main + interrupt handlers and can have bugs that do not affect + normal interrupts. Also caused by incorrectly mapping NMI + code. NMIs that interrupt the entry code must be very + careful and can be the cause of crashes that show up when + running perf. + * Kernel crashes at the first exit to userspace. entry_64.S + bugs, or failing to map some of the exit code. + * Crashes at first interrupt that interrupts userspace. The paths + in entry_64.S that return to userspace are sometimes separate + from the ones that return to the kernel. + * Double faults: overflowing the kernel stack because of page + faults upon page faults. Caused by touching non-pti-mapped + data in the entry code, or forgetting to switch to kernel + CR3 before calling into C functions which are not pti-mapped. + * Userspace segfaults early in boot, sometimes manifesting + as mount(8) failing to mount the rootfs. These have + tended to be TLB invalidation issues. Usually invalidating + the wrong PCID, or otherwise missing an invalidation. + +.. [1] https://gruss.cc/files/kaiser.pdf +.. [2] https://meltdownattack.com/meltdown.pdf diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst new file mode 100644 index 000000000000..387ccbcb558f --- /dev/null +++ b/Documentation/arch/x86/resctrl.rst @@ -0,0 +1,1447 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +=========================================== +User Interface for Resource Control feature +=========================================== + +:Copyright: |copy| 2016 Intel Corporation +:Authors: - Fenghua Yu <fenghua.yu@intel.com> + - Tony Luck <tony.luck@intel.com> + - Vikas Shivappa <vikas.shivappa@intel.com> + + +Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). +AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). + +This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo +flag bits: + +=============================================== ================================ +RDT (Resource Director Technology) Allocation "rdt_a" +CAT (Cache Allocation Technology) "cat_l3", "cat_l2" +CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2" +CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc" +MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" +MBA (Memory Bandwidth Allocation) "mba" +SMBA (Slow Memory Bandwidth Allocation) "" +BMEC (Bandwidth Monitoring Event Configuration) "" +=============================================== ================================ + +Historically, new features were made visible by default in /proc/cpuinfo. This +resulted in the feature flags becoming hard to parse by humans. Adding a new +flag to /proc/cpuinfo should be avoided if user space can obtain information +about the feature from resctrl's info directory. + +To use the feature mount the file system:: + + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl + +mount options are: + +"cdp": + Enable code/data prioritization in L3 cache allocations. +"cdpl2": + Enable code/data prioritization in L2 cache allocations. +"mba_MBps": + Enable the MBA Software Controller(mba_sc) to specify MBA + bandwidth in MBps + +L2 and L3 CDP are controlled separately. + +RDT features are orthogonal. A particular system may support only +monitoring, only control, or both monitoring and control. Cache +pseudo-locking is a unique way of using cache control to "pin" or +"lock" data in the cache. Details can be found in +"Cache Pseudo-Locking". + + +The mount succeeds if either of allocation or monitoring is present, but +only those files and directories supported by the system will be created. +For more details on the behavior of the interface during monitoring +and allocation, see the "Resource alloc and monitor groups" section. + +Info directory +============== + +The 'info' directory contains information about the enabled +resources. Each resource has its own subdirectory. The subdirectory +names reflect the resource names. + +Each subdirectory contains the following files with respect to +allocation: + +Cache resource(L3/L2) subdirectory contains the following files +related to allocation: + +"num_closids": + The number of CLOSIDs which are valid for this + resource. The kernel uses the smallest number of + CLOSIDs of all enabled resources as limit. +"cbm_mask": + The bitmask which is valid for this resource. + This mask is equivalent to 100%. +"min_cbm_bits": + The minimum number of consecutive bits which + must be set when writing a mask. + +"shareable_bits": + Bitmask of shareable resource with other executing + entities (e.g. I/O). User can use this when + setting up exclusive cache partitions. Note that + some platforms support devices that have their + own settings for cache use which can over-ride + these bits. +"bit_usage": + Annotated capacity bitmasks showing how all + instances of the resource are used. The legend is: + + "0": + Corresponding region is unused. When the system's + resources have been allocated and a "0" is found + in "bit_usage" it is a sign that resources are + wasted. + + "H": + Corresponding region is used by hardware only + but available for software use. If a resource + has bits set in "shareable_bits" but not all + of these bits appear in the resource groups' + schematas then the bits appearing in + "shareable_bits" but no resource group will + be marked as "H". + "X": + Corresponding region is available for sharing and + used by hardware and software. These are the + bits that appear in "shareable_bits" as + well as a resource group's allocation. + "S": + Corresponding region is used by software + and available for sharing. + "E": + Corresponding region is used exclusively by + one resource group. No sharing allowed. + "P": + Corresponding region is pseudo-locked. No + sharing allowed. + +Memory bandwidth(MB) subdirectory contains the following files +with respect to allocation: + +"min_bandwidth": + The minimum memory bandwidth percentage which + user can request. + +"bandwidth_gran": + The granularity in which the memory bandwidth + percentage is allocated. The allocated + b/w percentage is rounded off to the next + control step available on the hardware. The + available bandwidth control steps are: + min_bandwidth + N * bandwidth_gran. + +"delay_linear": + Indicates if the delay scale is linear or + non-linear. This field is purely informational + only. + +"thread_throttle_mode": + Indicator on Intel systems of how tasks running on threads + of a physical core are throttled in cases where they + request different memory bandwidth percentages: + + "max": + the smallest percentage is applied + to all threads + "per-thread": + bandwidth percentages are directly applied to + the threads running on the core + +If RDT monitoring is available there will be an "L3_MON" directory +with the following files: + +"num_rmids": + The number of RMIDs available. This is the + upper bound for how many "CTRL_MON" + "MON" + groups can be created. + +"mon_features": + Lists the monitoring events if + monitoring is enabled for the resource. + Example:: + + # cat /sys/fs/resctrl/info/L3_MON/mon_features + llc_occupancy + mbm_total_bytes + mbm_local_bytes + + If the system supports Bandwidth Monitoring Event + Configuration (BMEC), then the bandwidth events will + be configurable. The output will be:: + + # cat /sys/fs/resctrl/info/L3_MON/mon_features + llc_occupancy + mbm_total_bytes + mbm_total_bytes_config + mbm_local_bytes + mbm_local_bytes_config + +"mbm_total_bytes_config", "mbm_local_bytes_config": + Read/write files containing the configuration for the mbm_total_bytes + and mbm_local_bytes events, respectively, when the Bandwidth + Monitoring Event Configuration (BMEC) feature is supported. + The event configuration settings are domain specific and affect + all the CPUs in the domain. When either event configuration is + changed, the bandwidth counters for all RMIDs of both events + (mbm_total_bytes as well as mbm_local_bytes) are cleared for that + domain. The next read for every RMID will report "Unavailable" + and subsequent reads will report the valid value. + + Following are the types of events supported: + + ==== ======================================================== + Bits Description + ==== ======================================================== + 6 Dirty Victims from the QOS domain to all types of memory + 5 Reads to slow memory in the non-local NUMA domain + 4 Reads to slow memory in the local NUMA domain + 3 Non-temporal writes to non-local NUMA domain + 2 Non-temporal writes to local NUMA domain + 1 Reads to memory in the non-local NUMA domain + 0 Reads to memory in the local NUMA domain + ==== ======================================================== + + By default, the mbm_total_bytes configuration is set to 0x7f to count + all the event types and the mbm_local_bytes configuration is set to + 0x15 to count all the local memory events. + + Examples: + + * To view the current configuration:: + :: + + # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + 0=0x7f;1=0x7f;2=0x7f;3=0x7f + + # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + 0=0x15;1=0x15;3=0x15;4=0x15 + + * To change the mbm_total_bytes to count only reads on domain 0, + the bits 0, 1, 4 and 5 needs to be set, which is 110011b in binary + (in hexadecimal 0x33): + :: + + # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + + # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + 0=0x33;1=0x7f;2=0x7f;3=0x7f + + * To change the mbm_local_bytes to count all the slow memory reads on + domain 0 and 1, the bits 4 and 5 needs to be set, which is 110000b + in binary (in hexadecimal 0x30): + :: + + # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + + # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + 0=0x30;1=0x30;3=0x15;4=0x15 + +"max_threshold_occupancy": + Read/write file provides the largest value (in + bytes) at which a previously used LLC_occupancy + counter can be considered for re-use. + +Finally, in the top level of the "info" directory there is a file +named "last_cmd_status". This is reset with every "command" issued +via the file system (making new directories or writing to any of the +control files). If the command was successful, it will read as "ok". +If the command failed, it will provide more information that can be +conveyed in the error returns from file operations. E.g. +:: + + # echo L3:0=f7 > schemata + bash: echo: write error: Invalid argument + # cat info/last_cmd_status + mask f7 has non-consecutive 1-bits + +Resource alloc and monitor groups +================================= + +Resource groups are represented as directories in the resctrl file +system. The default group is the root directory which, immediately +after mounting, owns all the tasks and cpus in the system and can make +full use of all resources. + +On a system with RDT control features additional directories can be +created in the root directory that specify different amounts of each +resource (see "schemata" below). The root and these additional top level +directories are referred to as "CTRL_MON" groups below. + +On a system with RDT monitoring the root directory and other top level +directories contain a directory named "mon_groups" in which additional +directories can be created to monitor subsets of tasks in the CTRL_MON +group that is their ancestor. These are called "MON" groups in the rest +of this document. + +Removing a directory will move all tasks and cpus owned by the group it +represents to the parent. Removing one of the created CTRL_MON groups +will automatically remove all MON groups below it. + +All groups contain the following files: + +"tasks": + Reading this file shows the list of all tasks that belong to + this group. Writing a task id to the file will add a task to the + group. If the group is a CTRL_MON group the task is removed from + whichever previous CTRL_MON group owned the task and also from + any MON group that owned the task. If the group is a MON group, + then the task must already belong to the CTRL_MON parent of this + group. The task is removed from any previous MON group. + + +"cpus": + Reading this file shows a bitmask of the logical CPUs owned by + this group. Writing a mask to this file will add and remove + CPUs to/from this group. As with the tasks file a hierarchy is + maintained where MON groups may only include CPUs owned by the + parent CTRL_MON group. + When the resource group is in pseudo-locked mode this file will + only be readable, reflecting the CPUs associated with the + pseudo-locked region. + + +"cpus_list": + Just like "cpus", only using ranges of CPUs instead of bitmasks. + + +When control is enabled all CTRL_MON groups will also contain: + +"schemata": + A list of all the resources available to this group. + Each resource has its own line and format - see below for details. + +"size": + Mirrors the display of the "schemata" file to display the size in + bytes of each allocation instead of the bits representing the + allocation. + +"mode": + The "mode" of the resource group dictates the sharing of its + allocations. A "shareable" resource group allows sharing of its + allocations while an "exclusive" resource group does not. A + cache pseudo-locked region is created by first writing + "pseudo-locksetup" to the "mode" file before writing the cache + pseudo-locked region's schemata to the resource group's "schemata" + file. On successful pseudo-locked region creation the mode will + automatically change to "pseudo-locked". + +When monitoring is enabled all MON groups will also contain: + +"mon_data": + This contains a set of files organized by L3 domain and by + RDT event. E.g. on a system with two L3 domains there will + be subdirectories "mon_L3_00" and "mon_L3_01". Each of these + directories have one file per event (e.g. "llc_occupancy", + "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these + files provide a read out of the current value of the event for + all tasks in the group. In CTRL_MON groups these files provide + the sum for all tasks in the CTRL_MON group and all tasks in + MON groups. Please see example section for more details on usage. + +Resource allocation rules +------------------------- + +When a task is running the following rules define which resources are +available to it: + +1) If the task is a member of a non-default group, then the schemata + for that group is used. + +2) Else if the task belongs to the default group, but is running on a + CPU that is assigned to some specific group, then the schemata for the + CPU's group is used. + +3) Otherwise the schemata for the default group is used. + +Resource monitoring rules +------------------------- +1) If a task is a member of a MON group, or non-default CTRL_MON group + then RDT events for the task will be reported in that group. + +2) If a task is a member of the default CTRL_MON group, but is running + on a CPU that is assigned to some specific group, then the RDT events + for the task will be reported in that group. + +3) Otherwise RDT events for the task will be reported in the root level + "mon_data" group. + + +Notes on cache occupancy monitoring and control +=============================================== +When moving a task from one group to another you should remember that +this only affects *new* cache allocations by the task. E.g. you may have +a task in a monitor group showing 3 MB of cache occupancy. If you move +to a new group and immediately check the occupancy of the old and new +groups you will likely see that the old group is still showing 3 MB and +the new group zero. When the task accesses locations still in cache from +before the move, the h/w does not update any counters. On a busy system +you will likely see the occupancy in the old group go down as cache lines +are evicted and re-used while the occupancy in the new group rises as +the task accesses memory and loads into the cache are counted based on +membership in the new group. + +The same applies to cache allocation control. Moving a task to a group +with a smaller cache partition will not evict any cache lines. The +process may continue to use them from the old partition. + +Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID) +to identify a control group and a monitoring group respectively. Each of +the resource groups are mapped to these IDs based on the kind of group. The +number of CLOSid and RMID are limited by the hardware and hence the creation of +a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID +and creation of "MON" group may fail if we run out of RMIDs. + +max_threshold_occupancy - generic concepts +------------------------------------------ + +Note that an RMID once freed may not be immediately available for use as +the RMID is still tagged the cache lines of the previous user of RMID. +Hence such RMIDs are placed on limbo list and checked back if the cache +occupancy has gone down. If there is a time when system has a lot of +limbo RMIDs but which are not ready to be used, user may see an -EBUSY +during mkdir. + +max_threshold_occupancy is a user configurable value to determine the +occupancy at which an RMID can be freed. + +Schemata files - general concepts +--------------------------------- +Each line in the file describes one resource. The line starts with +the name of the resource, followed by specific values to be applied +in each of the instances of that resource on the system. + +Cache IDs +--------- +On current generation systems there is one L3 cache per socket and L2 +caches are generally just shared by the hyperthreads on a core, but this +isn't an architectural requirement. We could have multiple separate L3 +caches on a socket, multiple cores could share an L2 cache. So instead +of using "socket" or "core" to define the set of logical cpus sharing +a resource we use a "Cache ID". At a given cache level this will be a +unique number across the whole system (but it isn't guaranteed to be a +contiguous sequence, there may be gaps). To find the ID for each logical +CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id + +Cache Bit Masks (CBM) +--------------------- +For cache resources we describe the portion of the cache that is available +for allocation using a bitmask. The maximum value of the mask is defined +by each cpu model (and may be different for different cache levels). It +is found using CPUID, but is also provided in the "info" directory of +the resctrl file system in "info/{resource}/cbm_mask". Intel hardware +requires that these masks have all the '1' bits in a contiguous block. So +0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 +and 0xA are not. On a system with a 20-bit mask each bit represents 5% +of the capacity of the cache. You could partition the cache into four +equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. + +Memory bandwidth Allocation and monitoring +========================================== + +For Memory bandwidth resource, by default the user controls the resource +by indicating the percentage of total memory bandwidth. + +The minimum bandwidth percentage value for each cpu model is predefined +and can be looked up through "info/MB/min_bandwidth". The bandwidth +granularity that is allocated is also dependent on the cpu model and can +be looked up at "info/MB/bandwidth_gran". The available bandwidth +control steps are: min_bw + N * bw_gran. Intermediate values are rounded +to the next control step available on the hardware. + +The bandwidth throttling is a core specific mechanism on some of Intel +SKUs. Using a high bandwidth and a low bandwidth setting on two threads +sharing a core may result in both threads being throttled to use the +low bandwidth (see "thread_throttle_mode"). + +The fact that Memory bandwidth allocation(MBA) may be a core +specific mechanism where as memory bandwidth monitoring(MBM) is done at +the package level may lead to confusion when users try to apply control +via the MBA and then monitor the bandwidth to see if the controls are +effective. Below are such scenarios: + +1. User may *not* see increase in actual bandwidth when percentage + values are increased: + +This can occur when aggregate L2 external bandwidth is more than L3 +external bandwidth. Consider an SKL SKU with 24 cores on a package and +where L2 external is 10GBps (hence aggregate L2 external bandwidth is +240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 +threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 +bandwidth of 100GBps although the percentage value specified is only 50% +<< 100%. Hence increasing the bandwidth percentage will not yield any +more bandwidth. This is because although the L2 external bandwidth still +has capacity, the L3 external bandwidth is fully used. Also note that +this would be dependent on number of cores the benchmark is run on. + +2. Same bandwidth percentage may mean different actual bandwidth + depending on # of threads: + +For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 +thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although +they have same percentage bandwidth of 10%. This is simply because as +threads start using more cores in an rdtgroup, the actual bandwidth may +increase or vary although user specified bandwidth percentage is same. + +In order to mitigate this and make the interface more user friendly, +resctrl added support for specifying the bandwidth in MBps as well. The +kernel underneath would use a software feedback mechanism or a "Software +Controller(mba_sc)" which reads the actual bandwidth using MBM counters +and adjust the memory bandwidth percentages to ensure:: + + "actual bandwidth < user specified bandwidth". + +By default, the schemata would take the bandwidth percentage values +where as user can switch to the "MBA software controller" mode using +a mount option 'mba_MBps'. The schemata format is specified in the below +sections. + +L3 schemata file details (code and data prioritization disabled) +---------------------------------------------------------------- +With CDP disabled the L3 schemata format is:: + + L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +L3 schemata file details (CDP enabled via mount option to resctrl) +------------------------------------------------------------------ +When CDP is enabled L3 control is split into two separate resources +so you can specify independent masks for code and data like this:: + + L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +L2 schemata file details +------------------------ +CDP is supported at L2 using the 'cdpl2' mount option. The schemata +format is either:: + + L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +or + + L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + + +Memory bandwidth Allocation (default mode) +------------------------------------------ + +Memory b/w domain is L3 cache. +:: + + MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... + +Memory bandwidth Allocation specified in MBps +--------------------------------------------- + +Memory bandwidth domain is L3 cache. +:: + + MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... + +Slow Memory Bandwidth Allocation (SMBA) +--------------------------------------- +AMD hardware supports Slow Memory Bandwidth Allocation (SMBA). +CXL.memory is the only supported "slow" memory device. With the +support of SMBA, the hardware enables bandwidth allocation on +the slow memory devices. If there are multiple such devices in +the system, the throttling logic groups all the slow sources +together and applies the limit on them as a whole. + +The presence of SMBA (with CXL.memory) is independent of slow memory +devices presence. If there are no such devices on the system, then +configuring SMBA will have no impact on the performance of the system. + +The bandwidth domain for slow memory is L3 cache. Its schemata file +is formatted as: +:: + + SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... + +Reading/writing the schemata file +--------------------------------- +Reading the schemata file will show the state of all resources +on all domains. When writing you only need to specify those values +which you wish to change. E.g. +:: + + # cat schemata + L3DATA:0=fffff;1=fffff;2=fffff;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + # echo "L3DATA:2=3c0;" > schemata + # cat schemata + L3DATA:0=fffff;1=fffff;2=3c0;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + +Reading/writing the schemata file (on AMD systems) +-------------------------------------------------- +Reading the schemata file will show the current bandwidth limit on all +domains. The allocated resources are in multiples of one eighth GB/s. +When writing to the file, you need to specify what cache id you wish to +configure the bandwidth limit. + +For example, to allocate 2GB/s limit on the first cache id: + +:: + + # cat schemata + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + + # echo "MB:1=16" > schemata + # cat schemata + MB:0=2048;1= 16;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + +Reading/writing the schemata file (on AMD systems) with SMBA feature +-------------------------------------------------------------------- +Reading and writing the schemata file is the same as without SMBA in +above section. + +For example, to allocate 8GB/s limit on the first cache id: + +:: + + # cat schemata + SMBA:0=2048;1=2048;2=2048;3=2048 + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + + # echo "SMBA:1=64" > schemata + # cat schemata + SMBA:0=2048;1= 64;2=2048;3=2048 + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + +Cache Pseudo-Locking +==================== +CAT enables a user to specify the amount of cache space that an +application can fill. Cache pseudo-locking builds on the fact that a +CPU can still read and write data pre-allocated outside its current +allocated area on a cache hit. With cache pseudo-locking, data can be +preloaded into a reserved portion of cache that no application can +fill, and from that point on will only serve cache hits. The cache +pseudo-locked memory is made accessible to user space where an +application can map it into its virtual address space and thus have +a region of memory with reduced average read latency. + +The creation of a cache pseudo-locked region is triggered by a request +from the user to do so that is accompanied by a schemata of the region +to be pseudo-locked. The cache pseudo-locked region is created as follows: + +- Create a CAT allocation CLOSNEW with a CBM matching the schemata + from the user of the cache region that will contain the pseudo-locked + memory. This region must not overlap with any current CAT allocation/CLOS + on the system and no future overlap with this cache region is allowed + while the pseudo-locked region exists. +- Create a contiguous region of memory of the same size as the cache + region. +- Flush the cache, disable hardware prefetchers, disable preemption. +- Make CLOSNEW the active CLOS and touch the allocated memory to load + it into the cache. +- Set the previous CLOS as active. +- At this point the closid CLOSNEW can be released - the cache + pseudo-locked region is protected as long as its CBM does not appear in + any CAT allocation. Even though the cache pseudo-locked region will from + this point on not appear in any CBM of any CLOS an application running with + any CLOS will be able to access the memory in the pseudo-locked region since + the region continues to serve cache hits. +- The contiguous region of memory loaded into the cache is exposed to + user-space as a character device. + +Cache pseudo-locking increases the probability that data will remain +in the cache via carefully configuring the CAT feature and controlling +application behavior. There is no guarantee that data is placed in +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict +“locked” data from cache. Power management C-states may shrink or +power off cache. Deeper C-states will automatically be restricted on +pseudo-locked region creation. + +It is required that an application using a pseudo-locked region runs +with affinity to the cores (or a subset of the cores) associated +with the cache on which the pseudo-locked region resides. A sanity check +within the code will not allow an application to map pseudo-locked memory +unless it runs with affinity to cores associated with the cache on which the +pseudo-locked region resides. The sanity check is only done during the +initial mmap() handling, there is no enforcement afterwards and the +application self needs to ensure it remains affine to the correct cores. + +Pseudo-locking is accomplished in two stages: + +1) During the first stage the system administrator allocates a portion + of cache that should be dedicated to pseudo-locking. At this time an + equivalent portion of memory is allocated, loaded into allocated + cache portion, and exposed as a character device. +2) During the second stage a user-space application maps (mmap()) the + pseudo-locked memory into its address space. + +Cache Pseudo-Locking Interface +------------------------------ +A pseudo-locked region is created using the resctrl interface as follows: + +1) Create a new resource group by creating a new directory in /sys/fs/resctrl. +2) Change the new resource group's mode to "pseudo-locksetup" by writing + "pseudo-locksetup" to the "mode" file. +3) Write the schemata of the pseudo-locked region to the "schemata" file. All + bits within the schemata should be "unused" according to the "bit_usage" + file. + +On successful pseudo-locked region creation the "mode" file will contain +"pseudo-locked" and a new character device with the same name as the resource +group will exist in /dev/pseudo_lock. This character device can be mmap()'ed +by user space in order to obtain access to the pseudo-locked memory region. + +An example of cache pseudo-locked region creation and usage can be found below. + +Cache Pseudo-Locking Debugging Interface +---------------------------------------- +The pseudo-locking debugging interface is enabled by default (if +CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. + +There is no explicit way for the kernel to test if a provided memory +location is present in the cache. The pseudo-locking debugging interface uses +the tracing infrastructure to provide two ways to measure cache residency of +the pseudo-locked region: + +1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data + from these measurements are best visualized using a hist trigger (see + example below). In this test the pseudo-locked region is traversed at + a stride of 32 bytes while hardware prefetchers and preemption + are disabled. This also provides a substitute visualization of cache + hits and misses. +2) Cache hit and miss measurements using model specific precision counters if + available. Depending on the levels of cache on the system the pseudo_lock_l2 + and pseudo_lock_l3 tracepoints are available. + +When a pseudo-locked region is created a new debugfs directory is created for +it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single +write-only file, pseudo_lock_measure, is present in this directory. The +measurement of the pseudo-locked region depends on the number written to this +debugfs file: + +1: + writing "1" to the pseudo_lock_measure file will trigger the latency + measurement captured in the pseudo_lock_mem_latency tracepoint. See + example below. +2: + writing "2" to the pseudo_lock_measure file will trigger the L2 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l2 tracepoint. See example below. +3: + writing "3" to the pseudo_lock_measure file will trigger the L3 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l3 tracepoint. + +All measurements are recorded with the tracing infrastructure. This requires +the relevant tracepoints to be enabled before the measurement is triggered. + +Example of latency debugging interface +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created. Here is +how we can measure the latency in cycles of reading from this region and +visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS +is set:: + + # :> /sys/kernel/tracing/trace + # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger + # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist + + # event histogram + # + # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] + # + + { latency: 456 } hitcount: 1 + { latency: 50 } hitcount: 83 + { latency: 36 } hitcount: 96 + { latency: 44 } hitcount: 174 + { latency: 48 } hitcount: 195 + { latency: 46 } hitcount: 262 + { latency: 42 } hitcount: 693 + { latency: 40 } hitcount: 3204 + { latency: 38 } hitcount: 3484 + + Totals: + Hits: 8192 + Entries: 9 + Dropped: 0 + +Example of cache hits/misses debugging +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created on the L2 +cache of a platform. Here is how we can obtain details of the cache hits +and misses using the platform's precision counters. +:: + + # :> /sys/kernel/tracing/trace + # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable + # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable + # cat /sys/kernel/tracing/trace + + # tracer: nop + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / delay + # TASK-PID CPU# |||| TIMESTAMP FUNCTION + # | | | |||| | | + pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 + + +Examples for RDT allocation usage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1) Example 1 + +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks, minimum b/w of 10% with a memory bandwidth +granularity of 10%. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Similarly, tasks that are under the control of group "p0" may use a +maximum memory b/w of 50% on socket0 and 50% on socket 1. +Tasks in group "p1" may also use 50% memory b/w on both sockets. +Note that unlike cache masks, memory b/w cannot specify whether these +allocations can overlap or not. The allocations specifies the maximum +b/w that the group may be able to use and the system admin can configure +the b/w accordingly. + +If resctrl is using the software controller (mba_sc) then user can enter the +max b/w in MB rather than the percentage values. +:: + + # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata + +In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w +of 1024MB where as on socket 1 they would use 500MB. + +2) Example 2 + +Again two sockets, but this time with a more realistic 20-bit mask. + +Two real time tasks pid=1234 running on processor 0 and pid=5678 running on +processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy +neighbors, each of the two real-time tasks exclusively occupies one quarter +of L3 cache on socket 0. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by +ordinary tasks:: + + # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata + +Next we make a resource group for our first real time task and give +it access to the "top" 25% of the cache on socket 0. +:: + + # mkdir p0 + # echo "L3:0=f8000;1=fffff" > p0/schemata + +Finally we move our first real time task into this resource group. We +also use taskset(1) to ensure the task always runs on a dedicated CPU +on socket 0. Most uses of resource groups will also constrain which +processors tasks run on. +:: + + # echo 1234 > p0/tasks + # taskset -cp 1 1234 + +Ditto for the second real time task (with the remaining 25% of cache):: + + # mkdir p1 + # echo "L3:0=7c00;1=fffff" > p1/schemata + # echo 5678 > p1/tasks + # taskset -cp 2 5678 + +For the same 2 socket system with memory b/w resource and CAT L3 the +schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is +10): + +For our first real time task this would request 20% memory b/w on socket 0. +:: + + # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata + +For our second real time task this would request an other 20% memory b/w +on socket 0. +:: + + # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata + +3) Example 3 + +A single socket system which has real-time tasks running on core 4-7 and +non real-time workload assigned to core 0-3. The real-time tasks share text +and data, so a per task association is not required and due to interaction +with the kernel it's desired that the kernel on these cores shares L3 with +the tasks. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 +cannot be used by ordinary tasks:: + + # echo "L3:0=3ff\nMB:0=50" > schemata + +Next we make a resource group for our real time cores and give it access +to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on +socket 0. +:: + + # mkdir p0 + # echo "L3:0=ffc00\nMB:0=50" > p0/schemata + +Finally we move core 4-7 over to the new group and make sure that the +kernel and the tasks running there get 50% of the cache. They should +also get 50% of memory bandwidth assuming that the cores 4-7 are SMT +siblings and only the real time threads are scheduled on the cores 4-7. +:: + + # echo F0 > p0/cpus + +4) Example 4 + +The resource groups in previous examples were all in the default "shareable" +mode allowing sharing of their cache allocations. If one resource group +configures a cache allocation then nothing prevents another resource group +to overlap with that allocation. + +In this example a new exclusive resource group will be created on a L2 CAT +system with two L2 cache instances that can be configured with an 8-bit +capacity bitmask. The new exclusive resource group will be configured to use +25% of each cache instance. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +First, we observe that the default group is configured to allocate to all L2 +cache:: + + # cat schemata + L2:0=ff;1=ff + +We could attempt to create the new resource group at this point, but it will +fail because of the overlap with the schemata of the default group:: + + # mkdir p0 + # echo 'L2:0=0x3;1=0x3' > p0/schemata + # cat p0/mode + shareable + # echo exclusive > p0/mode + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + schemata overlaps + +To ensure that there is no overlap with another resource group the default +resource group's schemata has to change, making it possible for the new +resource group to become exclusive. +:: + + # echo 'L2:0=0xfc;1=0xfc' > schemata + # echo exclusive > p0/mode + # grep . p0/* + p0/cpus:0 + p0/mode:exclusive + p0/schemata:L2:0=03;1=03 + p0/size:L2:0=262144;1=262144 + +A new resource group will on creation not overlap with an exclusive resource +group:: + + # mkdir p1 + # grep . p1/* + p1/cpus:0 + p1/mode:shareable + p1/schemata:L2:0=fc;1=fc + p1/size:L2:0=786432;1=786432 + +The bit_usage will reflect how the cache is used:: + + # cat info/L2/bit_usage + 0=SSSSSSEE;1=SSSSSSEE + +A resource group cannot be forced to overlap with an exclusive resource group:: + + # echo 'L2:0=0x1;1=0x1' > p1/schemata + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + overlaps with exclusive group + +Example of Cache Pseudo-Locking +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked +region is exposed at /dev/pseudo_lock/newlock that can be provided to +application for argument to mmap(). +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +Ensure that there are bits available that can be pseudo-locked, since only +unused bits can be pseudo-locked the bits to be pseudo-locked needs to be +removed from the default resource group's schemata:: + + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSSS + # echo 'L2:1=0xfc' > schemata + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSS00 + +Create a new resource group that will be associated with the pseudo-locked +region, indicate that it will be used for a pseudo-locked region, and +configure the requested pseudo-locked region capacity bitmask:: + + # mkdir newlock + # echo pseudo-locksetup > newlock/mode + # echo 'L2:1=0x3' > newlock/schemata + +On success the resource group's mode will change to pseudo-locked, the +bit_usage will reflect the pseudo-locked region, and the character device +exposing the pseudo-locked region will exist:: + + # cat newlock/mode + pseudo-locked + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSPP + # ls -l /dev/pseudo_lock/newlock + crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock + +:: + + /* + * Example code to access one page of pseudo-locked cache region + * from user space. + */ + #define _GNU_SOURCE + #include <fcntl.h> + #include <sched.h> + #include <stdio.h> + #include <stdlib.h> + #include <unistd.h> + #include <sys/mman.h> + + /* + * It is required that the application runs with affinity to only + * cores associated with the pseudo-locked region. Here the cpu + * is hardcoded for convenience of example. + */ + static int cpuid = 2; + + int main(int argc, char *argv[]) + { + cpu_set_t cpuset; + long page_size; + void *mapping; + int dev_fd; + int ret; + + page_size = sysconf(_SC_PAGESIZE); + + CPU_ZERO(&cpuset); + CPU_SET(cpuid, &cpuset); + ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); + if (ret < 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } + + dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); + if (dev_fd < 0) { + perror("open"); + exit(EXIT_FAILURE); + } + + mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, + dev_fd, 0); + if (mapping == MAP_FAILED) { + perror("mmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + /* Application interacts with pseudo-locked memory @mapping */ + + ret = munmap(mapping, page_size); + if (ret < 0) { + perror("munmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + close(dev_fd); + exit(EXIT_SUCCESS); + } + +Locking between applications +---------------------------- + +Certain operations on the resctrl filesystem, composed of read/writes +to/from multiple files, must be atomic. + +As an example, the allocation of an exclusive reservation of L3 cache +involves: + + 1. Read the cbmmasks from each directory or the per-resource "bit_usage" + 2. Find a contiguous set of bits in the global CBM bitmask that is clear + in any of the directory cbmmasks + 3. Create a new directory + 4. Set the bits found in step 2 to the new directory "schemata" file + +If two applications attempt to allocate space concurrently then they can +end up allocating the same bits so the reservations are shared instead of +exclusive. + +To coordinate atomic operations on the resctrlfs and to avoid the problem +above, the following locking procedure is recommended: + +Locking is based on flock, which is available in libc and also as a shell +script command + +Write lock: + + A) Take flock(LOCK_EX) on /sys/fs/resctrl + B) Read/write the directory structure. + C) funlock + +Read lock: + + A) Take flock(LOCK_SH) on /sys/fs/resctrl + B) If success read the directory structure. + C) funlock + +Example with bash:: + + # Atomically read directory structure + $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl + + # Read directory contents and create new subdirectory + + $ cat create-dir.sh + find /sys/fs/resctrl/ > output.txt + mask = function-of(output.txt) + mkdir /sys/fs/resctrl/newres/ + echo mask > /sys/fs/resctrl/newres/schemata + + $ flock /sys/fs/resctrl/ ./create-dir.sh + +Example with C:: + + /* + * Example code do take advisory locks + * before accessing resctrl filesystem + */ + #include <sys/file.h> + #include <stdlib.h> + + void resctrl_take_shared_lock(int fd) + { + int ret; + + /* take shared lock on resctrl filesystem */ + ret = flock(fd, LOCK_SH); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_take_exclusive_lock(int fd) + { + int ret; + + /* release lock on resctrl filesystem */ + ret = flock(fd, LOCK_EX); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_release_lock(int fd) + { + int ret; + + /* take shared lock on resctrl filesystem */ + ret = flock(fd, LOCK_UN); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void main(void) + { + int fd, ret; + + fd = open("/sys/fs/resctrl", O_DIRECTORY); + if (fd == -1) { + perror("open"); + exit(-1); + } + resctrl_take_shared_lock(fd); + /* code to read directory contents */ + resctrl_release_lock(fd); + + resctrl_take_exclusive_lock(fd); + /* code to read and write directory contents */ + resctrl_release_lock(fd); + } + +Examples for RDT Monitoring along with allocation usage +======================================================= +Reading monitored data +---------------------- +Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would +show the current snapshot of LLC occupancy of the corresponding MON +group or CTRL_MON group. + + +Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) +------------------------------------------------------------------------ +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata + # echo 5678 > p1/tasks + # echo 5679 > p1/tasks + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Create monitor groups and assign a subset of tasks to each monitor group. +:: + + # cd /sys/fs/resctrl/p1/mon_groups + # mkdir m11 m12 + # echo 5678 > m11/tasks + # echo 5679 > m12/tasks + +fetch data (data shown in bytes) +:: + + # cat m11/mon_data/mon_L3_00/llc_occupancy + 16234000 + # cat m11/mon_data/mon_L3_01/llc_occupancy + 14789000 + # cat m12/mon_data/mon_L3_00/llc_occupancy + 16789000 + +The parent ctrl_mon group shows the aggregated data. +:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy + 31234000 + +Example 2 (Monitor a task from its creation) +-------------------------------------------- +On a two socket machine (one L3 cache per socket):: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + +An RMID is allocated to the group once its created and hence the <cmd> +below is monitored from its creation. +:: + + # echo $$ > /sys/fs/resctrl/p1/tasks + # <cmd> + +Fetch the data:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy + 31789000 + +Example 3 (Monitor without CAT support or before creating CAT groups) +--------------------------------------------------------------------- + +Assume a system like HSW has only CQM and no CAT support. In this case +the resctrl will still mount but cannot create CTRL_MON directories. +But user can create different MON groups within the root group thereby +able to monitor all tasks including kernel threads. + +This can also be used to profile jobs cache size footprint before being +able to allocate them to different allocation groups. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir mon_groups/m01 + # mkdir mon_groups/m02 + + # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks + # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks + +Monitor the groups separately and also get per domain data. From the +below its apparent that the tasks are mostly doing work on +domain(socket) 0. +:: + + # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy + 34555 + # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy + 32789 + + +Example 4 (Monitor real time tasks) +----------------------------------- + +A single socket system which has real time tasks running on cores 4-7 +and non real time tasks on other cpus. We want to monitor the cache +occupancy of the real time threads on these cores. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p1 + +Move the cpus 4-7 over to p1:: + + # echo f0 > p1/cpus + +View the llc occupancy snapshot:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy + 11234000 + +Intel RDT Errata +================ + +Intel MBM Counters May Report System Memory Bandwidth Incorrectly +----------------------------------------------------------------- + +Errata SKX99 for Skylake server and BDF102 for Broadwell server. + +Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics +according to the assigned Resource Monitor ID (RMID) for that logical +core. The IA32_QM_CTR register (MSR 0xC8E), used to report these +metrics, may report incorrect system bandwidth for certain RMID values. + +Implication: Due to the errata, system memory bandwidth may not match +what is reported. + +Workaround: MBM total and local readings are corrected according to the +following correction factor table: + ++---------------+---------------+---------------+-----------------+ +|core count |rmid count |rmid threshold |correction factor| ++---------------+---------------+---------------+-----------------+ +|1 |8 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|2 |16 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|3 |24 |15 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|4 |32 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|6 |48 |31 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|7 |56 |47 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|8 |64 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|9 |72 |63 |1.185115 | ++---------------+---------------+---------------+-----------------+ +|10 |80 |63 |1.066553 | ++---------------+---------------+---------------+-----------------+ +|11 |88 |79 |1.454545 | ++---------------+---------------+---------------+-----------------+ +|12 |96 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|13 |104 |95 |1.230769 | ++---------------+---------------+---------------+-----------------+ +|14 |112 |95 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|15 |120 |95 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|16 |128 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|17 |136 |127 |1.254863 | ++---------------+---------------+---------------+-----------------+ +|18 |144 |127 |1.185255 | ++---------------+---------------+---------------+-----------------+ +|19 |152 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|20 |160 |127 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|21 |168 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|22 |176 |159 |1.454334 | ++---------------+---------------+---------------+-----------------+ +|23 |184 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|24 |192 |127 |0.969744 | ++---------------+---------------+---------------+-----------------+ +|25 |200 |191 |1.280246 | ++---------------+---------------+---------------+-----------------+ +|26 |208 |191 |1.230921 | ++---------------+---------------+---------------+-----------------+ +|27 |216 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|28 |224 |191 |1.143118 | ++---------------+---------------+---------------+-----------------+ + +If rmid > rmid threshold, MBM total and local values should be multiplied +by the correction factor. + +See: + +1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update: +http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html + +2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update: +http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf + +3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual: +https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html + +for further information. diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst new file mode 100644 index 000000000000..2bcbffacbed5 --- /dev/null +++ b/Documentation/arch/x86/sgx.rst @@ -0,0 +1,302 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Software Guard eXtensions (SGX) +=============================== + +Overview +======== + +Software Guard eXtensions (SGX) hardware enables for user space applications +to set aside private memory regions of code and data: + +* Privileged (ring-0) ENCLS functions orchestrate the construction of the + regions. +* Unprivileged (ring-3) ENCLU functions allow an application to enter and + execute inside the regions. + +These memory regions are called enclaves. An enclave can be only entered at a +fixed set of entry points. Each entry point can hold a single hardware thread +at a time. While the enclave is loaded from a regular binary file by using +ENCLS functions, only the threads inside the enclave can access its memory. The +region is denied from outside access by the CPU, and encrypted before it leaves +from LLC. + +The support can be determined by + + ``grep sgx /proc/cpuinfo`` + +SGX must both be supported in the processor and enabled by the BIOS. If SGX +appears to be unsupported on a system which has hardware support, ensure +support is enabled in the BIOS. If a BIOS presents a choice between "Enabled" +and "Software Enabled" modes for SGX, choose "Enabled". + +Enclave Page Cache +================== + +SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated +with an enclave. It is contained in a BIOS-reserved region of physical memory. +Unlike pages used for regular memory, pages can only be accessed from outside of +the enclave during enclave construction with special, limited SGX instructions. + +Only a CPU executing inside an enclave can directly access enclave memory. +However, a CPU executing inside an enclave may access normal memory outside the +enclave. + +The kernel manages enclave memory similar to how it treats device memory. + +Enclave Page Types +------------------ + +**SGX Enclave Control Structure (SECS)** + Enclave's address range, attributes and other global data are defined + by this structure. + +**Regular (REG)** + Regular EPC pages contain the code and data of an enclave. + +**Thread Control Structure (TCS)** + Thread Control Structure pages define the entry points to an enclave and + track the execution state of an enclave thread. + +**Version Array (VA)** + Version Array pages contain 512 slots, each of which can contain a version + number for a page evicted from the EPC. + +Enclave Page Cache Map +---------------------- + +The processor tracks EPC pages in a hardware metadata structure called the +*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page +which describes the owning enclave, access rights and page type among the other +things. + +EPCM permissions are separate from the normal page tables. This prevents the +kernel from, for instance, allowing writes to data which an enclave wishes to +remain read-only. EPCM permissions may only impose additional restrictions on +top of normal x86 page permissions. + +For all intents and purposes, the SGX architecture allows the processor to +invalidate all EPCM entries at will. This requires that software be prepared to +handle an EPCM fault at any time. In practice, this can happen on events like +power transitions when the ephemeral key that encrypts enclave memory is lost. + +Application interface +===================== + +Enclave build functions +----------------------- + +In addition to the traditional compiler and linker build process, SGX has a +separate enclave “build” process. Enclaves must be built before they can be +executed (entered). The first step in building an enclave is opening the +**/dev/sgx_enclave** device. Since enclave memory is protected from direct +access, special privileged instructions are then used to copy data into enclave +pages and establish enclave page permissions. + +.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c + :functions: sgx_ioc_enclave_create + sgx_ioc_enclave_add_pages + sgx_ioc_enclave_init + sgx_ioc_enclave_provision + +Enclave runtime management +-------------------------- + +Systems supporting SGX2 additionally support changes to initialized +enclaves: modifying enclave page permissions and type, and dynamically +adding and removing of enclave pages. When an enclave accesses an address +within its address range that does not have a backing page then a new +regular page will be dynamically added to the enclave. The enclave is +still required to run EACCEPT on the new page before it can be used. + +.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c + :functions: sgx_ioc_enclave_restrict_permissions + sgx_ioc_enclave_modify_types + sgx_ioc_enclave_remove_pages + +Enclave vDSO +------------ + +Entering an enclave can only be done through SGX-specific EENTER and ERESUME +functions, and is a non-trivial process. Because of the complexity of +transitioning to and from an enclave, enclaves typically utilize a library to +handle the actual transitions. This is roughly analogous to how glibc +implementations are used by most applications to wrap system calls. + +Another crucial characteristic of enclaves is that they can generate exceptions +as part of their normal operation that need to be handled in the enclave or are +unique to SGX. + +Instead of the traditional signal mechanism to handle these exceptions, SGX +can leverage special exception fixup provided by the vDSO. The kernel-provided +vDSO function wraps low-level transitions to/from the enclave like EENTER and +ERESUME. The vDSO function intercepts exceptions that would otherwise generate +a signal and return the fault information directly to its caller. This avoids +the need to juggle signal handlers. + +.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h + :functions: vdso_sgx_enter_enclave_t + +ksgxd +===== + +SGX support includes a kernel thread called *ksgxd*. + +EPC sanitization +---------------- + +ksgxd is started when SGX initializes. Enclave memory is typically ready +for use when the processor powers on or resets. However, if SGX has been in +use since the reset, enclave pages may be in an inconsistent state. This might +occur after a crash and kexec() cycle, for instance. At boot, ksgxd +reinitializes all enclave pages so that they can be allocated and re-used. + +The sanitization is done by going through EPC address space and applying the +EREMOVE function to each physical page. Some enclave pages like SECS pages have +hardware dependencies on other pages which prevents EREMOVE from functioning. +Executing two EREMOVE passes removes the dependencies. + +Page reclaimer +-------------- + +Similar to the core kswapd, ksgxd, is responsible for managing the +overcommitment of enclave memory. If the system runs out of enclave memory, +*ksgxd* “swaps” enclave memory to normal memory. + +Launch Control +============== + +SGX provides a launch control mechanism. After all enclave pages have been +copied, kernel executes EINIT function, which initializes the enclave. Only after +this the CPU can execute inside the enclave. + +EINIT function takes an RSA-3072 signature of the enclave measurement. The function +checks that the measurement is correct and signature is signed with the key +hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the +SHA256 of a public key. + +Those MSRs can be configured by the BIOS to be either readable or writable. +Linux supports only writable configuration in order to give full control to the +kernel on launch control policy. Before calling EINIT function, the driver sets +the MSRs to match the enclave's signing key. + +Encryption engines +================== + +In order to conceal the enclave data while it is out of the CPU package, the +memory controller has an encryption engine to transparently encrypt and decrypt +enclave memory. + +In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to +encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in +SRAM to maintain integrity of the encrypted data. This provides integrity and +anti-replay protection but does not scale to large memory sizes because the time +required to update the Merkle tree grows logarithmically in relation to the +memory size. + +CPUs starting from Icelake use Total Memory Encryption (TME) in the place of +MEE. TME-based SGX implementations do not have an integrity Merkle tree, which +means integrity and replay-attacks are not mitigated. B, it includes +additional changes to prevent cipher text from being returned and SW memory +aliases from being created. + +DMA to enclave memory is blocked by range registers on both MEE and TME systems +(SDM section 41.10). + +Usage Models +============ + +Shared Library +-------------- + +Sensitive data and the code that acts on it is partitioned from the application +into a separate library. The library is then linked as a DSO which can be loaded +into an enclave. The application can then make individual function calls into +the enclave through special SGX instructions. A run-time within the enclave is +configured to marshal function parameters into and out of the enclave and to +call the correct library function. + +Application Container +--------------------- + +An application may be loaded into a container enclave which is specially +configured with a library OS and run-time which permits the application to run. +The enclave run-time and library OS work together to execute the application +when a thread enters the enclave. + +Impact of Potential Kernel SGX Bugs +=================================== + +EPC leaks +--------- + +When EPC page leaks happen, a WARNING like this is shown in dmesg: + +"EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..." + +This is effectively a kernel use-after-free of an EPC page, and due +to the way SGX works, the bug is detected at freeing. Rather than +adding the page back to the pool of available EPC pages, the kernel +intentionally leaks the page to avoid additional errors in the future. + +When this happens, the kernel will likely soon leak more EPC pages, and +SGX will likely become unusable because the memory available to SGX is +limited. However, while this may be fatal to SGX, the rest of the kernel +is unlikely to be impacted and should continue to work. + +As a result, when this happpens, user should stop running any new +SGX workloads, (or just any new workloads), and migrate all valuable +workloads. Although a machine reboot can recover all EPC memory, the bug +should be reported to Linux developers. + + +Virtual EPC +=========== + +The implementation has also a virtual EPC driver to support SGX enclaves +in guests. Unlike the SGX driver, an EPC page allocated by the virtual +EPC driver doesn't have a specific enclave associated with it. This is +because KVM doesn't track how a guest uses EPC pages. + +As a result, the SGX core page reclaimer doesn't support reclaiming EPC +pages allocated to KVM guests through the virtual EPC driver. If the +user wants to deploy SGX applications both on the host and in guests +on the same machine, the user should reserve enough EPC (by taking out +total virtual EPC size of all SGX VMs from the physical EPC size) for +host SGX applications so they can run with acceptable performance. + +Architectural behavior is to restore all EPC pages to an uninitialized +state also after a guest reboot. Because this state can be reached only +through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc`` +provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction +on all pages in the virtual EPC. + +``EREMOVE`` can fail for three reasons. Userspace must pay attention +to expected failures and handle them as follows: + +1. Page removal will always fail when any thread is running in the + enclave to which the page belongs. In this case the ioctl will + return ``EBUSY`` independent of whether it has successfully removed + some pages; userspace can avoid these failures by preventing execution + of any vcpu which maps the virtual EPC. + +2. Page removal will cause a general protection fault if two calls to + ``EREMOVE`` happen concurrently for pages that refer to the same + "SECS" metadata pages. This can happen if there are concurrent + invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc`` + file descriptor in the guest is closed at the same time as + ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``. + This can be avoided in userspace by serializing calls to the ioctl() + and to close(), but in general it should not be a problem. + +3. Finally, page removal will fail for SECS metadata pages which still + have child pages. Child pages can be removed by executing + ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors + mapped into the guest. This means that the ioctl() must be called + twice: an initial set of calls to remove child pages and a subsequent + set of calls to remove SECS pages. The second set of calls is only + required for those mappings that returned a nonzero value from the + first call. It indicates a bug in the kernel or the userspace client + if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has + a return code other than 0. diff --git a/Documentation/arch/x86/sva.rst b/Documentation/arch/x86/sva.rst new file mode 100644 index 000000000000..33cb05005982 --- /dev/null +++ b/Documentation/arch/x86/sva.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Shared Virtual Addressing (SVA) with ENQCMD +=========================================== + +Background +========== + +Shared Virtual Addressing (SVA) allows the processor and device to use the +same virtual addresses avoiding the need for software to translate virtual +addresses to physical addresses. SVA is what PCIe calls Shared Virtual +Memory (SVM). + +In addition to the convenience of using application virtual addresses +by the device, it also doesn't require pinning pages for DMA. +PCIe Address Translation Services (ATS) along with Page Request Interface +(PRI) allow devices to function much the same way as the CPU handling +application page-faults. For more information please refer to the PCIe +specification Chapter 10: ATS Specification. + +Use of SVA requires IOMMU support in the platform. IOMMU is also +required to support the PCIe features ATS and PRI. ATS allows devices +to cache translations for virtual addresses. The IOMMU driver uses the +mmu_notifier() support to keep the device TLB cache and the CPU cache in +sync. When an ATS lookup fails for a virtual address, the device should +use the PRI in order to request the virtual address to be paged into the +CPU page tables. The device must use ATS again in order the fetch the +translation before use. + +Shared Hardware Workqueues +========================== + +Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits +the use of Shared Work Queues (SWQ) by both applications and Virtual +Machines (VM's). This allows better hardware utilization vs. hard +partitioning resources that could result in under utilization. In order to +allow the hardware to distinguish the context for which work is being +executed in the hardware by SWQ interface, SIOV uses Process Address Space +ID (PASID), which is a 20-bit number defined by the PCIe SIG. + +PASID value is encoded in all transactions from the device. This allows the +IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe +Resource Identifier (RID) which is the Bus/Device/Function. + + +ENQCMD +====== + +ENQCMD is a new instruction on Intel platforms that atomically submits a +work descriptor to a device. The descriptor includes the operation to be +performed, virtual addresses of all parameters, virtual address of a completion +record, and the PASID (process address space ID) of the current process. + +ENQCMD works with non-posted semantics and carries a status back if the +command was accepted by hardware. This allows the submitter to know if the +submission needs to be retried or other device specific mechanisms to +implement fairness or ensure forward progress should be provided. + +ENQCMD is the glue that ensures applications can directly submit commands +to the hardware and also permits hardware to be aware of application context +to perform I/O operations via use of PASID. + +Process Address Space Tagging +============================= + +A new thread-scoped MSR (IA32_PASID) provides the connection between +user processes and the rest of the hardware. When an application first +accesses an SVA-capable device, this MSR is initialized with a newly +allocated PASID. The driver for the device calls an IOMMU-specific API +that sets up the routing for DMA and page-requests. + +For example, the Intel Data Streaming Accelerator (DSA) uses +iommu_sva_bind_device(), which will do the following: + +- Allocate the PASID, and program the process page-table (%cr3 register) in the + PASID context entries. +- Register for mmu_notifier() to track any page-table invalidations to keep + the device TLB in sync. For example, when a page-table entry is invalidated, + the IOMMU propagates the invalidation to the device TLB. This will force any + future access by the device to this virtual address to participate in + ATS. If the IOMMU responds with proper response that a page is not + present, the device would request the page to be paged in via the PCIe PRI + protocol before performing I/O. + +This MSR is managed with the XSAVE feature set as "supervisor state" to +ensure the MSR is updated during context switch. + +PASID Management +================ + +The kernel must allocate a PASID on behalf of each process which will use +ENQCMD and program it into the new MSR to communicate the process identity to +platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests +from this process. When a user submits a work descriptor to a device using the +ENQCMD instruction, the PASID field in the descriptor is auto-filled with the +value from MSR_IA32_PASID. Requests for DMA from the device are also tagged +with the same PASID. The platform IOMMU uses the PASID in the transaction to +perform address translation. The IOMMU APIs setup the corresponding PASID +entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in +x86). + +The MSR must be configured on each logical CPU before any application +thread can interact with a device. Threads that belong to the same +process share the same page tables, thus the same MSR value. + +PASID Life Cycle Management +=========================== + +PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created. + +Only processes that access SVA-capable devices need to have a PASID +allocated. This allocation happens when a process opens/binds an SVA-capable +device but finds no PASID for this process. Subsequent binds of the same, or +other devices will share the same PASID. + +Although the PASID is allocated to the process by opening a device, +it is not active in any of the threads of that process. It's loaded to the +IA32_PASID MSR lazily when a thread tries to submit a work descriptor +to a device using the ENQCMD. + +That first access will trigger a #GP fault because the IA32_PASID MSR +has not been initialized with the PASID value assigned to the process +when the device was opened. The Linux #GP handler notes that a PASID has +been allocated for the process, and so initializes the IA32_PASID MSR +and returns so that the ENQCMD instruction is re-executed. + +On fork(2) or exec(2) the PASID is removed from the process as it no +longer has the same address space that it had when the device was opened. + +On clone(2) the new task shares the same address space, so will be +able to use the PASID allocated to the process. The IA32_PASID is not +preemptively initialized as the PASID value might not be allocated yet or +the kernel does not know whether this thread is going to access the device +and the cleared IA32_PASID MSR reduces context switch overhead by xstate +init optimization. Since #GP faults have to be handled on any threads that +were created before the PASID was assigned to the mm of the process, newly +created threads might as well be treated in a consistent way. + +Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in +all threads in unbind, free the PASID lazily only on mm exit. + +If a process does a close(2) of the device file descriptor and munmap(2) +of the device MMIO portal, then the driver will unbind the device. The +PASID is still marked VALID in the PASID_MSR for any threads in the +process that accessed the device. But this is harmless as without the +MMIO portal they cannot submit new work to the device. + +Relationships +============= + + * Each process has many threads, but only one PASID. + * Devices have a limited number (~10's to 1000's) of hardware workqueues. + The device driver manages allocating hardware workqueues. + * A single mmap() maps a single hardware workqueue as a "portal" and + each portal maps down to a single workqueue. + * For each device with which a process interacts, there must be + one or more mmap()'d portals. + * Many threads within a process can share a single portal to access + a single device. + * Multiple processes can separately mmap() the same portal, in + which case they still share one device hardware workqueue. + * The single process-wide PASID is used by all threads to interact + with all devices. There is not, for instance, a PASID for each + thread or each thread<->device pair. + +FAQ +=== + +* What is SVA/SVM? + +Shared Virtual Addressing (SVA) permits I/O hardware and the processor to +work in the same address space, i.e., to share it. Some call it Shared +Virtual Memory (SVM), but Linux community wanted to avoid confusing it with +POSIX Shared Memory and Secure Virtual Machines which were terms already in +circulation. + +* What is a PASID? + +A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet +(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. +PASID is included in all transactions between the platform and the device. + +* How are shared workqueues different? + +Traditionally, in order for userspace applications to interact with hardware, +there is a separate hardware instance required per process. For example, +consider doorbells as a mechanism of informing hardware about work to process. +Each doorbell is required to be spaced 4k (or page-size) apart for process +isolation. This requires hardware to provision that space and reserve it in +MMIO. This doesn't scale as the number of threads becomes quite large. The +hardware also manages the queue depth for Shared Work Queues (SWQ), and +consumers don't need to track queue depth. If there is no space to accept +a command, the device will return an error indicating retry. + +A user should check Deferrable Memory Write (DMWr) capability on the device +and only submits ENQCMD when the device supports it. In the new DMWr PCIe +terminology, devices need to support DMWr completer capability. In addition, +it requires all switch ports to support DMWr routing and must be enabled by +the PCIe subsystem, much like how PCIe atomic operations are managed for +instance. + +SWQ allows hardware to provision just a single address in the device. When +used with ENQCMD to submit work, the device can distinguish the process +submitting the work since it will include the PASID assigned to that +process. This helps the device scale to a large number of processes. + +* Is this the same as a user space device driver? + +Communicating with the device via the shared workqueue is much simpler +than a full blown user space driver. The kernel driver does all the +initialization of the hardware. User space only needs to worry about +submitting work and processing completions. + +* Is this the same as SR-IOV? + +Single Root I/O Virtualization (SR-IOV) focuses on providing independent +hardware interfaces for virtualizing hardware. Hence, it's required to be +almost fully functional interface to software supporting the traditional +BARs, space for interrupts via MSI-X, its own register layout. +Virtual Functions (VFs) are assisted by the Physical Function (PF) +driver. + +Scalable I/O Virtualization builds on the PASID concept to create device +instances for virtualization. SIOV requires host software to assist in +creating virtual devices; each virtual device is represented by a PASID +along with the bus/device/function of the device. This allows device +hardware to optimize device resource creation and can grow dynamically on +demand. SR-IOV creation and management is very static in nature. Consult +references below for more details. + +* Why not just create a virtual function for each app? + +Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require +duplicated hardware for PCI config space and interrupts such as MSI-X. +Resources such as interrupts have to be hard partitioned between VFs at +creation time, and cannot scale dynamically on demand. The VFs are not +completely independent from the Physical Function (PF). Most VFs require +some communication and assistance from the PF driver. SIOV, in contrast, +creates a software-defined device where all the configuration and control +aspects are mediated via the slow path. The work submission and completion +happen without any mediation. + +* Does this support virtualization? + +ENQCMD can be used from within a guest VM. In these cases, the VMM helps +with setting up a translation table to translate from Guest PASID to Host +PASID. Please consult the ENQCMD instruction set reference for more +details. + +* Does memory need to be pinned? + +When devices support SVA along with platform hardware such as IOMMU +supporting such devices, there is no need to pin memory for DMA purposes. +Devices that support SVA also support other PCIe features that remove the +pinning requirement for memory. + +Device TLB support - Device requests the IOMMU to lookup an address before +use via Address Translation Service (ATS) requests. If the mapping exists +but there is no page allocated by the OS, IOMMU hardware returns that no +mapping exists. + +Device requests the virtual address to be mapped via Page Request +Interface (PRI). Once the OS has successfully completed the mapping, it +returns the response back to the device. The device requests again for +a translation and continues. + +IOMMU works with the OS in managing consistency of page-tables with the +device. When removing pages, it interacts with the device to remove any +device TLB entry that might have been cached before removing the mappings from +the OS. + +References +========== + +VT-D: +https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d + +SIOV: +https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux + +ENQCMD in ISE: +https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf + +DSA spec: +https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst new file mode 100644 index 000000000000..dc8d9fd2c3f7 --- /dev/null +++ b/Documentation/arch/x86/tdx.rst @@ -0,0 +1,261 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +Intel Trust Domain Extensions (TDX) +===================================== + +Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from +the host and physical attacks by isolating the guest register state and by +encrypting the guest memory. In TDX, a special module running in a special +mode sits between the host and the guest and manages the guest/host +separation. + +Since the host cannot directly access guest registers or memory, much +normal functionality of a hypervisor must be moved into the guest. This is +implemented using a Virtualization Exception (#VE) that is handled by the +guest kernel. A #VE is handled entirely inside the guest kernel, but some +require the hypervisor to be consulted. + +TDX includes new hypercall-like mechanisms for communicating from the +guest to the hypervisor or the TDX module. + +New TDX Exceptions +================== + +TDX guests behave differently from bare-metal and traditional VMX guests. +In TDX guests, otherwise normal instructions or memory accesses can cause +#VE or #GP exceptions. + +Instructions marked with an '*' conditionally cause exceptions. The +details for these instructions are discussed below. + +Instruction-based #VE +--------------------- + +- Port I/O (INS, OUTS, IN, OUT) +- HLT +- MONITOR, MWAIT +- WBINVD, INVD +- VMCALL +- RDMSR*,WRMSR* +- CPUID* + +Instruction-based #GP +--------------------- + +- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, + VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON +- ENCLS, ENCLU +- GETSEC +- RSM +- ENQCMD +- RDMSR*,WRMSR* + +RDMSR/WRMSR Behavior +-------------------- + +MSR access behavior falls into three categories: + +- #GP generated +- #VE generated +- "Just works" + +In general, the #GP MSRs should not be used in guests. Their use likely +indicates a bug in the guest. The guest may try to handle the #GP with a +hypercall but it is unlikely to succeed. + +The #VE MSRs are typically able to be handled by the hypervisor. Guests +can make a hypercall to the hypervisor to handle the #VE. + +The "just works" MSRs do not need any special guest handling. They might +be implemented by directly passing through the MSR to the hardware or by +trapping and handling in the TDX module. Other than possibly being slow, +these MSRs appear to function just as they would on bare metal. + +CPUID Behavior +-------------- + +For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID +return values (in guest EAX/EBX/ECX/EDX) are configurable by the +hypervisor. For such cases, the Intel TDX module architecture defines two +virtualization types: + +- Bit fields for which the hypervisor controls the value seen by the guest + TD. + +- Bit fields for which the hypervisor configures the value such that the + guest TD either sees their native value or a value of 0. For these bit + fields, the hypervisor can mask off the native values, but it can not + turn *on* values. + +A #VE is generated for CPUID leaves and sub-leaves that the TDX module does +not know how to handle. The guest kernel may ask the hypervisor for the +value with a hypercall. + +#VE on Memory Accesses +====================== + +There are essentially two classes of TDX memory: private and shared. +Private memory receives full TDX protections. Its content is protected +against access from the hypervisor. Shared memory is expected to be +shared between guest and hypervisor and does not receive full TDX +protections. + +A TD guest is in control of whether its memory accesses are treated as +private or shared. It selects the behavior with a bit in its page table +entries. This helps ensure that a guest does not place sensitive +information in shared memory, exposing it to the untrusted hypervisor. + +#VE on Shared Memory +-------------------- + +Access to shared mappings can cause a #VE. The hypervisor ultimately +controls whether a shared memory access causes a #VE, so the guest must be +careful to only reference shared pages it can safely handle a #VE. For +instance, the guest should be careful not to access shared memory in the +#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET). + +Shared mapping content is entirely controlled by the hypervisor. The guest +should only use shared mappings for communicating with the hypervisor. +Shared mappings must never be used for sensitive memory content like kernel +stacks. A good rule of thumb is that hypervisor-shared memory should be +treated the same as memory mapped to userspace. Both the hypervisor and +userspace are completely untrusted. + +MMIO for virtual devices is implemented as shared memory. The guest must +be careful not to access device MMIO regions unless it is also prepared to +handle a #VE. + +#VE on Private Pages +-------------------- + +An access to private mappings can also cause a #VE. Since all kernel +memory is also private memory, the kernel might theoretically need to +handle a #VE on arbitrary kernel memory accesses. This is not feasible, so +TDX guests ensure that all guest memory has been "accepted" before memory +is used by the kernel. + +A modest amount of memory (typically 512M) is pre-accepted by the firmware +before the kernel runs to ensure that the kernel can start up without +being subjected to a #VE. + +The hypervisor is permitted to unilaterally move accepted pages to a +"blocked" state. However, if it does this, page access will not generate a +#VE. It will, instead, cause a "TD Exit" where the hypervisor is required +to handle the exception. + +Linux #VE handler +================= + +Just like page faults or #GP's, #VE exceptions can be either handled or be +fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. +An unhandled kernel #VE results in an oops. + +Handling nested exceptions on x86 is typically nasty business. A #VE +could be interrupted by an NMI which triggers another #VE and hilarity +ensues. The TDX #VE architecture anticipated this scenario and includes a +feature to make it slightly less nasty. + +During #VE handling, the TDX module ensures that all interrupts (including +NMIs) are blocked. The block remains in place until the guest makes a +TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts +or a new #VE can be delivered. + +However, the guest kernel must still be careful to avoid potential +#VE-triggering actions (discussed above) while this block is in place. +While the block is in place, any #VE is elevated to a double fault (#DF) +which is not recoverable. + +MMIO handling +============= + +In non-TDX VMs, MMIO is usually implemented by giving a guest access to a +mapping which will cause a VMEXIT on access, and then the hypervisor +emulates the access. That is not possible in TDX guests because VMEXIT +will expose the register state to the host. TDX guests don't trust the host +and can't have their state exposed to the host. + +In TDX, MMIO regions typically trigger a #VE exception in the guest. The +guest #VE handler then emulates the MMIO instruction inside the guest and +converts it into a controlled TDCALL to the host, rather than exposing +guest state to the host. + +MMIO addresses on x86 are just special physical addresses. They can +theoretically be accessed with any instruction that accesses memory. +However, the kernel instruction decoding method is limited. It is only +designed to decode instructions like those generated by io.h macros. + +MMIO access via other means (like structure overlays) may result in an +oops. + +Shared Memory Conversions +========================= + +All TDX guest memory starts out as private at boot. This memory can not +be accessed by the hypervisor. However, some kernel users like device +drivers might have a need to share data with the hypervisor. To do this, +memory must be converted between shared and private. This can be +accomplished using some existing memory encryption helpers: + + * set_memory_decrypted() converts a range of pages to shared. + * set_memory_encrypted() converts memory back to private. + +Device drivers are the primary user of shared memory, but there's no need +to touch every driver. DMA buffers and ioremap() do the conversions +automatically. + +TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is +converted to shared on boot. + +For coherent DMA allocation, the DMA buffer gets converted on the +allocation. Check force_dma_unencrypted() for details. + +Attestation +=========== + +Attestation is used to verify the TDX guest trustworthiness to other +entities before provisioning secrets to the guest. For example, a key +server may want to use attestation to verify that the guest is the +desired one before releasing the encryption keys to mount the encrypted +rootfs or a secondary drive. + +The TDX module records the state of the TDX guest in various stages of +the guest boot process using the build time measurement register (MRTD) +and runtime measurement registers (RTMR). Measurements related to the +guest initial configuration and firmware image are recorded in the MRTD +register. Measurements related to initial state, kernel image, firmware +image, command line options, initrd, ACPI tables, etc are recorded in +RTMR registers. For more details, as an example, please refer to TDX +Virtual Firmware design specification, section titled "TD Measurement". +At TDX guest runtime, the attestation process is used to attest to these +measurements. + +The attestation process consists of two steps: TDREPORT generation and +Quote generation. + +TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT (TDREPORT_STRUCT) +from the TDX module. TDREPORT is a fixed-size data structure generated by +the TDX module which contains guest-specific information (such as build +and boot measurements), platform security version, and the MAC to protect +the integrity of the TDREPORT. A user-provided 64-Byte REPORTDATA is used +as input and included in the TDREPORT. Typically it can be some nonce +provided by attestation service so the TDREPORT can be verified uniquely. +More details about the TDREPORT can be found in Intel TDX Module +specification, section titled "TDG.MR.REPORT Leaf". + +After getting the TDREPORT, the second step of the attestation process +is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT +by design can only be verified on the local platform as the MAC key is +bound to the platform. To support remote verification of the TDREPORT, +TDX leverages Intel SGX Quoting Enclave to verify the TDREPORT locally +and convert it to a remotely verifiable Quote. Method of sending TDREPORT +to QE is implementation specific. Attestation software can choose +whatever communication channel available (i.e. vsock or TCP/IP) to +send the TDREPORT to QE and receive the Quote. + +References +========== + +TDX reference material is collected here: + +https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html diff --git a/Documentation/arch/x86/tlb.rst b/Documentation/arch/x86/tlb.rst new file mode 100644 index 000000000000..82ec58ae63a8 --- /dev/null +++ b/Documentation/arch/x86/tlb.rst @@ -0,0 +1,83 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +The TLB +======= + +When the kernel unmaps or modified the attributes of a range of +memory, it has two choices: + + 1. Flush the entire TLB with a two-instruction sequence. This is + a quick operation, but it causes collateral damage: TLB entries + from areas other than the one we are trying to flush will be + destroyed and must be refilled later, at some cost. + 2. Use the invlpg instruction to invalidate a single page at a + time. This could potentially cost many more instructions, but + it is a much more precise operation, causing no collateral + damage to other TLB entries. + +Which method to do depends on a few things: + + 1. The size of the flush being performed. A flush of the entire + address space is obviously better performed by flushing the + entire TLB than doing 2^48/PAGE_SIZE individual flushes. + 2. The contents of the TLB. If the TLB is empty, then there will + be no collateral damage caused by doing the global flush, and + all of the individual flush will have ended up being wasted + work. + 3. The size of the TLB. The larger the TLB, the more collateral + damage we do with a full flush. So, the larger the TLB, the + more attractive an individual flush looks. Data and + instructions have separate TLBs, as do different page sizes. + 4. The microarchitecture. The TLB has become a multi-level + cache on modern CPUs, and the global flushes have become more + expensive relative to single-page flushes. + +There is obviously no way the kernel can know all these things, +especially the contents of the TLB during a given flush. The +sizes of the flush will vary greatly depending on the workload as +well. There is essentially no "right" point to choose. + +You may be doing too many individual invalidations if you see the +invlpg instruction (or instructions _near_ it) show up high in +profiles. If you believe that individual invalidations being +called too often, you can lower the tunable:: + + /sys/kernel/debug/x86/tlb_single_page_flush_ceiling + +This will cause us to do the global flush for more cases. +Lowering it to 0 will disable the use of the individual flushes. +Setting it to 1 is a very conservative setting and it should +never need to be 0 under normal circumstances. + +Despite the fact that a single individual flush on x86 is +guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full +flushes. THP is treated exactly the same as normal memory. + +You might see invlpg inside of flush_tlb_mm_range() show up in +profiles, or you can use the trace_tlb_flush() tracepoints. to +determine how long the flush operations are taking. + +Essentially, you are balancing the cycles you spend doing invlpg +with the cycles that you spend refilling the TLB later. + +You can measure how expensive TLB refills are by using +performance counters and 'perf stat', like this:: + + perf stat -e + cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, + cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, + cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, + cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, + cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, + cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ + +That works on an IvyBridge-era CPU (i5-3320M). Different CPUs +may have differently-named counters, but they should at least +be there in some form. You can use pmu-tools 'ocperf list' +(https://github.com/andikleen/pmu-tools) to find the right +counters for a given CPU. + +.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation" + says: "One execution of INVLPG is sufficient even for a page + with size greater than 4 KBytes." diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst new file mode 100644 index 000000000000..7f58010ea86a --- /dev/null +++ b/Documentation/arch/x86/topology.rst @@ -0,0 +1,234 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +x86 Topology +============ + +This documents and clarifies the main aspects of x86 topology modelling and +representation in the kernel. Update/change when doing changes to the +respective code. + +The architecture-agnostic topology definitions are in +Documentation/admin-guide/cputopology.rst. This file holds x86-specific +differences/specialities which must not necessarily apply to the generic +definitions. Thus, the way to read up on Linux topology on x86 is to start +with the generic one and look at this one in parallel for the x86 specifics. + +Needless to say, code should use the generic functions - this file is *only* +here to *document* the inner workings of x86 topology. + +Started by Thomas Gleixner <tglx@linutronix.de> and Borislav Petkov <bp@alien8.de>. + +The main aim of the topology facilities is to present adequate interfaces to +code which needs to know/query/use the structure of the running system wrt +threads, cores, packages, etc. + +The kernel does not care about the concept of physical sockets because a +socket has no relevance to software. It's an electromechanical component. In +the past a socket always contained a single package (see below), but with the +advent of Multi Chip Modules (MCM) a socket can hold more than one package. So +there might be still references to sockets in the code, but they are of +historical nature and should be cleaned up. + +The topology of a system is described in the units of: + + - packages + - cores + - threads + +Package +======= +Packages contain a number of cores plus shared resources, e.g. DRAM +controller, shared caches etc. + +Modern systems may also use the term 'Die' for package. + +AMD nomenclature for package is 'Node'. + +Package-related topology information in the kernel: + + - cpuinfo_x86.x86_max_cores: + + The number of cores in a package. This information is retrieved via CPUID. + + - cpuinfo_x86.x86_max_dies: + + The number of dies in a package. This information is retrieved via CPUID. + + - cpuinfo_x86.cpu_die_id: + + The physical ID of the die. This information is retrieved via CPUID. + + - cpuinfo_x86.phys_proc_id: + + The physical ID of the package. This information is retrieved via CPUID + and deduced from the APIC IDs of the cores in the package. + + Modern systems use this value for the socket. There may be multiple + packages within a socket. This value may differ from cpu_die_id. + + - cpuinfo_x86.logical_proc_id: + + The logical ID of the package. As we do not trust BIOSes to enumerate the + packages in a consistent way, we introduced the concept of logical package + ID so we can sanely calculate the number of maximum possible packages in + the system and have the packages enumerated linearly. + + - topology_max_packages(): + + The maximum possible number of packages in the system. Helpful for per + package facilities to preallocate per package information. + + - cpu_llc_id: + + A per-CPU variable containing: + + - On Intel, the first APIC ID of the list of CPUs sharing the Last Level + Cache + + - On AMD, the Node ID or Core Complex ID containing the Last Level + Cache. In general, it is a number identifying an LLC uniquely on the + system. + +Cores +===== +A core consists of 1 or more threads. It does not matter whether the threads +are SMT- or CMT-type threads. + +AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses +"core". + +Core-related topology information in the kernel: + + - smp_num_siblings: + + The number of threads in a core. The number of threads in a package can be + calculated by:: + + threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings + + +Threads +======= +A thread is a single scheduling unit. It's the equivalent to a logical Linux +CPU. + +AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always +uses "thread". + +Thread-related topology information in the kernel: + + - topology_core_cpumask(): + + The cpumask contains all online threads in the package to which a thread + belongs. + + The number of online threads is also printed in /proc/cpuinfo "siblings." + + - topology_sibling_cpumask(): + + The cpumask contains all online threads in the core to which a thread + belongs. + + - topology_logical_package_id(): + + The logical package ID to which a thread belongs. + + - topology_physical_package_id(): + + The physical package ID to which a thread belongs. + + - topology_core_id(); + + The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo + "core_id." + + + +System topology examples +======================== + +.. note:: + The alternative Linux CPU enumeration depends on how the BIOS enumerates the + threads. Many BIOSes enumerate all threads 0 first and then all threads 1. + That has the "advantage" that the logical Linux CPU numbers of threads 0 stay + the same whether threads are enabled or not. That's merely an implementation + detail and has no practical impact. + +1) Single Package, Single Core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + +2) Single Package, Dual Core + + a) One thread per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + b) Two threads per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + Alternative enumeration:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 3 + + AMD nomenclature for CMT systems:: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + +4) Dual Package, Dual Core + + a) One thread per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 3 + + b) Two threads per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 4 + -> [thread 1] -> Linux CPU 5 + -> [core 1] -> [thread 0] -> Linux CPU 6 + -> [thread 1] -> Linux CPU 7 + + Alternative enumeration:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 4 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 5 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 6 + -> [core 1] -> [thread 0] -> Linux CPU 3 + -> [thread 1] -> Linux CPU 7 + + AMD nomenclature for CMT systems:: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + + [node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4 + -> [Compute Unit Core 1] -> Linux CPU 5 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6 + -> [Compute Unit Core 1] -> Linux CPU 7 diff --git a/Documentation/arch/x86/tsx_async_abort.rst b/Documentation/arch/x86/tsx_async_abort.rst new file mode 100644 index 000000000000..583ddc185ba2 --- /dev/null +++ b/Documentation/arch/x86/tsx_async_abort.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +TSX Async Abort (TAA) mitigation +================================ + +.. _tsx_async_abort: + +Overview +-------- + +TSX Async Abort (TAA) is a side channel attack on internal buffers in some +Intel processors similar to Microachitectural Data Sampling (MDS). In this +case certain loads may speculatively pass invalid data to dependent operations +when an asynchronous abort condition is pending in a Transactional +Synchronization Extensions (TSX) transaction. This includes loads with no +fault or assist condition. Such loads may speculatively expose stale data from +the same uarch data structures as in MDS, with same scope of exposure i.e. +same-thread and cross-thread. This issue affects all current processors that +support TSX. + +Mitigation strategy +------------------- + +a) TSX disable - one of the mitigations is to disable TSX. A new MSR +IA32_TSX_CTRL will be available in future and current processors after +microcode update which can be used to disable TSX. In addition, it +controls the enumeration of the TSX feature bits (RTM and HLE) in CPUID. + +b) Clear CPU buffers - similar to MDS, clearing the CPU buffers mitigates this +vulnerability. More details on this approach can be found in +:ref:`Documentation/admin-guide/hw-vuln/mds.rst <mds>`. + +Kernel internal mitigation modes +-------------------------------- + + ============= ============================================================ + off Mitigation is disabled. Either the CPU is not affected or + tsx_async_abort=off is supplied on the kernel command line. + + tsx disabled Mitigation is enabled. TSX feature is disabled by default at + bootup on processors that support TSX control. + + verw Mitigation is enabled. CPU is affected and MD_CLEAR is + advertised in CPUID. + + ucode needed Mitigation is enabled. CPU is affected and MD_CLEAR is not + advertised in CPUID. That is mainly for virtualization + scenarios where the host has the updated microcode but the + hypervisor does not expose MD_CLEAR in CPUID. It's a best + effort approach without guarantee. + ============= ============================================================ + +If the CPU is affected and the "tsx_async_abort" kernel command line parameter is +not provided then the kernel selects an appropriate mitigation depending on the +status of RTM and MD_CLEAR CPUID bits. + +Below tables indicate the impact of tsx=on|off|auto cmdline options on state of +TAA mitigation, VERW behavior and TSX feature for various combinations of +MSR_IA32_ARCH_CAPABILITIES bits. + +1. "tsx=off" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=off +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Disabled Yes TSX disabled TSX disabled + 1 X 1 Disabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +2. "tsx=on" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=on +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Enabled Yes None Same as MDS + 1 X 1 Enabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +3. "tsx=auto" + +========= ========= ============ ============ ============== =================== ====================== +MSR_IA32_ARCH_CAPABILITIES bits Result with cmdline tsx=auto +---------------------------------- ------------------------------------------------------------------------- +TAA_NO MDS_NO TSX_CTRL_MSR TSX state VERW can clear TAA mitigation TAA mitigation + after bootup CPU buffers tsx_async_abort=off tsx_async_abort=full +========= ========= ============ ============ ============== =================== ====================== + 0 0 0 HW default Yes Same as MDS Same as MDS + 0 0 1 Invalid case Invalid case Invalid case Invalid case + 0 1 0 HW default No Need ucode update Need ucode update + 0 1 1 Disabled Yes TSX disabled TSX disabled + 1 X 1 Enabled X None needed None needed +========= ========= ============ ============ ============== =================== ====================== + +In the tables, TSX_CTRL_MSR is a new bit in MSR_IA32_ARCH_CAPABILITIES that +indicates whether MSR_IA32_TSX_CTRL is supported. + +There are two control bits in IA32_TSX_CTRL MSR: + + Bit 0: When set it disables the Restricted Transactional Memory (RTM) + sub-feature of TSX (will force all transactions to abort on the + XBEGIN instruction). + + Bit 1: When set it disables the enumeration of the RTM and HLE feature + (i.e. it will make CPUID(EAX=7).EBX{bit4} and + CPUID(EAX=7).EBX{bit11} read as 0). diff --git a/Documentation/arch/x86/usb-legacy-support.rst b/Documentation/arch/x86/usb-legacy-support.rst new file mode 100644 index 000000000000..e01c08b7c981 --- /dev/null +++ b/Documentation/arch/x86/usb-legacy-support.rst @@ -0,0 +1,50 @@ + +.. SPDX-License-Identifier: GPL-2.0 + +================== +USB Legacy support +================== + +:Author: Vojtech Pavlik <vojtech@suse.cz>, January 2004 + + +Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a +feature that allows one to use the USB mouse and keyboard as if they were +their classic PS/2 counterparts. This means one can use an USB keyboard to +type in LILO for example. + +It has several drawbacks, though: + +1) On some machines, the emulated PS/2 mouse takes over even when no USB + mouse is present and a real PS/2 mouse is present. In that case the extra + features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may + not be available. + +2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause + system crashes, because the SMM BIOS is not expecting to be in PAE mode. + The Intel E7505 is a typical machine where this happens. + +3) If AMD64 64-bit mode is enabled, again system crashes often happen, + because the SMM BIOS isn't expecting the CPU to be in 64-bit mode. The + BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit + yet. + +Solutions: + +Problem 1) + can be solved by loading the USB drivers prior to loading the + PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into + the kernel unconditionally, this means the USB drivers need to be + compiled-in, too. + +Problem 2) + can currently only be solved by either disabling HIGHMEM64G + in the kernel config or USB Legacy support in the BIOS. A BIOS update + could help, but so far no such update exists. + +Problem 3) + is usually fixed by a BIOS update. Check the board + manufacturers web site. If an update is not available, disable USB + Legacy support in the BIOS. If this alone doesn't help, try also adding + idle=poll on the kernel command line. The BIOS may be entering the SMM + on the HLT instruction as well. diff --git a/Documentation/arch/x86/x86_64/5level-paging.rst b/Documentation/arch/x86/x86_64/5level-paging.rst new file mode 100644 index 000000000000..71f882f4a173 --- /dev/null +++ b/Documentation/arch/x86/x86_64/5level-paging.rst @@ -0,0 +1,67 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +5-level paging +============== + +Overview +======== +Original x86-64 was limited by 4-level paging to 256 TiB of virtual address +space and 64 TiB of physical address space. We are already bumping into +this limit: some vendors offer servers with 64 TiB of memory today. + +To overcome the limitation upcoming hardware will introduce support for +5-level paging. It is a straight-forward extension of the current page +table structure adding one more layer of translation. + +It bumps the limits to 128 PiB of virtual address space and 4 PiB of +physical address space. This "ought to be enough for anybody" ©. + +QEMU 2.9 and later support 5-level paging. + +Virtual memory layout for 5-level paging is described in +Documentation/arch/x86/x86_64/mm.rst + + +Enabling 5-level paging +======================= +CONFIG_X86_5LEVEL=y enables the feature. + +Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware. +In this case additional page table level -- p4d -- will be folded at +runtime. + +User-space and large virtual address space +========================================== +On x86, 5-level paging enables 56-bit userspace virtual address space. +Not all user space is ready to handle wide addresses. It's known that +at least some JIT compilers use higher bits in pointers to encode their +information. It collides with valid pointers with 5-level paging and +leads to crashes. + +To mitigate this, we are not going to allocate virtual address space +above 47-bit by default. + +But userspace can ask for allocation from full address space by +specifying hint address (with or without MAP_FIXED) above 47-bits. + +If hint address set above 47-bit, but MAP_FIXED is not specified, we try +to look for unmapped area by specified address. If it's already +occupied, we look for unmapped area in *full* address space, rather than +from 47-bit window. + +A high hint address would only affect the allocation in question, but not +any future mmap()s. + +Specifying high hint address on older kernel or on machine without 5-level +paging support is safe. The hint will be ignored and kernel will fall back +to allocation from 47-bit address space. + +This approach helps to easily make application's memory allocator aware +about large address space without manually tracking allocated virtual +address space. + +One important case we need to handle here is interaction with MPX. +MPX (without MAWA extension) cannot handle addresses above 47-bit, so we +need to make sure that MPX cannot be enabled we already have VMA above +the boundary and forbid creating such VMAs once MPX is enabled. diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst new file mode 100644 index 000000000000..137432d34109 --- /dev/null +++ b/Documentation/arch/x86/x86_64/boot-options.rst @@ -0,0 +1,319 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +AMD64 Specific Boot Options +=========================== + +There are many others (usually documented in driver documentation), but +only the AMD64 specific ones are listed here. + +Machine check +============= +Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. + + mce=off + Disable machine check + mce=no_cmci + Disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your hardware + is misbehaving. + Note that you'll get more problems without CMCI than with + due to the shared banks, i.e. you might get duplicated + error logs. + mce=dont_log_ce + Don't make logs for corrected errors. All events reported + as corrected are silently cleared by OS. + This option will be useful if you have no interest in any + of corrected errors. + mce=ignore_ce + Disable features for corrected errors, e.g. polling timer + and CMCI. All events reported as corrected are not cleared + by OS and remained in its error banks. + Usually this disablement is not recommended, however if + there is an agent checking/clearing corrected errors + (e.g. BIOS or hardware monitoring applications), conflicting + with OS's error handling, and you cannot deactivate the agent, + then this option will be a help. + mce=no_lmce + Do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + mce=bootlog + Enable logging of machine checks left over from booting. + Disabled by default on AMD Fam10h and older because some BIOS + leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. On Intel systems it is enabled by default. + mce=nobootlog + Disable boot machine check logging. + mce=monarchtimeout (number) + monarchtimeout: + Sets the time in us to wait for other CPUs on machine checks. 0 + to disable. + mce=bios_cmci_threshold + Don't overwrite the bios-set CMCI threshold. This boot option + prevents Linux from overwriting the CMCI threshold set by the + bios. Without this option, Linux always sets the CMCI + threshold to 1. Enabling this may make memory predictive failure + analysis less effective if the bios sets thresholds for memory + errors since we will not see details for all errors. + mce=recovery + Force-enable recoverable machine check code paths + + nomce (for compatibility with i386) + same as mce=off + + Everything else is in sysfs now. + +APICs +===== + + apic + Use IO-APIC. Default + + noapic + Don't use the IO-APIC. + + disableapic + Don't use the local APIC + + nolapic + Don't use the local APIC (alias for i386 compatibility) + + pirq=... + See Documentation/arch/x86/i386/IO-APIC.rst + + noapictimer + Don't set up the APIC timer + + no_timer_check + Don't check the IO-APIC timer. This can work around + problems with incorrect timer initialization on some boards. + + apicpmtimer + Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally broken. + +Timing +====== + + notsc + Deprecated, use tsc=unstable instead. + + nohpet + Don't use the HPET timer. + +Idle loop +========= + + idle=poll + Don't do power saving in the idle loop using HLT, but poll for rescheduling + event. This will make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor benchmarks. It also + makes some profiling using performance counters more accurate. + Please note that on systems with MONITOR/MWAIT support (like Intel EM64T + CPUs) this option has no performance advantage over the normal idle loop. + It may also interact badly with hyperthreading. + +Rebooting +========= + + reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] + bios + Use the CPU reboot vector for warm reset + warm + Don't set the cold reboot flag + cold + Set the cold reboot flag + triple + Force a triple fault (init) + kbd + Use the keyboard controller. cold reset (default) + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not configured or + the ACPI reset does not work, the reboot path attempts the reset + using the keyboard controller. + efi + Use efi reset_system runtime service. If EFI is not configured or + the EFI reset does not work, the reboot path attempts the reset using + the keyboard controller. + pci + Use a write to the PCI config space register 0xcf9 to trigger reboot. + + Using warm reset will be much faster especially on big memory + systems because the BIOS will not go through the memory check. + Disadvantage is that not all hardware will be completely reinitialized + on reboot so there may be boot problems on some systems. + + reboot=force + Don't stop other CPUs on reboot. This can make reboot more reliable + in some cases. + + reboot=default + There are some built-in platform specific "quirks" - you may see: + "reboot: <name> series board detected. Selecting <type> for reboots." + In the case where you think the quirk is in error (e.g. you have + newer BIOS, or newer board) using this option will ignore the built-in + quirk table, and use the generic default reboot actions. + +NUMA +==== + + numa=off + Only set up a single NUMA node spanning all memory. + + numa=noacpi + Don't parse the SRAT table for NUMA setup + + numa=nohmat + Don't parse the HMAT table for NUMA setup, or soft-reserved memory + partitioning. + + numa=fake=<size>[MG] + If given as a memory unit, fills all system RAM with nodes of + size interleaved over physical nodes. + + numa=fake=<N> + If given as an integer, fills all system RAM with N fake nodes + interleaved over physical nodes. + + numa=fake=<N>U + If given as an integer followed by 'U', it will divide each + physical node into N emulated nodes. + +ACPI +==== + + acpi=off + Don't enable ACPI + acpi=ht + Use ACPI boot table parsing, but don't enable ACPI interpreter + acpi=force + Force ACPI on (currently not needed) + acpi=strict + Disable out of spec ACPI workarounds. + acpi_sci={edge,level,high,low} + Set up ACPI SCI interrupt. + acpi=noirq + Don't route interrupts + acpi=nocmcff + Disable firmware first mode for corrected errors. This + disables parsing the HEST CMC error source to check if + firmware has set the FF flag. This may result in + duplicate corrected error reports. + +PCI +=== + + pci=off + Don't use PCI + pci=conf1 + Use conf1 access. + pci=conf2 + Use conf2 access. + pci=rom + Assign ROMs. + pci=assign-busses + Assign busses + pci=irqmask=MASK + Set PCI interrupt mask to MASK + pci=lastbus=NUMBER + Scan up to NUMBER busses, no matter what the mptable says. + pci=noacpi + Don't use ACPI to set up PCI interrupt routing. + +IOMMU (input/output memory management unit) +=========================================== +Multiple x86-64 PCI-DMA mapping implementations exist, for example: + + 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all + (e.g. because you have < 3 GB memory). + Kernel boot message: "PCI-DMA: Disabling IOMMU" + + 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU. + Kernel boot message: "PCI-DMA: using GART IOMMU" + + 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used + e.g. if there is no hardware IOMMU in the system and it is need because + you have >3GB memory or told the kernel to us it (iommu=soft)) + Kernel boot message: "PCI-DMA: Using software bounce buffering + for IO (SWIOTLB)" + +:: + + iommu=[<size>][,noagp][,off][,force][,noforce] + [,memaper[=<order>]][,merge][,fullflush][,nomerge] + [,noaperture] + +General iommu options: + + off + Don't initialize and use any kind of IOMMU. + noforce + Don't force hardware IOMMU usage when it is not needed. (default). + force + Force the use of the hardware IOMMU even when it is + not actually needed (e.g. because < 3 GB memory). + soft + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + +iommu options only relevant to the AMD GART hardware IOMMU: + + <size> + Set the size of the remapping area in bytes. + allowed + Overwrite iommu off workarounds for specific chipsets. + fullflush + Flush IOMMU on each allocation (default). + nofullflush + Don't use IOMMU fullflush. + memaper[=<order>] + Allocate an own aperture over RAM with size 32MB<<order. + (default: order=1, i.e. 64MB) + merge + Do scatter-gather (SG) merging. Implies "force" (experimental). + nomerge + Don't do scatter-gather (SG) merging. + noaperture + Ask the IOMMU not to touch the aperture for AGP. + noagp + Don't initialize the AGP driver and use full aperture. + panic + Always panic when IOMMU overflows. + +iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU +implementation: + + swiotlb=<slots>[,force,noforce] + <slots> + Prereserve that many 2K slots for the software IO bounce buffering. + force + Force all IO through the software TLB. + noforce + Do not initialize the software TLB. + + +Miscellaneous +============= + + nogbpages + Do not use GB pages for kernel direct mappings. + gbpages + Use GB pages for kernel direct mappings. + + +AMD SEV (Secure Encrypted Virtualization) +========================================= +Options relating to AMD SEV, specified via the following format: + +:: + + sev=option1[,option2] + +The available options are: + + debug + Enable debug messages. diff --git a/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst new file mode 100644 index 000000000000..8d1c91f0c880 --- /dev/null +++ b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================================== +Firmware support for CPU hotplug under Linux/x86-64 +=================================================== + +Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to +know in advance of boot time the maximum number of CPUs that could be plugged +into the system. ACPI 3.0 currently has no official way to supply +this information from the firmware to the operating system. + +In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the +ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC +objects by setting the Enabled bit in the LAPIC object to zero. + +For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable +CPU is already available in the MADT. If the CPU is not available yet +it should have its LAPIC Enabled bit set to 0. Linux will use the number +of disabled LAPICs to compute the maximum number of future CPUs. + +In the worst case the user can overwrite this choice using a command line +option (additional_cpus=...), but it is recommended to supply the correct +number (or a reasonable approximation of it, with erring towards more not less) +in the MADT to avoid manual configuration. diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst new file mode 100644 index 000000000000..ba74617d4999 --- /dev/null +++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Fake NUMA For CPUSets +===================== + +:Author: David Rientjes <rientjes@cs.washington.edu> + +Using numa=fake and CPUSets for Resource Management + +This document describes how the numa=fake x86_64 command-line option can be used +in conjunction with cpusets for coarse memory management. Using this feature, +you can create fake NUMA nodes that represent contiguous chunks of memory and +assign them to cpusets and their attached tasks. This is a way of limiting the +amount of system memory that are available to a certain class of tasks. + +For more information on the features of cpusets, see +Documentation/admin-guide/cgroup-v1/cpusets.rst. +There are a number of different configurations you can use for your needs. For +more information on the numa=fake command line option and its various ways of +configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst. + +For the purposes of this introduction, we'll assume a very primitive NUMA +emulation setup of "numa=fake=4*512,". This will split our system memory into +four equal chunks of 512M each that we can now use to assign to cpusets. As +you become more familiar with using this combination for resource control, +you'll determine a better setup to minimize the number of nodes you have to deal +with. + +A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:: + + Faking node 0 at 0000000000000000-0000000020000000 (512MB) + Faking node 1 at 0000000020000000-0000000040000000 (512MB) + Faking node 2 at 0000000040000000-0000000060000000 (512MB) + Faking node 3 at 0000000060000000-0000000080000000 (512MB) + ... + On node 0 totalpages: 130975 + On node 1 totalpages: 131072 + On node 2 totalpages: 131072 + On node 3 totalpages: 131072 + +Now following the instructions for mounting the cpusets filesystem from +Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory +address spaces) to individual cpusets:: + + [root@xroads /]# mkdir exampleset + [root@xroads /]# mount -t cpuset none exampleset + [root@xroads /]# mkdir exampleset/ddset + [root@xroads /]# cd exampleset/ddset + [root@xroads /exampleset/ddset]# echo 0-1 > cpus + [root@xroads /exampleset/ddset]# echo 0-1 > mems + +Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for +memory allocations (1G). + +You can now assign tasks to these cpusets to limit the memory resources +available to them according to the fake nodes assigned as mems:: + + [root@xroads /exampleset/ddset]# echo $$ > tasks + [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G + [1] 13425 + +Notice the difference between the system memory usage as reported by +/proc/meminfo between the restricted cpuset case above and the unrestricted +case (i.e. running the same 'dd' command without assigning it to a fake NUMA +cpuset): + + ======== ============ ========== + Name Unrestricted Restricted + ======== ============ ========== + MemTotal 3091900 kB 3091900 kB + MemFree 42113 kB 1513236 kB + ======== ============ ========== + +This allows for coarse memory management for the tasks you assign to particular +cpusets. Since cpusets can form a hierarchy, you can create some pretty +interesting combinations of use-cases for various classes of tasks for your +memory management needs. diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst new file mode 100644 index 000000000000..50960e09e1f6 --- /dev/null +++ b/Documentation/arch/x86/x86_64/fsgs.rst @@ -0,0 +1,199 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Using FS and GS segments in user space applications +=================================================== + +The x86 architecture supports segmentation. Instructions which access +memory can use segment register based addressing mode. The following +notation is used to address a byte within a segment: + + Segment-register:Byte-address + +The segment base address is added to the Byte-address to compute the +resulting virtual address which is accessed. This allows to access multiple +instances of data with the identical Byte-address, i.e. the same code. The +selection of a particular instance is purely based on the base-address in +the segment register. + +In 32-bit mode the CPU provides 6 segments, which also support segment +limits. The limits can be used to enforce address space protections. + +In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is +always 0 to provide a full 64bit address space. The FS and GS segments are +still functional in 64-bit mode. + +Common FS and GS usage +------------------------------ + +The FS segment is commonly used to address Thread Local Storage (TLS). FS +is usually managed by runtime code or a threading library. Variables +declared with the '__thread' storage class specifier are instantiated per +thread and the compiler emits the FS: address prefix for accesses to these +variables. Each thread has its own FS base address so common code can be +used without complex address offset calculations to access the per thread +instances. Applications should not use FS for other purposes when they use +runtimes or threading libraries which manage the per thread FS. + +The GS segment has no common use and can be used freely by +applications. GCC and Clang support GS based addressing via address space +identifiers. + +Reading and writing the FS/GS base address +------------------------------------------ + +There exist two mechanisms to read and write the FS/GS base address: + + - the arch_prctl() system call + + - the FSGSBASE instruction family + +Accessing FS/GS base with arch_prctl() +-------------------------------------- + + The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all + kernel versions. + + Reading the base: + + arch_prctl(ARCH_GET_FS, &fsbase); + arch_prctl(ARCH_GET_GS, &gsbase); + + Writing the base: + + arch_prctl(ARCH_SET_FS, fsbase); + arch_prctl(ARCH_SET_GS, gsbase); + + The ARCH_SET_GS prctl may be disabled depending on kernel configuration + and security settings. + +Accessing FS/GS base with the FSGSBASE instructions +--------------------------------------------------- + + With the Ivy Bridge CPU generation Intel introduced a new set of + instructions to access the FS and GS base registers directly from user + space. These instructions are also supported on AMD Family 17H CPUs. The + following instructions are available: + + =============== =========================== + RDFSBASE %reg Read the FS base register + RDGSBASE %reg Read the GS base register + WRFSBASE %reg Write the FS base register + WRGSBASE %reg Write the GS base register + =============== =========================== + + The instructions avoid the overhead of the arch_prctl() syscall and allow + more flexible usage of the FS/GS addressing modes in user space + applications. This does not prevent conflicts between threading libraries + and runtimes which utilize FS and applications which want to use it for + their own purpose. + +FSGSBASE instructions enablement +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If + available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs. + + The availability of the instructions does not enable them + automatically. The kernel has to enable them explicitly in CR4. The + reason for this is that older kernels make assumptions about the values in + the GS register and enforce them when GS base is set via + arch_prctl(). Allowing user space to write arbitrary values to GS base + would violate these assumptions and cause malfunction. + + On kernels which do not enable FSGSBASE the execution of the FSGSBASE + instructions will fault with a #UD exception. + + The kernel provides reliable information about the enabled state in the + ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the + kernel has FSGSBASE instructions enabled and applications can use them. + The following code example shows how this detection works:: + + #include <sys/auxv.h> + #include <elf.h> + + /* Will be eventually in asm/hwcap.h */ + #ifndef HWCAP2_FSGSBASE + #define HWCAP2_FSGSBASE (1 << 1) + #endif + + .... + + unsigned val = getauxval(AT_HWCAP2); + + if (val & HWCAP2_FSGSBASE) + printf("FSGSBASE enabled\n"); + +FSGSBASE instructions compiler support +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE +instructions. Clang 5 supports them as well. + + =================== =========================== + _readfsbase_u64() Read the FS base register + _readfsbase_u64() Read the GS base register + _writefsbase_u64() Write the FS base register + _writegsbase_u64() Write the GS base register + =================== =========================== + +To utilize these instrinsics <immintrin.h> must be included in the source +code and the compiler option -mfsgsbase has to be added. + +Compiler support for FS/GS based addressing +------------------------------------------- + +GCC version 6 and newer provide support for FS/GS based addressing via +Named Address Spaces. GCC implements the following address space +identifiers for x86: + + ========= ==================================== + __seg_fs Variable is addressed relative to FS + __seg_gs Variable is addressed relative to GS + ========= ==================================== + +The preprocessor symbols __SEG_FS and __SEG_GS are defined when these +address spaces are supported. Code which implements fallback modes should +check whether these symbols are defined. Usage example:: + + #ifdef __SEG_GS + + long data0 = 0; + long data1 = 1; + + long __seg_gs *ptr; + + /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */ + .... + + /* Set GS base to point to data0 */ + _writegsbase_u64(&data0); + + /* Access offset 0 of GS */ + ptr = 0; + printf("data0 = %ld\n", *ptr); + + /* Set GS base to point to data1 */ + _writegsbase_u64(&data1); + /* ptr still addresses offset 0! */ + printf("data1 = %ld\n", *ptr); + + +Clang does not provide the GCC address space identifiers, but it provides +address spaces via an attribute based mechanism in Clang 2.6 and newer +versions: + + ==================================== ===================================== + __attribute__((address_space(256)) Variable is addressed relative to GS + __attribute__((address_space(257)) Variable is addressed relative to FS + ==================================== ===================================== + +FS/GS based addressing with inline assembly +------------------------------------------- + +In case the compiler does not support address spaces, inline assembly can +be used for FS/GS based addressing mode:: + + mov %fs:offset, %reg + mov %gs:offset, %reg + + mov %reg, %fs:offset + mov %reg, %gs:offset diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst new file mode 100644 index 000000000000..a56070fc8e77 --- /dev/null +++ b/Documentation/arch/x86/x86_64/index.rst @@ -0,0 +1,17 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +x86_64 Support +============== + +.. toctree:: + :maxdepth: 2 + + boot-options + uefi + mm + 5level-paging + fake-numa-for-cpusets + cpu-hotplug-spec + machinecheck + fsgs diff --git a/Documentation/arch/x86/x86_64/machinecheck.rst b/Documentation/arch/x86/x86_64/machinecheck.rst new file mode 100644 index 000000000000..cea12ee97200 --- /dev/null +++ b/Documentation/arch/x86/x86_64/machinecheck.rst @@ -0,0 +1,33 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Configurable sysfs parameters for the x86-64 machine check code +=============================================================== + +Machine checks report internal hardware error conditions detected +by the CPU. Uncorrected errors typically cause a machine check +(often with panic), corrected ones cause a machine check log entry. + +Machine checks are organized in banks (normally associated with +a hardware subsystem) and subevents in a bank. The exact meaning +of the banks and subevent is CPU specific. + +mcelog knows how to decode them. + +When you see the "Machine check errors logged" message in the system +log then mcelog should run to collect and decode machine check entries +from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. + +Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN +(N = CPU number). + +The directory contains some configurable entries. See +Documentation/ABI/testing/sysfs-mce for more details. + +TBD document entries for AMD threshold interrupt configuration + +For more details about the x86 machine check architecture +see the Intel and AMD architecture manuals from their developer websites. + +For more details about the architecture +see http://one.firstfloor.org/~andi/mce.pdf diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst new file mode 100644 index 000000000000..35e5e18c83d0 --- /dev/null +++ b/Documentation/arch/x86/x86_64/mm.rst @@ -0,0 +1,157 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Memory Management +================= + +Complete virtual memory map with 4-level page tables +==================================================== + +.. note:: + + - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down + from the top of the 64-bit address space. It's easier to understand the layout + when seen both in absolute addresses and in distance-from-top notation. + + For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the + 64-bit address space (ffffffffffffffff). + + Note that as we get closer to the top of the address space, the notation changes + from TB to GB and then MB/KB. + + - "16M TB" might look weird at first sight, but it's an easier way to visualize size + notation than "16 EB", which few will recognize at first sight as 16 exabytes. + It also shows it nicely how incredibly large 64-bit address space is. + +:: + + ======================================================================================================================== + Start addr | Offset | End addr | Size | VM area description + ======================================================================================================================== + | | | | + 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -128 TB + | | | | starting offset of kernel mappings. + __________________|____________|__________________|_________|___________________________________________________________ + | + | Kernel-space virtual memory, shared between all processes: + ____________________________________________________________|___________________________________________________________ + | | | | + ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor + ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI + ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) + ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole + ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base) + ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole + ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base) + ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole + ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory + __________________|____________|__________________|_________|____________________________________________________________ + | + | Identical layout to the 56-bit one from here on: + ____________________________________________________________|____________________________________________________________ + | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks + ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole + ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space + ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole + ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 + ffffffff80000000 |-2048 MB | | | + ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space + ffffffffff000000 | -16 MB | | | + FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset + ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI + ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole + __________________|____________|__________________|_________|___________________________________________________________ + + +Complete virtual memory map with 5-level page tables +==================================================== + +.. note:: + + - With 56-bit addresses, user-space memory gets expanded by a factor of 512x, + from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting + offset and many of the regions expand to support the much larger physical + memory supported. + +:: + + ======================================================================================================================== + Start addr | Offset | End addr | Size | VM area description + ======================================================================================================================== + | | | | + 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -64 PB + | | | | starting offset of kernel mappings. + __________________|____________|__________________|_________|___________________________________________________________ + | + | Kernel-space virtual memory, shared between all processes: + ____________________________________________________________|___________________________________________________________ + | | | | + ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor + ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI + ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base) + ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole + ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base) + ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole + ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base) + ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole + ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory + __________________|____________|__________________|_________|____________________________________________________________ + | + | Identical layout to the 47-bit one from here on: + ____________________________________________________________|____________________________________________________________ + | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks + ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole + ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space + ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole + ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 + ffffffff80000000 |-2048 MB | | | + ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space + ffffffffff000000 | -16 MB | | | + FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset + ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI + ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole + __________________|____________|__________________|_________|___________________________________________________________ + +Architecture defines a 64-bit virtual address. Implementations can support +less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 +through to the most-significant implemented bit are sign extended. +This causes hole between user space and kernel addresses if you interpret them +as unsigned. + +The direct mapping covers all memory in the system up to the highest +memory address (this means in some cases it can also include PCI memory +holes). + +We map EFI runtime services in the 'efi_pgd' PGD in a 64GB large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. + +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all +physical memory, vmalloc/ioremap space and virtual memory map are randomized. +Their order is preserved but their base will be offset early at boot time. + +Be very careful vs. KASLR when changing anything here. The KASLR address +range must not overlap with anything except the KASAN shadow area, which is +correct as KASAN disables KASLR. + +For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB +hole: ffffffffffff4111 diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst new file mode 100644 index 000000000000..fbc30c9a071d --- /dev/null +++ b/Documentation/arch/x86/x86_64/uefi.rst @@ -0,0 +1,58 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +General note on [U]EFI x86_64 support +===================================== + +The nomenclature EFI and UEFI are used interchangeably in this document. + +Although the tools below are _not_ needed for building the kernel, +the needed bootloader support and associated tools for x86_64 platforms +with EFI firmware and specifications are listed below. + +1. UEFI specification: http://www.uefi.org + +2. Booting Linux kernel on UEFI x86_64 platform requires bootloader + support. Elilo with x86_64 support can be used. + +3. x86_64 platform with EFI/UEFI firmware. + +Mechanics +--------- + +- Build the kernel with the following configuration:: + + CONFIG_FB_EFI=y + CONFIG_FRAMEBUFFER_CONSOLE=y + + If EFI runtime services are expected, the following configuration should + be selected:: + + CONFIG_EFI=y + CONFIG_EFIVAR_FS=y or m # optional + +- Create a VFAT partition on the disk +- Copy the following to the VFAT partition: + + elilo bootloader with x86_64 support, elilo configuration file, + kernel image built in first step and corresponding + initrd. Instructions on building elilo and its dependencies + can be found in the elilo sourceforge project. + +- Boot to EFI shell and invoke elilo choosing the kernel image built + in first step. +- If some or all EFI runtime services don't work, you can try following + kernel command line parameters to turn off some or all EFI runtime + services. + + noefi + turn off all EFI runtime services + reboot_type=k + turn off EFI reboot runtime service + +- If the EFI memory map has additional entries not in the E820 map, + you can include those entries in the kernels memory map of available + physical RAM by using the following kernel command line parameter. + + add_efi_memmap + include EFI memory map of available physical RAM diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst new file mode 100644 index 000000000000..ae5c69e48b11 --- /dev/null +++ b/Documentation/arch/x86/xstate.rst @@ -0,0 +1,174 @@ +Using XSTATE features in user space applications +================================================ + +The x86 architecture supports floating-point extensions which are +enumerated via CPUID. Applications consult CPUID and use XGETBV to +evaluate which features have been enabled by the kernel XCR0. + +Up to AVX-512 and PKRU states, these features are automatically enabled by +the kernel if available. Features like AMX TILE_DATA (XSTATE component 18) +are enabled by XCR0 as well, but the first use of related instruction is +trapped by the kernel because by default the required large XSTATE buffers +are not allocated automatically. + +The purpose for dynamic features +-------------------------------- + +Legacy userspace libraries often have hard-coded, static sizes for +alternate signal stacks, often using MINSIGSTKSZ which is typically 2KB. +That stack must be able to store at *least* the signal frame that the +kernel sets up before jumping into the signal handler. That signal frame +must include an XSAVE buffer defined by the CPU. + +However, that means that the size of signal stacks is dynamic, not static, +because different CPUs have differently-sized XSAVE buffers. A compiled-in +size of 2KB with existing applications is too small for new CPU features +like AMX. Instead of universally requiring larger stack, with the dynamic +enabling, the kernel can enforce userspace applications to have +properly-sized altstacks. + +Using dynamically enabled XSTATE features in user space applications +-------------------------------------------------------------------- + +The kernel provides an arch_prctl(2) based mechanism for applications to +request the usage of such features. The arch_prctl(2) options related to +this are: + +-ARCH_GET_XCOMP_SUPP + + arch_prctl(ARCH_GET_XCOMP_SUPP, &features); + + ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of + type uint64_t. The second argument is a pointer to that storage. + +-ARCH_GET_XCOMP_PERM + + arch_prctl(ARCH_GET_XCOMP_PERM, &features); + + ARCH_GET_XCOMP_PERM stores the features for which the userspace process + has permission in userspace storage of type uint64_t. The second argument + is a pointer to that storage. + +-ARCH_REQ_XCOMP_PERM + + arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr); + + ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled + feature or a feature set. A feature set can be mapped to a facility, e.g. + AMX, and can require one or more XSTATE components to be enabled. + + The feature argument is the number of the highest XSTATE component which + is required for a facility to work. + +When requesting permission for a feature, the kernel checks the +availability. The kernel ensures that sigaltstacks in the process's tasks +are large enough to accommodate the resulting large signal frame. It +enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent +sigaltstack(2) calls. If an installed sigaltstack is smaller than the +resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also, +sigaltstack(2) results in -ENOMEM if the requested altstack is too small +for the permitted features. + +Permission, when granted, is valid per process. Permissions are inherited +on fork(2) and cleared on exec(3). + +The first use of an instruction related to a dynamically enabled feature is +trapped by the kernel. The trap handler checks whether the process has +permission to use the feature. If the process has no permission then the +kernel sends SIGILL to the application. If the process has permission then +the handler allocates a larger xstate buffer for the task so the large +state can be context switched. In the unlikely cases that the allocation +fails, the kernel sends SIGSEGV. + +AMX TILE_DATA enabling example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below is the example of how userspace applications enable +TILE_DATA dynamically: + + 1. The application first needs to query the kernel for AMX + support:: + + #include <asm/prctl.h> + #include <sys/syscall.h> + #include <stdio.h> + #include <unistd.h> + + #ifndef ARCH_GET_XCOMP_SUPP + #define ARCH_GET_XCOMP_SUPP 0x1021 + #endif + + #ifndef ARCH_XCOMP_TILECFG + #define ARCH_XCOMP_TILECFG 17 + #endif + + #ifndef ARCH_XCOMP_TILEDATA + #define ARCH_XCOMP_TILEDATA 18 + #endif + + #define MASK_XCOMP_TILE ((1 << ARCH_XCOMP_TILECFG) | \ + (1 << ARCH_XCOMP_TILEDATA)) + + unsigned long features; + long rc; + + ... + + rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features); + + if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE) + printf("AMX is available.\n"); + + 2. After that, determining support for AMX, an application must + explicitly ask permission to use it:: + + #ifndef ARCH_REQ_XCOMP_PERM + #define ARCH_REQ_XCOMP_PERM 0x1023 + #endif + + ... + + rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA); + + if (!rc) + printf("AMX is ready for use.\n"); + +Note this example does not include the sigaltstack preparation. + +Dynamic features in signal frames +--------------------------------- + +Dynamcally enabled features are not written to the signal frame upon signal +entry if the feature is in its initial configuration. This differs from +non-dynamic features which are always written regardless of their +configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV +field to determine if a features was written. + +Dynamic features for virtual machines +------------------------------------- + +The permission for the guest state component needs to be managed separately +from the host, as they are exclusive to each other. A coupled of options +are extended to control the guest permission: + +-ARCH_GET_XCOMP_GUEST_PERM + + arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features); + + ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. So it + provides the same semantics and functionality but for the guest + components. + +-ARCH_REQ_XCOMP_GUEST_PERM + + arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr); + + ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM. It has the + same semantics for the guest permission. While providing a similar + functionality, this comes with a constraint. Permission is frozen when the + first VCPU is created. Any attempt to change permission after that point + is going to be rejected. So, the permission has to be requested before the + first VCPU creation. + +Note that some VMMs may have already established a set of supported state +components. These options are not presumed to support any particular VMM. diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst new file mode 100644 index 000000000000..45aa9cceb4f1 --- /dev/null +++ b/Documentation/arch/x86/zero-page.rst @@ -0,0 +1,47 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +Zero Page +========= +The additional fields in struct boot_params as a part of 32-bit boot +protocol of kernel. These should be filled by bootloader or 16-bit +real-mode setup code of the kernel. References/settings to it mainly +are in:: + + arch/x86/include/uapi/asm/bootparam.h + +=========== ===== ======================= ================================================= +Offset/Size Proto Name Meaning + +000/040 ALL screen_info Text mode or frame buffer information + (struct screen_info) +040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info) +058/008 ALL tboot_addr Physical address of tboot shared page +060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information + (struct ist_info) +070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table +080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! +090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!! +0A0/010 ALL sys_desc_table System description table (struct sys_desc_table), + OBSOLETE!! +0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends +0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits +0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits +0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits +13C/004 ALL cc_blob_address Physical address of Confidential Computing blob +140/080 ALL edid_info Video mode setup (struct edid_info) +1C0/020 ALL efi_info EFI 32 information (struct efi_info) +1E0/004 ALL alt_mem_k Alternative mem check, in KB +1E4/004 ALL scratch Scratch field for the kernel setup code +1E8/001 ALL e820_entries Number of entries in e820_table (below) +1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below) +1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer + (below) +1EB/001 ALL kbd_status Numlock is enabled +1EC/001 ALL secure_boot Secure boot is enabled in the firmware +1EF/001 ALL sentinel Used to detect broken bootloaders +290/040 ALL edd_mbr_sig_buffer EDD MBR signatures +2D0/A00 ALL e820_table E820 memory map table + (array of struct e820_entry) +D00/1EC ALL eddbuf EDD data (array of struct edd_info) +=========== ===== ======================= ================================================= diff --git a/Documentation/arch/xtensa/atomctl.rst b/Documentation/arch/xtensa/atomctl.rst new file mode 100644 index 000000000000..1ecbd0ba9a2e --- /dev/null +++ b/Documentation/arch/xtensa/atomctl.rst @@ -0,0 +1,51 @@ +=========================================== +Atomic Operation Control (ATOMCTL) Register +=========================================== + +We Have Atomic Operation Control (ATOMCTL) Register. +This register determines the effect of using a S32C1I instruction +with various combinations of: + + 1. With and without an Coherent Cache Controller which + can do Atomic Transactions to the memory internally. + + 2. With and without An Intelligent Memory Controller which + can do Atomic Transactions itself. + +The Core comes up with a default value of for the three types of cache ops:: + + 0x28: (WB: Internal, WT: Internal, BY:Exception) + +On the FPGA Cards we typically simulate an Intelligent Memory controller +which can implement RCW transactions. For FPGA cards with an External +Memory controller we let it to the atomic operations internally while +doing a Cached (WB) transaction and use the Memory RCW for un-cached +operations. + +For systems without an coherent cache controller, non-MX, we always +use the memory controllers RCW, thought non-MX controlers likely +support the Internal Operation. + +CUSTOMER-WARNING: + Virtually all customers buy their memory controllers from vendors that + don't support atomic RCW memory transactions and will likely want to + configure this register to not use RCW. + +Developers might find using RCW in Bypass mode convenient when testing +with the cache being bypassed; for example studying cache alias problems. + +See Section 4.3.12.4 of ISA; Bits:: + + WB WT BY + 5 4 | 3 2 | 1 0 + +========= ================== ================== =============== + 2 Bit + Field + Values WB - Write Back WT - Write Thru BY - Bypass +========= ================== ================== =============== + 0 Exception Exception Exception + 1 RCW Transaction RCW Transaction RCW Transaction + 2 Internal Operation Internal Operation Reserved + 3 Reserved Reserved Reserved +========= ================== ================== =============== diff --git a/Documentation/arch/xtensa/booting.rst b/Documentation/arch/xtensa/booting.rst new file mode 100644 index 000000000000..e1b83707e5b6 --- /dev/null +++ b/Documentation/arch/xtensa/booting.rst @@ -0,0 +1,22 @@ +===================================== +Passing boot parameters to the kernel +===================================== + +Boot parameters are represented as a TLV list in the memory. Please see +arch/xtensa/include/asm/bootparam.h for definition of the bp_tag structure and +tag value constants. First entry in the list must have type BP_TAG_FIRST, last +entry must have type BP_TAG_LAST. The address of the first list entry is +passed to the kernel in the register a2. The address type depends on MMU type: + +- For configurations without MMU, with region protection or with MPU the + address must be the physical address. +- For configurations with region translarion MMU or with MMUv3 and CONFIG_MMU=n + the address must be a valid address in the current mapping. The kernel will + not change the mapping on its own. +- For configurations with MMUv2 the address must be a virtual address in the + default virtual mapping (0xd0000000..0xffffffff). +- For configurations with MMUv3 and CONFIG_MMU=y the address may be either a + virtual or physical address. In either case it must be within the default + virtual mapping. It is considered physical if it is within the range of + physical addresses covered by the default KSEG mapping (XCHAL_KSEG_PADDR.. + XCHAL_KSEG_PADDR + XCHAL_KSEG_SIZE), otherwise it is considered virtual. diff --git a/Documentation/arch/xtensa/features.rst b/Documentation/arch/xtensa/features.rst new file mode 100644 index 000000000000..6b92c7bfa19d --- /dev/null +++ b/Documentation/arch/xtensa/features.rst @@ -0,0 +1,3 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. kernel-feat:: $srctree/Documentation/features xtensa diff --git a/Documentation/arch/xtensa/index.rst b/Documentation/arch/xtensa/index.rst new file mode 100644 index 000000000000..69952446a9be --- /dev/null +++ b/Documentation/arch/xtensa/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +Xtensa Architecture +=================== + +.. toctree:: + :maxdepth: 1 + + atomctl + booting + mmu + + features diff --git a/Documentation/arch/xtensa/mmu.rst b/Documentation/arch/xtensa/mmu.rst new file mode 100644 index 000000000000..450573afa31a --- /dev/null +++ b/Documentation/arch/xtensa/mmu.rst @@ -0,0 +1,198 @@ +============================= +MMUv3 initialization sequence +============================= + +The code in the initialize_mmu macro sets up MMUv3 memory mapping +identically to MMUv2 fixed memory mapping. Depending on +CONFIG_INITIALIZE_XTENSA_MMU_INSIDE_VMLINUX symbol this code is +located in addresses it was linked for (symbol undefined), or not +(symbol defined), so it needs to be position-independent. + +The code has the following assumptions: + + - This code fragment is run only on an MMU v3. + - TLBs are in their reset state. + - ITLBCFG and DTLBCFG are zero (reset state). + - RASID is 0x04030201 (reset state). + - PS.RING is zero (reset state). + - LITBASE is zero (reset state, PC-relative literals); required to be PIC. + +TLB setup proceeds along the following steps. + + Legend: + + - VA = virtual address (two upper nibbles of it); + - PA = physical address (two upper nibbles of it); + - pc = physical range that contains this code; + +After step 2, we jump to virtual address in the range 0x40000000..0x5fffffff +or 0x00000000..0x1fffffff, depending on whether the kernel was loaded below +0x40000000 or above. That address corresponds to next instruction to execute +in this code. After step 4, we jump to intended (linked) address of this code. +The scheme below assumes that the kernel is loaded below 0x40000000. + + ====== ===== ===== ===== ===== ====== ===== ===== + - Step0 Step1 Step2 Step3 Step4 Step5 + + VA PA PA PA PA VA PA PA + ====== ===== ===== ===== ===== ====== ===== ===== + E0..FF -> E0 -> E0 -> E0 F0..FF -> F0 -> F0 + C0..DF -> C0 -> C0 -> C0 E0..EF -> F0 -> F0 + A0..BF -> A0 -> A0 -> A0 D8..DF -> 00 -> 00 + 80..9F -> 80 -> 80 -> 80 D0..D7 -> 00 -> 00 + 60..7F -> 60 -> 60 -> 60 + 40..5F -> 40 -> pc -> pc 40..5F -> pc + 20..3F -> 20 -> 20 -> 20 + 00..1F -> 00 -> 00 -> 00 + ====== ===== ===== ===== ===== ====== ===== ===== + +The default location of IO peripherals is above 0xf0000000. This may be changed +using a "ranges" property in a device tree simple-bus node. See the Devicetree +Specification, section 4.5 for details on the syntax and semantics of +simple-bus nodes. The following limitations apply: + +1. Only top level simple-bus nodes are considered + +2. Only one (first) simple-bus node is considered + +3. Empty "ranges" properties are not supported + +4. Only the first triplet in the "ranges" property is considered + +5. The parent-bus-address value is rounded down to the nearest 256MB boundary + +6. The IO area covers the entire 256MB segment of parent-bus-address; the + "ranges" triplet length field is ignored + + +MMUv3 address space layouts. +============================ + +Default MMUv2-compatible layout:: + + Symbol VADDR Size + +------------------+ + | Userspace | 0x00000000 TASK_SIZE + +------------------+ 0x40000000 + +------------------+ + | Page table | XCHAL_PAGE_TABLE_VADDR 0x80000000 XCHAL_PAGE_TABLE_SIZE + +------------------+ + | KASAN shadow map | KASAN_SHADOW_START 0x80400000 KASAN_SHADOW_SIZE + +------------------+ 0x8e400000 + +------------------+ + | VMALLOC area | VMALLOC_START 0xc0000000 128MB - 64KB + +------------------+ VMALLOC_END + +------------------+ + | Cache aliasing | TLBTEMP_BASE_1 0xc8000000 DCACHE_WAY_SIZE + | remap area 1 | + +------------------+ + | Cache aliasing | TLBTEMP_BASE_2 DCACHE_WAY_SIZE + | remap area 2 | + +------------------+ + +------------------+ + | KMAP area | PKMAP_BASE PTRS_PER_PTE * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + | | (4MB * DCACHE_N_COLORS) + +------------------+ + | Atomic KMAP area | FIXADDR_START KM_TYPE_NR * + | | NR_CPUS * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + +------------------+ FIXADDR_TOP 0xcffff000 + +------------------+ + | Cached KSEG | XCHAL_KSEG_CACHED_VADDR 0xd0000000 128MB + +------------------+ + | Uncached KSEG | XCHAL_KSEG_BYPASS_VADDR 0xd8000000 128MB + +------------------+ + | Cached KIO | XCHAL_KIO_CACHED_VADDR 0xe0000000 256MB + +------------------+ + | Uncached KIO | XCHAL_KIO_BYPASS_VADDR 0xf0000000 256MB + +------------------+ + + +256MB cached + 256MB uncached layout:: + + Symbol VADDR Size + +------------------+ + | Userspace | 0x00000000 TASK_SIZE + +------------------+ 0x40000000 + +------------------+ + | Page table | XCHAL_PAGE_TABLE_VADDR 0x80000000 XCHAL_PAGE_TABLE_SIZE + +------------------+ + | KASAN shadow map | KASAN_SHADOW_START 0x80400000 KASAN_SHADOW_SIZE + +------------------+ 0x8e400000 + +------------------+ + | VMALLOC area | VMALLOC_START 0xa0000000 128MB - 64KB + +------------------+ VMALLOC_END + +------------------+ + | Cache aliasing | TLBTEMP_BASE_1 0xa8000000 DCACHE_WAY_SIZE + | remap area 1 | + +------------------+ + | Cache aliasing | TLBTEMP_BASE_2 DCACHE_WAY_SIZE + | remap area 2 | + +------------------+ + +------------------+ + | KMAP area | PKMAP_BASE PTRS_PER_PTE * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + | | (4MB * DCACHE_N_COLORS) + +------------------+ + | Atomic KMAP area | FIXADDR_START KM_TYPE_NR * + | | NR_CPUS * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + +------------------+ FIXADDR_TOP 0xaffff000 + +------------------+ + | Cached KSEG | XCHAL_KSEG_CACHED_VADDR 0xb0000000 256MB + +------------------+ + | Uncached KSEG | XCHAL_KSEG_BYPASS_VADDR 0xc0000000 256MB + +------------------+ + +------------------+ + | Cached KIO | XCHAL_KIO_CACHED_VADDR 0xe0000000 256MB + +------------------+ + | Uncached KIO | XCHAL_KIO_BYPASS_VADDR 0xf0000000 256MB + +------------------+ + + +512MB cached + 512MB uncached layout:: + + Symbol VADDR Size + +------------------+ + | Userspace | 0x00000000 TASK_SIZE + +------------------+ 0x40000000 + +------------------+ + | Page table | XCHAL_PAGE_TABLE_VADDR 0x80000000 XCHAL_PAGE_TABLE_SIZE + +------------------+ + | KASAN shadow map | KASAN_SHADOW_START 0x80400000 KASAN_SHADOW_SIZE + +------------------+ 0x8e400000 + +------------------+ + | VMALLOC area | VMALLOC_START 0x90000000 128MB - 64KB + +------------------+ VMALLOC_END + +------------------+ + | Cache aliasing | TLBTEMP_BASE_1 0x98000000 DCACHE_WAY_SIZE + | remap area 1 | + +------------------+ + | Cache aliasing | TLBTEMP_BASE_2 DCACHE_WAY_SIZE + | remap area 2 | + +------------------+ + +------------------+ + | KMAP area | PKMAP_BASE PTRS_PER_PTE * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + | | (4MB * DCACHE_N_COLORS) + +------------------+ + | Atomic KMAP area | FIXADDR_START KM_TYPE_NR * + | | NR_CPUS * + | | DCACHE_N_COLORS * + | | PAGE_SIZE + +------------------+ FIXADDR_TOP 0x9ffff000 + +------------------+ + | Cached KSEG | XCHAL_KSEG_CACHED_VADDR 0xa0000000 512MB + +------------------+ + | Uncached KSEG | XCHAL_KSEG_BYPASS_VADDR 0xc0000000 512MB + +------------------+ + | Cached KIO | XCHAL_KIO_CACHED_VADDR 0xe0000000 256MB + +------------------+ + | Uncached KIO | XCHAL_KIO_BYPASS_VADDR 0xf0000000 256MB + +------------------+ |