author     | Dmitry Torokhov <dmitry.torokhov@gmail.com> | 2024-05-27 21:37:18 -0700
committer  | Dmitry Torokhov <dmitry.torokhov@gmail.com> | 2024-05-27 21:37:18 -0700
commit     | 6f47c7ae8c7afaf9ad291d39f0d3974f191a7946 (patch)
tree       | 74ce89b352cb8096e6a94ddc8597274d3e2d53ce /Documentation/mm
parent     | 832f54c9ccd3a3f32d1db905462d3c58b4df52bd (diff)
parent     | a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 (diff)
Merge tag 'v6.9' into next
Sync up with the mainline to bring in the new cleanup API.
Diffstat (limited to 'Documentation/mm')
-rw-r--r-- | Documentation/mm/arch_pgtable_helpers.rst | 14
-rw-r--r-- | Documentation/mm/damon/design.rst | 129
-rw-r--r-- | Documentation/mm/damon/maintainer-profile.rst | 8
-rw-r--r-- | Documentation/mm/frontswap.rst | 264
-rw-r--r-- | Documentation/mm/highmem.rst | 28
-rw-r--r-- | Documentation/mm/hmm.rst | 13
-rw-r--r-- | Documentation/mm/hugetlbfs_reserv.rst | 14
-rw-r--r-- | Documentation/mm/hwpoison.rst | 2
-rw-r--r-- | Documentation/mm/index.rst | 1
-rw-r--r-- | Documentation/mm/overcommit-accounting.rst | 3
-rw-r--r-- | Documentation/mm/page_cache.rst | 10
-rw-r--r-- | Documentation/mm/page_migration.rst | 2
-rw-r--r-- | Documentation/mm/page_owner.rst | 48
-rw-r--r-- | Documentation/mm/page_tables.rst | 127
-rw-r--r-- | Documentation/mm/slub.rst | 60
-rw-r--r-- | Documentation/mm/split_page_table_lock.rst | 12
-rw-r--r-- | Documentation/mm/transhuge.rst | 4
-rw-r--r-- | Documentation/mm/unevictable-lru.rst | 6
-rw-r--r-- | Documentation/mm/vmemmap_dedup.rst | 6
-rw-r--r-- | Documentation/mm/zsmalloc.rst | 5
20 files changed, 388 insertions, 368 deletions
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index af3891f895b0..2466d3363af7 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -18,8 +18,6 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_same | Tests whether both PTE entries are the same | +---------------------------+--------------------------------------------------+ -| pte_bad | Tests a non-table mapped PTE | -+---------------------------+--------------------------------------------------+ | pte_present | Tests a valid mapped PTE | +---------------------------+--------------------------------------------------+ | pte_young | Tests a young PTE | @@ -46,7 +44,11 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_mkclean | Creates a clean PTE | +---------------------------+--------------------------------------------------+ -| pte_mkwrite | Creates a writable PTE | +| pte_mkwrite | Creates a writable PTE of the type specified by | +| | the VMA. | ++---------------------------+--------------------------------------------------+ +| pte_mkwrite_novma | Creates a writable PTE, of the conventional type | +| | of writable. | +---------------------------+--------------------------------------------------+ | pte_wrprotect | Creates a write protected PTE | +---------------------------+--------------------------------------------------+ @@ -118,7 +120,11 @@ PMD Page Table Helpers +---------------------------+--------------------------------------------------+ | pmd_mkclean | Creates a clean PMD | +---------------------------+--------------------------------------------------+ -| pmd_mkwrite | Creates a writable PMD | +| pmd_mkwrite | Creates a writable PMD of the type specified by | +| | the VMA. | ++---------------------------+--------------------------------------------------+ +| pmd_mkwrite_novma | Creates a writable PMD, of the conventional type | +| | of writable. | +---------------------------+--------------------------------------------------+ | pmd_wrprotect | Creates a write protected PMD | +---------------------------+--------------------------------------------------+ diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 4bfdf1d30c4a..5620aab9b385 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -5,6 +5,18 @@ Design ====== +.. _damon_design_execution_model_and_data_structures: + +Execution Model and Data Structures +=================================== + +The monitoring-related information including the monitoring request +specification and DAMON-based operation schemes are stored in a data structure +called DAMON ``context``. DAMON executes each context with a kernel thread +called ``kdamond``. Multiple kdamonds could run in parallel, for different +types of monitoring. + + Overall Architecture ==================== @@ -19,6 +31,8 @@ DAMON subsystem is configured with three layers including interfaces for the user space, on top of the core layer. +.. _damon_design_configurable_operations_set: + Configurable Operations Set --------------------------- @@ -51,6 +65,8 @@ modules that built on top of the core layer using the API, which can be easily used by the user space end users. +.. _damon_operations_set: + Operations Set Layer ==================== @@ -59,16 +75,26 @@ The monitoring operations are defined in two parts: 1. 
Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space. -DAMON currently provides the implementations of the operations for the physical -and virtual address spaces. Below two subsections describe how those work. +DAMON currently provides below three operation sets. Below two subsections +describe how those work. + + - vaddr: Monitor virtual address spaces of specific processes + - fvaddr: Monitor fixed virtual address ranges + - paddr: Monitor the physical address space of the system + .. _damon_design_vaddr_target_regions_construction: + VMA-based Target Address Range Construction ------------------------------------------- -This is only for the virtual address space monitoring operations -implementation. That for the physical address space simply asks users to -manually set the monitoring target address ranges. +A mechanism of ``vaddr`` DAMON operations set that automatically initializes +and updates the monitoring target address regions so that entire memory +mappings of the target processes can be covered. + +This mechanism is only for the ``vaddr`` operations set. In cases of +``fvaddr`` and ``paddr`` operation sets, users are asked to manually set the +monitoring target address ranges. Only small parts in the super-huge virtual address space of the processes are mapped to the physical memory and accessed. Thus, tracking the unmapped @@ -154,6 +180,8 @@ The monitoring overhead of this mechanism will arbitrarily increase as the size of the target workload grows. +.. _damon_design_region_based_sampling: + Region Based Sampling ~~~~~~~~~~~~~~~~~~~~~ @@ -163,9 +191,10 @@ assumption (pages in a region have the same access frequencies) is kept, only one page in the region is required to be checked. Thus, for each ``sampling interval``, DAMON randomly picks one page in each region, waits for one ``sampling interval``, checks whether the page is accessed meanwhile, and -increases the access frequency of the region if so. Therefore, the monitoring -overhead is controllable by setting the number of regions. DAMON allows users -to set the minimum and the maximum number of regions for the trade-off. +increases the access frequency counter of the region if so. The counter is +called ``nr_regions`` of the region. Therefore, the monitoring overhead is +controllable by setting the number of regions. DAMON allows users to set the +minimum and the maximum number of regions for the trade-off. This scheme, however, cannot preserve the quality of the output if the assumption is not guaranteed. @@ -190,6 +219,8 @@ In this way, DAMON provides its best-effort quality and minimal overhead while keeping the bounds users set for their trade-off. +.. _damon_design_age_tracking: + Age Tracking ~~~~~~~~~~~~ @@ -254,7 +285,8 @@ works, DAMON provides a feature called Data Access Monitoring-based Operation Schemes (DAMOS). It lets users specify their desired schemes at a high level. For such specifications, DAMON starts monitoring, finds regions having the access pattern of interest, and applies the user-desired operation actions -to the regions as soon as found. +to the regions, for every user-specified time interval called +``apply_interval``. .. _damon_design_damos_action: @@ -276,9 +308,29 @@ not mandated to support all actions of the list. Hence, the availability of specific DAMOS action depends on what operations set is selected to be used together. 
-Applying an action to a region is considered as changing the region's -characteristics. Hence, DAMOS resets the age of regions when an action is -applied to those. +The list of the supported actions, their meaning, and DAMON operations sets +that supports each action are as below. + + - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``pageout``: Reclaim the region. + Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set. + - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``lru_prio``: Prioritize the region on its LRU lists. + Supported by ``paddr`` operations set. + - ``lru_deprio``: Deprioritize the region on its LRU lists. + Supported by ``paddr`` operations set. + - ``stat``: Do nothing but count the statistics. + Supported by all operations sets. + +Applying the actions except ``stat`` to a region is considered as changing the +region's characteristics. Hence, DAMOS resets the age of regions when any such +actions are applied to those. .. _damon_design_damos_access_pattern: @@ -340,6 +392,35 @@ the weight will be respected are up to the underlying prioritization mechanism implementation. +.. _damon_design_damos_quotas_auto_tuning: + +Aim-oriented Feedback-driven Auto-tuning +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Automatic feedback-driven quota tuning. Instead of setting the absolute quota +value, users can specify the metric of their interest, and what target value +they want the metric value to be. DAMOS then automatically tunes the +aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS +is under achieving the goal, DAMOS automatically increases the quota. If DAMOS +is over achieving the goal, it decreases the quota. + +The goal can be specified with three parameters, namely ``target_metric``, +``target_value``, and ``current_value``. The auto-tuning mechanism tries to +make ``current_value`` of ``target_metric`` be same to ``target_value``. +Currently, two ``target_metric`` are provided. + +- ``user_input``: User-provided value. Users could use any metric that they + has interest in for the value. Use space main workload's latency or + throughput, system metrics like free memory ratio or memory pressure stall + time (PSI) could be examples. Note that users should explicitly set + ``current_value`` on their own in this case. In other words, users should + repeatedly provide the feedback. +- ``some_mem_psi_us``: System-wide ``some`` memory pressure stall information + in microseconds that measured from last quota reset to next quota reset. + DAMOS does the measurement on its own, so only ``target_value`` need to be + set by users at the initial time. In other words, DAMOS does self-feedback. + + .. _damon_design_damos_watermarks: Watermarks @@ -380,12 +461,24 @@ number of filters for each scheme. Each filter specifies the type of target memory, and whether it should exclude the memory of the type (filter-out), or all except the memory of the type (filter-in). -As of this writing, anonymous page type and memory cgroup type are supported by -the feature. Some filter target types can require additional arguments. 
For -example, the memory cgroup filter type asks users to specify the file path of -the memory cgroup for the filter. Hence, users can apply specific schemes to -only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages -excluding those of specific cgroups, and any combination of those. +Currently, anonymous page, memory cgroup, address range, and DAMON monitoring +target type filters are supported by the feature. Some filter target types +require additional arguments. The memory cgroup filter type asks users to +specify the file path of the memory cgroup for the filter. The address range +type asks the start and end addresses of the range. The DAMON monitoring +target type asks the index of the target from the context's monitoring targets +list. Hence, users can apply specific schemes to only anonymous pages, +non-anonymous pages, pages of specific cgroups, all pages excluding those of +specific cgroups, pages in specific address range, pages in specific DAMON +monitoring targets, and any combination of those. + +To handle filters efficiently, the address range and DAMON monitoring target +type filters are handled by the core layer, while others are handled by +operations set. If a memory region is filtered by a core layer-handled filter, +it is not counted as the scheme has tried to the region. In contrast, if a +memory regions is filtered by an operations set layer-handled filter, it is +counted as the scheme has tried. The difference in accounting leads to changes +in the statistics. Application Programming Interface diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index a84c14e59053..5a306e4de22e 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -21,8 +21,8 @@ be queued in mm-stable [3]_ , and finally pull-requested to the mainline by the memory management subsystem maintainer. Note again the patches for review should be made against the mm-unstable -tree[1] whenever possible. damon/next is only for preview of others' works in -progress. +tree [1]_ whenever possible. damon/next is only for preview of others' works +in progress. Submit checklist addendum ------------------------- @@ -41,8 +41,8 @@ Further doing below and putting the results will be helpful. Key cycle dates --------------- -Patches can be sent anytime. Key cycle dates of the mm-unstable[1] and -mm-stable[3] trees depend on the memory management subsystem maintainer. +Patches can be sent anytime. Key cycle dates of the mm-unstable [1]_ and +mm-stable [3]_ trees depend on the memory management subsystem maintainer. Review cadence -------------- diff --git a/Documentation/mm/frontswap.rst b/Documentation/mm/frontswap.rst deleted file mode 100644 index c892412988af..000000000000 --- a/Documentation/mm/frontswap.rst +++ /dev/null @@ -1,264 +0,0 @@ -========= -Frontswap -========= - -Frontswap provides a "transcendent memory" interface for swap pages. -In some environments, dramatic performance savings may be obtained because -swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. - -.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ - -Frontswap is so named because it can be thought of as the opposite of -a "backing" store for a swap device. 
The storage is assumed to be -a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming -to the requirements of transcendent memory (such as Xen's "tmem", or -in-kernel compressed memory, aka "zcache", or future RAM-like devices); -this pseudo-RAM device is not directly accessible or addressable by the -kernel and is of unknown and possibly time-varying size. The driver -links itself to frontswap by calling frontswap_register_ops to set the -frontswap_ops funcs appropriately and the functions it provides must -conform to certain policies as follows: - -An "init" prepares the device to receive frontswap pages associated -with the specified swap device number (aka "type"). A "store" will -copy the page to transcendent memory and associate it with the type and -offset associated with the page. A "load" will copy the page, if found, -from transcendent memory into kernel memory, but will NOT remove the page -from transcendent memory. An "invalidate_page" will remove the page -from transcendent memory and an "invalidate_area" will remove ALL pages -associated with the swap type (e.g., like swapoff) and notify the "device" -to refuse further stores with that swap type. - -Once a page is successfully stored, a matching load on the page will normally -succeed. So when the kernel finds itself in a situation where it needs -to swap out a page, it first attempts to use frontswap. If the store returns -success, the data has been successfully saved to transcendent memory and -a disk write and, if the data is later read back, a disk read are avoided. -If a store returns failure, transcendent memory has rejected the data, and the -page can be written to swap as usual. - -Note that if a page is stored and the page already exists in transcendent memory -(a "duplicate" store), either the store succeeds and the data is overwritten, -or the store fails AND the page is invalidated. This ensures stale data may -never be obtained from frontswap. - -If properly configured, monitoring of frontswap is done via debugfs in -the `/sys/kernel/debug/frontswap` directory. The effectiveness of -frontswap can be measured (across all swap devices) with: - -``failed_stores`` - how many store attempts have failed - -``loads`` - how many loads were attempted (all should succeed) - -``succ_stores`` - how many store attempts have succeeded - -``invalidates`` - how many invalidates were attempted - -A backend implementation may provide additional metrics. - -FAQ -=== - -* Where's the value? - -When a workload starts swapping, performance falls through the floor. -Frontswap significantly increases performance in many such workloads by -providing a clean, dynamic interface to read and write swap pages to -"transcendent memory" that is otherwise not directly addressable to the kernel. -This interface is ideal when data is transformed to a different form -and size (such as with compression) or secretly moved (as might be -useful for write-balancing for some RAM-like devices). Swap pages (and -evicted page-cache pages) are a great use for this kind of slower-than-RAM- -but-much-faster-than-disk "pseudo-RAM device". - -Frontswap with a fairly small impact on the kernel, -provides a huge amount of flexibility for more dynamic, flexible RAM -utilization in various system configurations: - -In the single kernel case, aka "zcache", pages are compressed and -stored in local memory, thus increasing the total anonymous pages -that can be safely kept in RAM. 
Zcache essentially trades off CPU -cycles used in compression/decompression for better memory utilization. -Benchmarks have shown little or no impact when memory pressure is -low while providing a significant performance improvement (25%+) -on some workloads under high memory pressure. - -"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory -support for clustered systems. Frontswap pages are locally compressed -as in zcache, but then "remotified" to another system's RAM. This -allows RAM to be dynamically load-balanced back-and-forth as needed, -i.e. when system A is overcommitted, it can swap to system B, and -vice versa. RAMster can also be configured as a memory server so -many servers in a cluster can swap, dynamically as needed, to a single -server configured with a large amount of RAM... without pre-configuring -how much of the RAM is available for each of the clients! - -In the virtual case, the whole point of virtualization is to statistically -multiplex physical resources across the varying demands of multiple -virtual machines. This is really hard to do with RAM and efforts to do -it well with no kernel changes have essentially failed (except in some -well-publicized special-case workloads). -Specifically, the Xen Transcendent Memory backend allows otherwise -"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple -virtual machines, but the pages can be compressed and deduplicated to -optimize RAM utilization. And when guest OS's are induced to surrender -underutilized RAM (e.g. with "selfballooning"), sudden unexpected -memory pressure may result in swapping; frontswap allows those pages -to be swapped to and from hypervisor RAM (if overall host system memory -conditions allow), thus mitigating the potentially awful performance impact -of unplanned swapping. - -A KVM implementation is underway and has been RFC'ed to lkml. And, -using frontswap, investigation is also underway on the use of NVM as -a memory extension technology. - -* Sure there may be performance advantages in some situations, but - what's the space/time overhead of frontswap? - -If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into -nothingness and the only overhead is a few extra bytes per swapon'ed -swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" -registers, there is one extra global variable compared to zero for -every swap page read or written. If CONFIG_FRONTSWAP is enabled -AND a frontswap backend registers AND the backend fails every "store" -request (i.e. provides no memory despite claiming it might), -CPU overhead is still negligible -- and since every frontswap fail -precedes a swap page write-to-disk, the system is highly likely -to be I/O bound and using a small fraction of a percent of a CPU -will be irrelevant anyway. - -As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend -registers, one bit is allocated for every swap page for every swap -device that is swapon'd. This is added to the EIGHT bits (which -was sixteen until about 2.6.34) that the kernel already allocates -for every swap page for every swap device that is swapon'd. (Hugh -Dickins has observed that frontswap could probably steal one of -the existing eight bits, but let's worry about that minor optimization -later.) For very large swap disks (which are rare) on a standard -4K pagesize, this is 1MB per 32GB swap. 
- -When swap pages are stored in transcendent memory instead of written -out to disk, there is a side effect that this may create more memory -pressure that can potentially outweigh the other advantages. A -backend, such as zcache, must implement policies to carefully (but -dynamically) manage memory limits to ensure this doesn't happen. - -* OK, how about a quick overview of what this frontswap patch does - in terms that a kernel hacker can grok? - -Let's assume that a frontswap "backend" has registered during -kernel initialization; this registration indicates that this -frontswap backend has access to some "memory" that is not directly -accessible by the kernel. Exactly how much memory it provides is -entirely dynamic and random. - -Whenever a swap-device is swapon'd frontswap_init() is called, -passing the swap device number (aka "type") as a parameter. -This notifies frontswap to expect attempts to "store" swap pages -associated with that number. - -Whenever the swap subsystem is readying a page to write to a swap -device (c.f swap_writepage()), frontswap_store is called. Frontswap -consults with the frontswap backend and if the backend says it does NOT -have room, frontswap_store returns -1 and the kernel swaps the page -to the swap device as normal. Note that the response from the frontswap -backend is unpredictable to the kernel; it may choose to never accept a -page, it could accept every ninth page, or it might accept every -page. But if the backend does accept a page, the data from the page -has already been copied and associated with the type and offset, -and the backend guarantees the persistence of the data. In this case, -frontswap sets a bit in the "frontswap_map" for the swap device -corresponding to the page offset on the swap device to which it would -otherwise have written the data. - -When the swap subsystem needs to swap-in a page (swap_readpage()), -it first calls frontswap_load() which checks the frontswap_map to -see if the page was earlier accepted by the frontswap backend. If -it was, the page of data is filled from the frontswap backend and -the swap-in is complete. If not, the normal swap-in code is -executed to obtain the page of data from the real swap device. - -So every time the frontswap backend accepts a page, a swap device read -and (potentially) a swap device write are replaced by a "frontswap backend -store" and (possibly) a "frontswap backend loads", which are presumably much -faster. - -* Can't frontswap be configured as a "special" swap device that is - just higher priority than any real swap device (e.g. like zswap, - or maybe swap-over-nbd/NFS)? - -No. First, the existing swap subsystem doesn't allow for any kind of -swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, -but this would require fairly drastic changes. Even if it were -rewritten, the existing swap subsystem uses the block I/O layer which -assumes a swap device is fixed size and any page in it is linearly -addressable. Frontswap barely touches the existing swap subsystem, -and works around the constraints of the block I/O subsystem to provide -a great deal of flexibility and dynamicity. - -For example, the acceptance of any swap page by the frontswap backend is -entirely unpredictable. This is critical to the definition of frontswap -backends because it grants completely dynamic discretion to the -backend. In zcache, one cannot know a priori how compressible a page is. 
-"Poorly" compressible pages can be rejected, and "poorly" can itself be -defined dynamically depending on current memory constraints. - -Further, frontswap is entirely synchronous whereas a real swap -device is, by definition, asynchronous and uses block I/O. The -block I/O layer is not only unnecessary, but may perform "optimizations" -that are inappropriate for a RAM-oriented device including delaying -the write of some pages for a significant amount of time. Synchrony is -required to ensure the dynamicity of the backend and to avoid thorny race -conditions that would unnecessarily and greatly complicate frontswap -and/or the block I/O subsystem. That said, only the initial "store" -and "load" operations need be synchronous. A separate asynchronous thread -is free to manipulate the pages stored by frontswap. For example, -the "remotification" thread in RAMster uses standard asynchronous -kernel sockets to move compressed frontswap pages to a remote machine. -Similarly, a KVM guest-side implementation could do in-guest compression -and use "batched" hypercalls. - -In a virtualized environment, the dynamicity allows the hypervisor -(or host OS) to do "intelligent overcommit". For example, it can -choose to accept pages only until host-swapping might be imminent, -then force guests to do their own swapping. - -There is a downside to the transcendent memory specifications for -frontswap: Since any "store" might fail, there must always be a real -slot on a real swap device to swap the page. Thus frontswap must be -implemented as a "shadow" to every swapon'd device with the potential -capability of holding every page that the swap device might have held -and the possibility that it might hold no pages at all. This means -that frontswap cannot contain more pages than the total of swapon'd -swap devices. For example, if NO swap device is configured on some -installation, frontswap is useless. Swapless portable devices -can still use frontswap but a backend for such devices must configure -some kind of "ghost" swap device and ensure that it is never used. - -* Why this weird definition about "duplicate stores"? If a page - has been previously successfully stored, can't it always be - successfully overwritten? - -Nearly always it can, but no, sometimes it cannot. Consider an example -where data is compressed and the original 4K page has been compressed -to 1K. Now an attempt is made to overwrite the page with data that -is non-compressible and so would take the entire 4K. But the backend -has no more space. In this case, the store must be rejected. Whenever -frontswap rejects a store that would overwrite, it also must invalidate -the old data and ensure that it is no longer accessible. Since the -swap subsystem then writes the new data to the read swap device, -this is the correct course of action to ensure coherency. - -* Why does the frontswap patch create the new include file swapfile.h? - -The frontswap code depends on some swap-subsystem-internal data -structures that have, over the years, moved back and forth between -static and global. This seemed a reasonable compromise: Define -them as global but declare them in a new include file that isn't -included by the large number of source files that include swap.h. 
- -Dan Magenheimer, last updated April 9, 2012 diff --git a/Documentation/mm/highmem.rst b/Documentation/mm/highmem.rst index c964e0848702..9d92e3f2b3d6 100644 --- a/Documentation/mm/highmem.rst +++ b/Documentation/mm/highmem.rst @@ -51,11 +51,14 @@ Temporary Virtual Mappings The kernel contains several ways of creating temporary mappings. The following list shows them in order of preference of use. -* kmap_local_page(). This function is used to require short term mappings. - It can be invoked from any context (including interrupts) but the mappings - can only be used in the context which acquired them. - - This function should always be used, whereas kmap_atomic() and kmap() have +* kmap_local_page(), kmap_local_folio() - These functions are used to create + short term mappings. They can be invoked from any context (including + interrupts) but the mappings can only be used in the context which acquired + them. The only differences between them consist in the first taking a pointer + to a struct page and the second taking a pointer to struct folio and the byte + offset within the folio which identifies the page. + + These functions should always be used, whereas kmap_atomic() and kmap() have been deprecated. These mappings are thread-local and CPU-local, meaning that the mapping @@ -72,17 +75,17 @@ list shows them in order of preference of use. maps of the outgoing task are saved and those of the incoming one are restored. - kmap_local_page() always returns a valid virtual address and it is assumed - that kunmap_local() will never fail. + kmap_local_page(), as well as kmap_local_folio() always returns valid virtual + kernel addresses and it is assumed that kunmap_local() will never fail. - On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the + On CONFIG_HIGHMEM=n kernels and for low memory pages they return the virtual address of the direct mapping. Only real highmem pages are temporarily mapped. Therefore, users may call a plain page_address() for pages which are known to not come from ZONE_HIGHMEM. However, it is - always safe to use kmap_local_page() / kunmap_local(). + always safe to use kmap_local_{page,folio}() / kunmap_local(). - While it is significantly faster than kmap(), for the highmem case it - comes with restrictions about the pointers validity. Contrary to kmap() + While they are significantly faster than kmap(), for the highmem case they + come with restrictions about the pointers validity. Contrary to kmap() mappings, the local mappings are only valid in the context of the caller and cannot be handed to other contexts. This implies that users must be absolutely sure to keep the use of the return address local to the @@ -91,7 +94,7 @@ list shows them in order of preference of use. Most code can be designed to use thread local mappings. User should therefore try to design their code to avoid the use of kmap() by mapping pages in the same thread the address will be used and prefer - kmap_local_page(). + kmap_local_page() or kmap_local_folio(). Nesting kmap_local_page() and kmap_atomic() mappings is allowed to a certain extent (up to KMAP_TYPE_NR) but their invocations have to be strictly ordered @@ -206,4 +209,5 @@ Functions ========= .. kernel-doc:: include/linux/highmem.h +.. kernel-doc:: mm/highmem.c .. 
kernel-doc:: include/linux/highmem-internal.h diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst index 9aa512c3a12c..0595098a74d9 100644 --- a/Documentation/mm/hmm.rst +++ b/Documentation/mm/hmm.rst @@ -163,16 +163,7 @@ use:: It will trigger a page fault on missing or read-only entries if write access is requested (see below). Page faults use the generic mm page fault code path just -like a CPU page fault. - -Both functions copy CPU page table entries into their pfns array argument. Each -entry in that array corresponds to an address in the virtual range. HMM -provides a set of flags to help the driver identify special CPU page table -entries. - -Locking within the sync_cpu_device_pagetables() callback is the most important -aspect the driver must respect in order to keep things properly synchronized. -The usage pattern is:: +like a CPU page fault. The usage pattern is:: int driver_populate_range(...) { @@ -417,7 +408,7 @@ entries. Any attempt to access the swap entry results in a fault which is resovled by replacing the entry with the original mapping. A driver gets notified that the mapping has been changed by MMU notifiers, after which point it will no longer have exclusive access to the page. Exclusive access is -guranteed to last until the driver drops the page lock and page reference, at +guaranteed to last until the driver drops the page lock and page reference, at which point any CPU faults on the page may proceed as described. Memory cgroup (memcg) and rss accounting diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst index d9c2b0f01dcd..4914fbf07966 100644 --- a/Documentation/mm/hugetlbfs_reserv.rst +++ b/Documentation/mm/hugetlbfs_reserv.rst @@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages). Freeing Huge Pages ================== -Huge page freeing is performed by the routine free_huge_page(). This routine -is the destructor for hugetlbfs compound pages. As a result, it is only -passed a pointer to the page struct. When a huge page is freed, reservation -accounting may need to be performed. This would be the case if the page was -associated with a subpool that contained reserves, or the page is being freed -on an error path where a global reserve count must be restored. +Huge pages are freed by free_huge_folio(). It is only passed a pointer +to the folio as it is called from the generic MM code. When a huge page +is freed, reservation accounting may need to be performed. This would +be the case if the page was associated with a subpool that contained +reserves, or the page is being freed on an error path where a global +reserve count must be restored. The page->private field points to any subpool associated with the page. If the PagePrivate flag is set, it indicates the global reserve count should @@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge page is allocated but before it is instantiated. In this case, the page allocation has consumed the reservation and made the appropriate subpool, reservation map and global count adjustments. If the page is freed at this -time (before instantiation and clearing of PagePrivate), then free_huge_page +time (before instantiation and clearing of PagePrivate), then free_huge_folio will increment the global reservation count. However, the reservation map indicates the reservation was consumed. This resulting inconsistent state will cause the 'leak' of a reserved huge page. 
The global reserve count will diff --git a/Documentation/mm/hwpoison.rst b/Documentation/mm/hwpoison.rst index ba48a441feed..483b72aa7c11 100644 --- a/Documentation/mm/hwpoison.rst +++ b/Documentation/mm/hwpoison.rst @@ -48,7 +48,7 @@ of applications. KVM support requires a recent qemu-kvm release. For the KVM use there was need for a new signal type so that KVM can inject the machine check into the guest with the proper address. This in theory allows other applications to handle -memory failures too. The expection is that near all applications +memory failures too. The expectation is that most applications won't do that, but some very specialized ones might. Failure recovery modes diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst index 5a94a921ea40..31d2ac306438 100644 --- a/Documentation/mm/index.rst +++ b/Documentation/mm/index.rst @@ -44,7 +44,6 @@ above structured documentation, or deleted if it has served its purpose. balance damon/index free_page_reporting - frontswap hmm hwpoison hugetlbfs_reserv diff --git a/Documentation/mm/overcommit-accounting.rst b/Documentation/mm/overcommit-accounting.rst index a4895d6fc1c2..e2263477f6d5 100644 --- a/Documentation/mm/overcommit-accounting.rst +++ b/Documentation/mm/overcommit-accounting.rst @@ -8,8 +8,7 @@ The Linux kernel supports the following overcommit handling modes Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to - reduce swap usage. root is allowed to allocate slightly more - memory in this mode. This is the default. + reduce swap usage. This is the default. 1 Always overcommit. Appropriate for some scientific diff --git a/Documentation/mm/page_cache.rst b/Documentation/mm/page_cache.rst index 75eba7c431b2..138d61f869df 100644 --- a/Documentation/mm/page_cache.rst +++ b/Documentation/mm/page_cache.rst @@ -3,3 +3,13 @@ ========== Page Cache ========== + +The page cache is the primary way that the user and the rest of the kernel +interact with filesystems. It can be bypassed (e.g. with O_DIRECT), +but normal reads, writes and mmaps go through the page cache. + +Folios +====== + +The folio is the unit of memory management within the page cache. +Operations diff --git a/Documentation/mm/page_migration.rst b/Documentation/mm/page_migration.rst index e35af7805be5..f1ce67a26615 100644 --- a/Documentation/mm/page_migration.rst +++ b/Documentation/mm/page_migration.rst @@ -180,7 +180,7 @@ The following events (counters) can be used to monitor page migration. 4. THP_MIGRATION_FAIL: A THP could not be migrated nor it could be split. 5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had - to be split. After splitting, a migration retry was used for it's sub-pages. + to be split. After splitting, a migration retry was used for its sub-pages. THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or PGMIGRATE_FAIL events. For example, a THP migration failure will cause both diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst index 62e3f7ab23cc..3a45a20fc05a 100644 --- a/Documentation/mm/page_owner.rst +++ b/Documentation/mm/page_owner.rst @@ -24,6 +24,11 @@ fragmentation statistics can be obtained through gfp flag information of each page. It is already implemented and activated if page owner is enabled. Other usages are more than welcome. 
+It can also be used to show all the stacks and their current number of +allocated base pages, which gives us a quick overview of where the memory +is going without the need to screen through all the pages and match the +allocation and free operation. + page owner is disabled by default. So, if you'd like to use it, you need to add "page_owner=on" to your boot cmdline. If the kernel is built with page owner and page owner is disabled in runtime due to not enabling @@ -68,6 +73,49 @@ Usage 4) Analyze information from page owner:: + cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt + cat stacks.txt + post_alloc_hook+0x177/0x1a0 + get_page_from_freelist+0xd01/0xd80 + __alloc_pages+0x39e/0x7e0 + allocate_slab+0xbc/0x3f0 + ___slab_alloc+0x528/0x8a0 + kmem_cache_alloc+0x224/0x3b0 + sk_prot_alloc+0x58/0x1a0 + sk_alloc+0x32/0x4f0 + inet_create+0x427/0xb50 + __sock_create+0x2e4/0x650 + inet_ctl_sock_create+0x30/0x180 + igmp_net_init+0xc1/0x130 + ops_init+0x167/0x410 + setup_net+0x304/0xa60 + copy_net_ns+0x29b/0x4a0 + create_new_namespaces+0x4a1/0x820 + nr_base_pages: 16 + ... + ... + echo 7000 > /sys/kernel/debug/page_owner_stacks/count_threshold + cat /sys/kernel/debug/page_owner_stacks/show_stacks> stacks_7000.txt + cat stacks_7000.txt + post_alloc_hook+0x177/0x1a0 + get_page_from_freelist+0xd01/0xd80 + __alloc_pages+0x39e/0x7e0 + alloc_pages_mpol+0x22e/0x490 + folio_alloc+0xd5/0x110 + filemap_alloc_folio+0x78/0x230 + page_cache_ra_order+0x287/0x6f0 + filemap_get_pages+0x517/0x1160 + filemap_read+0x304/0x9f0 + xfs_file_buffered_read+0xe6/0x1d0 [xfs] + xfs_file_read_iter+0x1f0/0x380 [xfs] + __kernel_read+0x3b9/0x730 + kernel_read_file+0x309/0x4d0 + __do_sys_finit_module+0x381/0x730 + do_syscall_64+0x8d/0x150 + entry_SYSCALL_64_after_hwframe+0x62/0x6a + nr_base_pages: 20824 + ... + cat /sys/kernel/debug/page_owner > page_owner_full.txt ./page_owner_sort page_owner_full.txt sorted_page_owner.txt diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst index 7840c1891751..be47b192a596 100644 --- a/Documentation/mm/page_tables.rst +++ b/Documentation/mm/page_tables.rst @@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the virtual memory manager, will need to be written so that it traverses all of the currently five levels. This style should also be preferred for architecture-specific code, so as to be robust to future changes. + + +MMU, TLB, and Page Faults +========================= + +The `Memory Management Unit (MMU)` is a hardware component that handles virtual +to physical address translations. It may use relatively small caches in hardware +called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up +these translations. + +When CPU accesses a memory location, it provides a virtual address to the MMU, +which checks if there is the existing translation in the TLB or in the Page +Walk Caches (on architectures that support them). If no translation is found, +MMU uses the page walks to determine the physical address and create the map. + +The dirty bit for a page is set (i.e., turned on) when the page is written to. +Each page of memory has associated permission and dirty bits. The latter +indicate that the page has been modified since it was loaded into memory. + +If nothing prevents it, eventually the physical memory can be accessed and the +requested operation on the physical frame is performed. + +There are several reasons why the MMU can't find certain translations. 
It could +happen because the CPU is trying to access memory that the current task is not +permitted to, or because the data is not present into physical memory. + +When these conditions happen, the MMU triggers page faults, which are types of +exceptions that signal the CPU to pause the current execution and run a special +function to handle the mentioned exceptions. + +There are common and expected causes of page faults. These are triggered by +process management optimization techniques called "Lazy Allocation" and +"Copy-on-Write". Page faults may also happen when frames have been swapped out +to persistent storage (swap partition or file) and evicted from their physical +locations. + +These techniques improve memory efficiency, reduce latency, and minimize space +occupation. This document won't go deeper into the details of "Lazy Allocation" +and "Copy-on-Write" because these subjects are out of scope as they belong to +Process Address Management. + +Swapping differentiates itself from the other mentioned techniques because it's +undesirable since it's performed as a means to reduce memory under heavy +pressure. + +Swapping can't work for memory mapped by kernel logical addresses. These are a +subset of the kernel virtual space that directly maps a contiguous range of +physical memory. Given any logical address, its physical address is determined +with simple arithmetic on an offset. Accesses to logical addresses are fast +because they avoid the need for complex page table lookups at the expenses of +frames not being evictable and pageable out. + +If the kernel fails to make room for the data that must be present in the +physical frames, the kernel invokes the out-of-memory (OOM) killer to make room +by terminating lower priority processes until pressure reduces under a safe +threshold. + +Additionally, page faults may be also caused by code bugs or by maliciously +crafted addresses that the CPU is instructed to access. A thread of a process +could use instructions to address (non-shared) memory which does not belong to +its own address space, or could try to execute an instruction that want to write +to a read-only location. + +If the above-mentioned conditions happen in user-space, the kernel sends a +`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually +causes the termination of the thread and of the process it belongs to. + +This document is going to simplify and show an high altitude view of how the +Linux kernel handles these page faults, creates tables and tables' entries, +check if memory is present and, if not, requests to load data from persistent +storage or from other devices, and updates the MMU and its caches. + +The first steps are architecture dependent. Most architectures jump to +`do_page_fault()`, whereas the x86 interrupt handler is defined by the +`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`. + +Whatever the routes, all architectures end up to the invocation of +`handle_mm_fault()` which, in turn, (likely) ends up calling +`__handle_mm_fault()` to carry out the actual work of allocating the page +tables. + +The unfortunate case of not being able to call `__handle_mm_fault()` means +that the virtual address is pointing to areas of physical memory which are not +permitted to be accessed (at least from the current context). This +condition resolves to the kernel sending the above-mentioned SIGSEGV signal +to the process and leads to the consequences already explained. 
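
To make the layered walk that the next paragraphs describe concrete, a minimal
lookup sketch using the `*_offset()` helpers could look as follows. It is
illustrative only: it skips locking and huge PMD/PUD mappings, and
`lookup_pte()` is a hypothetical name, not an existing kernel helper::

    /*
     * Illustrative sketch of the pgd -> p4d -> pud -> pmd -> pte descent
     * using the *_offset() helpers named in this document.  Real code must
     * hold the relevant locks and handle huge PMD/PUD mappings.
     */
    #include <linux/mm.h>
    #include <linux/pgtable.h>

    static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd = pgd_offset(mm, addr);
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return NULL;
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d) || p4d_bad(*p4d))
            return NULL;
        pud = pud_offset(p4d, addr);
        if (pud_none(*pud) || pud_bad(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return NULL;
        return pte_offset_map(pmd, addr);   /* caller must pte_unmap() */
    }
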
+ +`__handle_mm_fault()` carries out its work by calling several functions to +find the entry's offsets of the upper layers of the page tables and allocate +the tables that it may need. + +The functions that look for the offset have names like `*_offset()`, where the +"*" is for pgd, p4d, pud, pmd, pte; instead the functions to allocate the +corresponding tables, layer by layer, are called `*_alloc`, using the +above-mentioned convention to name them after the corresponding types of tables +in the hierarchy. + +The page table walk may end at one of the middle or upper layers (PMD, PUD). + +Linux supports larger page sizes than the usual 4KB (i.e., the so called +`huge pages`). When using these kinds of larger pages, higher level pages can +directly map them, with no need to use lower level page entries (PTE). Huge +pages contain large contiguous physical regions that usually span from 2MB to +1GB. They are respectively mapped by the PMD and PUD page entries. + +The huge pages bring with them several benefits like reduced TLB pressure, +reduced page table overhead, memory allocation efficiency, and performance +improvement for certain workloads. However, these benefits come with +trade-offs, like wasted memory and allocation challenges. + +At the very end of the walk with allocations, if it didn't return errors, +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()` +performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`. +"read", "cow", "shared" give hints about the reasons and the kind of fault it's +handling. + +The actual implementation of the workflow is very complex. Its design allows +Linux to handle page faults in a way that is tailored to the specific +characteristics of each architecture, while still sharing a common overall +structure. + +To conclude this high altitude view of how Linux handles page faults, let's +add that the page faults handler can be disabled and enabled respectively with +`pagefault_disable()` and `pagefault_enable()`. + +Several code path make use of the latter two functions because they need to +disable traps into the page faults handler, mostly to prevent deadlocks. diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst index be75971532f5..b517ee28a955 100644 --- a/Documentation/mm/slub.rst +++ b/Documentation/mm/slub.rst @@ -9,7 +9,7 @@ SLUB can enable debugging only for selected slabs in order to avoid an impact on overall system performance which may make a bug more difficult to find. -In order to switch debugging on one can add an option ``slub_debug`` +In order to switch debugging on one can add an option ``slab_debug`` to the kernel command line. That will enable full debugging for all slabs. @@ -26,16 +26,16 @@ be enabled on the command line. F.e. no tracking information will be available without debugging on and validation can only partially be performed if debugging was not switched on. -Some more sophisticated uses of slub_debug: +Some more sophisticated uses of slab_debug: ------------------------------------------- -Parameters may be given to ``slub_debug``. If none is specified then full +Parameters may be given to ``slab_debug``. If none is specified then full debugging is enabled. Format: -slub_debug=<Debug-Options> +slab_debug=<Debug-Options> Enable options for all slabs -slub_debug=<Debug-Options>,<slab name1>,<slab name2>,... +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... 
Enable options only for select slabs (no spaces after a comma) @@ -60,23 +60,23 @@ Possible debug options are:: F.e. in order to boot just with sanity checks and red zoning one would specify:: - slub_debug=FZ + slab_debug=FZ Trying to find an issue in the dentry cache? Try:: - slub_debug=,dentry + slab_debug=,dentry to only enable debugging on the dentry cache. You may use an asterisk at the end of the slab name, in order to cover all slabs with the same prefix. For example, here's how you can poison the dentry cache as well as all kmalloc slabs:: - slub_debug=P,kmalloc-*,dentry + slab_debug=P,kmalloc-*,dentry Red zoning and tracking may realign the slab. We can just apply sanity checks to the dentry cache with:: - slub_debug=F,dentry + slab_debug=F,dentry Debugging options may require the minimum possible slab order to increase as a result of storing the metadata (for example, caches with PAGE_SIZE object @@ -84,20 +84,20 @@ sizes). This has a higher liklihood of resulting in slab allocation errors in low memory situations or if there's high fragmentation of memory. To switch off debugging for such caches by default, use:: - slub_debug=O + slab_debug=O You can apply different options to different list of slab names, using blocks of options. This will enable red zoning for dentry and user tracking for kmalloc. All other slabs will not get any debugging enabled:: - slub_debug=Z,dentry;U,kmalloc-* + slab_debug=Z,dentry;U,kmalloc-* You can also enable options (e.g. sanity checks and poisoning) for all caches except some that are deemed too performance critical and don't need to be debugged by specifying global debug options followed by a list of slab names with "-" as options:: - slub_debug=FZ;-,zs_handle,zspage + slab_debug=FZ;-,zs_handle,zspage The state of each debug option for a slab can be found in the respective files under:: @@ -105,7 +105,7 @@ under:: /sys/kernel/slab/<slab name>/ If the file contains 1, the option is enabled, 0 means disabled. The debug -options from the ``slub_debug`` parameter translate to the following files:: +options from the ``slab_debug`` parameter translate to the following files:: F sanity_checks Z red_zone @@ -129,7 +129,7 @@ in order to reduce overhead and increase cache hotness of objects. Slab validation =============== -SLUB can validate all object if the kernel was booted with slub_debug. In +SLUB can validate all object if the kernel was booted with slab_debug. In order to do so you must have the ``slabinfo`` tool. Then you can do :: @@ -150,29 +150,29 @@ list_lock once in a while to deal with partial slabs. That overhead is governed by the order of the allocation for each slab. The allocations can be influenced by kernel parameters: -.. slub_min_objects=x (default 4) -.. slub_min_order=x (default 0) -.. slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) +.. slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) -``slub_min_objects`` +``slab_min_objects`` allows to specify how many objects must at least fit into one slab in order for the allocation order to be acceptable. In general slub will be able to perform this number of allocations on a slab without consulting centralized resources (list_lock) where contention may occur. -``slub_min_order`` +``slab_min_order`` specifies a minimum order of slabs. A similar effect like - ``slub_min_objects``. + ``slab_min_objects``. 
-``slub_max_order`` - specified the order at which ``slub_min_objects`` should no +``slab_max_order`` + specified the order at which ``slab_min_objects`` should no longer be checked. This is useful to avoid SLUB trying to - generate super large order pages to fit ``slub_min_objects`` + generate super large order pages to fit ``slab_min_objects`` of a slab cache with large object sizes into one high order page. Setting command line parameter ``debug_guardpage_minorder=N`` (N > 0), forces setting - ``slub_max_order`` to 0, what cause minimum possible order of + ``slab_max_order`` to 0, what cause minimum possible order of slabs allocation. SLUB Debug output @@ -219,7 +219,7 @@ Here is a sample of slub debug output:: FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc If SLUB encounters a corrupted object (full detection requires the kernel -to be booted with slub_debug) then the following output will be dumped +to be booted with slab_debug) then the following output will be dumped into the syslog: 1. Description of the problem encountered @@ -239,7 +239,7 @@ into the syslog: pid=<pid of the process> (Object allocation / free information is only available if SLAB_STORE_USER is - set for the slab. slub_debug sets that option) + set for the slab. slab_debug sets that option) 2. The object contents if an object was involved. @@ -262,7 +262,7 @@ into the syslog: the object boundary. (Redzone information is only available if SLAB_RED_ZONE is set. - slub_debug sets that option) + slab_debug sets that option) Padding <address> : <bytes> Unused data to fill up the space in order to get the next object @@ -296,7 +296,7 @@ Emergency operations Minimal debugging (sanity checks alone) can be enabled by booting with:: - slub_debug=F + slab_debug=F This will be generally be enough to enable the resiliency features of slub which will keep the system running even if a bad kernel component will @@ -311,13 +311,13 @@ and enabling debugging only for that cache I.e.:: - slub_debug=F,dentry + slab_debug=F,dentry If the corruption occurs by writing after the end of the object then it may be advisable to enable a Redzone to avoid corrupting the beginning of other objects:: - slub_debug=FZ,dentry + slab_debug=FZ,dentry Extended slabinfo mode and plotting =================================== diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst index a834fad9de12..e4f6972eb6c0 100644 --- a/Documentation/mm/split_page_table_lock.rst +++ b/Documentation/mm/split_page_table_lock.rst @@ -58,7 +58,7 @@ Support of split page table lock by an architecture =================================================== There's no need in special enabling of PTE split page table lock: everything -required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which +required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which must be called on PTE table allocation / freeing. Make sure the architecture doesn't use slab allocator for page table @@ -68,8 +68,8 @@ This field shares storage with page->ptl. PMD split lock only makes sense if you have more than two page table levels. -PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table -allocation and pgtable_pmd_page_dtor() on freeing. +PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table +allocation and pagetable_pmd_dtor() on freeing. 
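
To show where the constructors fit, a PTE table allocation path loosely
modeled on the generic helpers might look like the sketch below; the
ptdesc-based names (pagetable_alloc(), ptdesc_page()) are assumed from
asm-generic/pgalloc.h and should be checked against the actual tree. PMD
tables follow the same pattern with pagetable_pmd_ctor() in pmd_alloc_one(),
as the following sentences describe::

    /*
     * Sketch only: allocate one PTE table page and run the split-lock
     * constructor.  pagetable_alloc()/pagetable_pte_ctor()/ptdesc_page()
     * are assumed here from the generic helpers; the constructor may fail.
     */
    #include <linux/mm.h>
    #include <asm/pgalloc.h>

    static pgtable_t example_pte_alloc_one(struct mm_struct *mm)
    {
        struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_USER, 0);

        if (!ptdesc)
            return NULL;
        if (!pagetable_pte_ctor(ptdesc)) {
            pagetable_free(ptdesc);
            return NULL;
        }
        return ptdesc_page(ptdesc);
    }
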
Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing @@ -77,7 +77,7 @@ paths: i.e X86_PAE preallocate few PMDs on pgd_alloc(). With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK. -NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must +NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must be handled properly. page->ptl @@ -97,7 +97,7 @@ trick: split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs one more cache line for indirect access; -The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in -pgtable_pmd_page_ctor() for PMD table. +The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in +pagetable_pmd_ctor() for PMD table. Please, never access page->ptl directly -- use appropriate helper. diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..93c9239b9ebe 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -117,7 +117,7 @@ pages: - map/unmap of a PMD entry for the whole THP increment/decrement folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount + folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement @@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio() Unmapping part of THP (with munmap() or other way) is not going to free memory immediately. Instead, we detect that a subpage of THP is not in use -in page_remove_rmap() and queue the THP for splitting if memory pressure +in folio_remove_rmap_*() and queue the THP for splitting if memory pressure comes. Splitting will free up unused subpages. Splitting the page right away is not an option due to locking context in diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index d5ac8511eb67..b6a07a26b10d 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -463,7 +463,7 @@ can request that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap() call. There is one important and subtle difference here, though. mmap() + mlock() will fail if the range cannot be faulted in (e.g. because mm_populate fails) and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. -The mmaped area will still have properties of the locked area - pages will not +The mmapped area will still have properties of the locked area - pages will not get swapped out - but major page faults to fault memory in might still happen. Furthermore, any mmap() call or brk() call that expands the heap by a task @@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing. -For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). @@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages which had been Copied-On-Write from the file pages now being truncated. 
Mlocked pages can be munlocked and deleted in this way: like with munmap(), -for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index a4b12ff906c4..593ede6d314b 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -1,3 +1,4 @@ + .. SPDX-License-Identifier: GPL-2.0 ========================================= @@ -10,14 +11,14 @@ HugeTLB This section is to explain how HugeTLB Vmemmap Optimization (HVO) works. The ``struct page`` structures are used to describe a physical page frame. By -default, there is a one-to-one mapping from a page frame to it's corresponding +default, there is a one-to-one mapping from a page frame to its corresponding ``struct page``. HugeTLB pages consist of multiple base page size pages and is supported by many architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page -consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages. +consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages. For each base page, there is a corresponding ``struct page``. Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to @@ -210,6 +211,7 @@ the device (altmap). The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). +For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst The differences with HugeTLB are relatively minor. diff --git a/Documentation/mm/zsmalloc.rst b/Documentation/mm/zsmalloc.rst index a3c26d587752..76902835e68e 100644 --- a/Documentation/mm/zsmalloc.rst +++ b/Documentation/mm/zsmalloc.rst @@ -263,3 +263,8 @@ is heavy internal fragmentation and zspool compaction is unable to relocate objects and release zspages. In these cases, it is recommended to decrease the limit on the size of the zspage chains (as specified by the CONFIG_ZSMALLOC_CHAIN_SIZE option). + +Functions +========= + +.. kernel-doc:: mm/zsmalloc.c |
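
For orientation, the pool API covered by the kernel-doc section added above is
typically used along the following lines. This is a hedged sketch: the
error-checking convention of zs_malloc() in particular is assumed and may
differ between kernel versions::

    #include <linux/err.h>
    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <linux/string.h>
    #include <linux/zsmalloc.h>

    /* Sketch: create a pool, store one buffer in it, then tear it down. */
    static int zsmalloc_example(const void *src, size_t len)
    {
        struct zs_pool *pool = zs_create_pool("example");
        unsigned long handle;
        void *obj;

        if (!pool)
            return -ENOMEM;

        handle = zs_malloc(pool, len, GFP_KERNEL);
        if (IS_ERR_VALUE(handle)) {         /* assumed error convention */
            zs_destroy_pool(pool);
            return -ENOMEM;
        }

        obj = zs_map_object(pool, handle, ZS_MM_WO);    /* write-only mapping */
        memcpy(obj, src, len);
        zs_unmap_object(pool, handle);

        zs_free(pool, handle);
        zs_destroy_pool(pool);
        return 0;
    }
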