path: root/include
Age | Commit message | Author | Files | Lines
2021-07-01 | mm/rmap: split migration into its own function | Alistair Popple | 1 file | -3/+1
Migration is currently implemented as a mode of operation for try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE. However it does not have much in common with the rest of the unmap functionality of try_to_unmap_one() and thus splitting it into a separate function reduces the complexity of try_to_unmap_one() making it more readable. Several simplifications can also be made in try_to_migrate_one() based on the following observations: - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK. - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON. - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH. TTU_SPLIT_FREEZE is a special case of migration used when splitting an anonymous page. This is most easily dealt with by calling the correct function from unmap_page() in mm/huge_memory.c - either try_to_migrate() for PageAnon or try_to_unmap(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Cc: Ben Skeggs <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
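To make the split concrete, here is a minimal sketch of the unmap_page() dispatch described above; the flag combination is illustrative only (not copied from mm/huge_memory.c), the point is that anonymous pages go through try_to_migrate() and file pages through try_to_unmap().

static void unmap_page(struct page *page)
{
        /* Flag choice is illustrative; the dispatch is what matters. */
        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
                                   TTU_IGNORE_MLOCK;

        VM_BUG_ON_PAGE(!PageHead(page), page);

        /*
         * Anonymous pages need migration entries to preserve them across
         * the split; file pages can simply be unmapped and faulted back
         * in later.
         */
        if (PageAnon(page))
                try_to_migrate(page, ttu_flags);
        else
                try_to_unmap(page, ttu_flags);
}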
2021-07-01 | mm/rmap: split try_to_munlock from try_to_unmap | Alistair Popple | 1 file | -2/+1
The behaviour of try_to_unmap_one() is difficult to follow because it performs different operations based on a fairly large set of flags used in different combinations. TTU_MUNLOCK is one such flag. However it is exclusively used by try_to_munlock() which specifies no other flags. Therefore, rather than overload try_to_unmap_one() with unrelated behaviour, split this out into its own function and remove the flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Ben Skeggs <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01 | mm/swapops: rework swap entry manipulation code | Alistair Popple | 1 file | -26/+30
Both migration and device private pages use special swap entries that are manipulated by a range of inline functions. The arguments to these are somewhat inconsistent so rework them to remove flag type arguments and to make the arguments similar for both read and write entry creation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Cc: Ben Skeggs <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
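A hedged sketch of the reworked calling convention: read vs. write is encoded in the helper name rather than a flag argument, and the helpers take a bare pfn. The wrapper function below is hypothetical; the make_*_migration_entry() names are the ones this series adds to <linux/swapops.h>.

static pte_t make_migration_pte(struct page *page, bool writable)
{
        swp_entry_t entry;

        /* No flag argument: the helper name says whether it is writable. */
        if (writable)
                entry = make_writable_migration_entry(page_to_pfn(page));
        else
                entry = make_readable_migration_entry(page_to_pfn(page));

        return swp_entry_to_pte(entry);
}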
2021-07-01 | mm: remove special swap entry functions | Alistair Popple | 2 files | -48/+25
Patch series "Add support for SVM atomics in Nouveau", v11. Introduction ============ Some devices have features such as atomic PTE bits that can be used to implement atomic access to system memory. To support atomic operations to a shared virtual memory page such a device needs access to that page which is exclusive of the CPU. This series introduces a mechanism to temporarily unmap pages granting exclusive access to a device. These changes are required to support OpenCL atomic operations in Nouveau to shared virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the OpenCL SVM feature is available at https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/ OpenCL_API.html#_shared_virtual_memory . Implementation ============== Exclusive device access is implemented by adding a new swap entry type (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main difference is that on fault the original entry is immediately restored by the fault handler instead of waiting. Restoring the entry triggers calls to MMU notifers which allows a device driver to revoke the atomic access permission from the GPU prior to the CPU finalising the entry. Patches ======= Patches 1 & 2 refactor existing migration and device private entry functions. Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated functionality into separate functions - try_to_migrate_one() and try_to_munlock_one(). Patch 5 renames some existing code but does not introduce functionality. Patch 6 is a small clean-up to swap entry handling in copy_pte_range(). Patch 7 contains the bulk of the implementation for device exclusive memory. Patch 8 contains some additions to the HMM selftests to ensure everything works as expected. Patch 9 is a cleanup for the Nouveau SVM implementation. Patch 10 contains the implementation of atomic access for the Nouveau driver. Testing ======= This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program which checks that GPU atomic accesses to system memory are atomic. Without this series the test fails as there is no way of write-protecting the page mapping which results in the device clobbering CPU writes. For reference the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ Further testing has been performed by adding support for testing exclusive access to the hmm-tests kselftests. This patch (of 10): Remove multiple similar inline functions for dealing with different types of special swap entries. Both migration and device private swap entries use the swap offset to store a pfn. Instead of multiple inline functions to obtain a struct page for each swap entry type use a common function pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn() functions as this results is shorter code that is easier to understand. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Ralph Campbell <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Ben Skeggs <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01 | mm/thp: define default pmd_pgtable() | Anshuman Khandual | 1 file | -0/+9
Currently most platforms define pmd_pgtable() as pmd_page(), duplicating the same code all over. Instead just define a default value, i.e. pmd_page(), for pmd_pgtable() and let platforms override when required via <asm/pgtable.h>. All the existing platforms that override pmd_pgtable() have had their definitions moved into their respective <asm/pgtable.h> headers so that they precede the new generic definition. This makes it much cleaner with reduced code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Nick Hu <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Guo Ren <[email protected]> Cc: Brian Cain <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Michal Simek <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Stafford Horne <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Chris Zankel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
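The generic fallback described above is essentially a one-liner; a hedged sketch of what <linux/pgtable.h> gains, with architectures that need something else defining pmd_pgtable in <asm/pgtable.h> beforehand.

/* Generic default; an architecture-specific definition takes precedence. */
#ifndef pmd_pgtable
#define pmd_pgtable(pmd) pmd_page(pmd)
#endif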
2021-07-01 | mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM | Mel Gorman | 1 file | -1/+4
make W=1 generates the following warning in mm/workingset.c for allnoconfig:

  mm/workingset.c: In function `unpack_shadow':
  mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
    int memcgid, nid;
                 ^~~

On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing nid. Make the helper an inline function to suppress the warning, add type checking and apply any side-effects in the parameter list. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Streetman <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
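A hedged sketch of the change: the FLATMEM macro ignored its argument entirely, so converting it to an inline function makes nid count as used and adds type checking.

/* Previously: #define NODE_DATA(nid) (&contig_page_data) */
extern struct pglist_data contig_page_data;

static inline struct pglist_data *NODE_DATA(int nid)
{
        return &contig_page_data;
}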
2021-07-01 | mm/swap: make swap_address_space an inline function | Mel Gorman | 1 file | -1/+5
make W=1 generates the following warning in page_mapping() for allnoconfig:

  mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
    swp_entry_t entry;
                ^~~~~

swap_address_space() is a #define on !CONFIG_SWAP configurations. Make the helper an inline function to suppress the warning, add type checking and apply any side-effects in the parameter list. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Streetman <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01 | mm: fix spelling mistakes | Zhen Lei | 3 files | -3/+3
Fix some spelling mistakes in comments:

  each having differents usage ==> each has a different usage
  statments ==> statements
  adresses ==> addresses
  aggresive ==> aggressive
  datas ==> data
  posion ==> poison
  higer ==> higher
  precisly ==> precisely
  wont ==> won't
  We moves tha ==> We move the
  endianess ==> endianness

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Reviewed-by: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01 | mm: define default value for FIRST_USER_ADDRESS | Anshuman Khandual | 1 file | -0/+9
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the same code all over. Instead just define a generic default value (i.e 0UL) for FIRST_USER_ADDRESS and let the platforms override when required. This makes it much cleaner with reduced code. The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h> when the given platform overrides its value via <asm/pgtable.h>. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> [m68k] Acked-by: Guo Ren <[email protected]> [csky] Acked-by: Stafford Horne <[email protected]> [openrisc] Acked-by: Catalin Marinas <[email protected]> [arm64] Acked-by: Mike Rapoport <[email protected]> Acked-by: Palmer Dabbelt <[email protected]> [RISC-V] Cc: Richard Henderson <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Guo Ren <[email protected]> Cc: Brian Cain <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Michal Simek <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Stafford Horne <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Chris Zankel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
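A hedged sketch of the generic default in <linux/pgtable.h>; any architecture that needs a non-zero value keeps its own definition in <asm/pgtable.h>, which is seen first.

/* Generic default; skipped when the platform already defines it. */
#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS      0UL
#endif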
2021-07-01 | mm: fix typos and grammar error in comments | Hyeonggon Yoo | 1 file | -1/+1
We moves tha -> We move that in mm/swap.c
statments -> statements in include/linux/mm.h

Link: https://lkml.kernel.org/r/20210509063444.GA24745@hyeyoo Signed-off-by: Hyeonggon Yoo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-01 | Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm | Rafael J. Wysocki | 2 files | -1/+1
Pull ARM cpufreq updates for v5.14-rc1 from Viresh Kumar:

"- Add frequency invariance support for CPPC driver again and related fixes/changes.
 - Minor changes/cleanups for Mediatek driver (Fabien Parent and Seiya Wang), Qcom platform (Sibi Sankar), and SCMI driver (Christophe JAILLET).
 - New bindings for generic performance domains (Sudeep Holla).
 - Rename black/white-lists (Viresh Kumar)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: CPPC: Add support for frequency invariance
  arch_topology: Avoid use-after-free for scale_freq_data
  cpufreq: CPPC: Pass structure instance by reference
  cpufreq: CPPC: Fix potential memleak in cppc_cpufreq_cpu_init
  dt-bindings: cpufreq: update cpu type and clock name for MT8173 SoC
  clk: mediatek: remove deprecated CLK_INFRA_CA57SEL for MT8173 SoC
  cpufreq: dt: Rename black/white-lists
  cpufreq: scmi: Fix an error message
  cpufreq: mediatek: add support for mt8365
  dt-bindings: dvfs: Add support for generic performance domains
  cpufreq: blacklist SC7280 in cpufreq-dt-platdev
2021-07-01 | Merge tag 'asoc-v5.14' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus | Takashi Iwai | 8 files | -6/+232
ASoC: Updates for v5.14

This release sees a nice new feature in the core from Morimoto-san, support for automatic negotiation of DAI formats between the components on the link. Otherwise the big highlight was the merging of the Tegra machine drivers into a single driver avoiding a bunch of duplication.

 - Support for automatic negotiation of DAI formats.
 - Accessory detection support for several Qualcomm parts.
 - Support for IEC958 control with hdmi-codec.
 - Merging of Tegra machine drivers into a single driver.
 - Support for AmLogic SM1 TOACODEC, Intel AlderLake-M, several NXP i.MX8 variants, NXP TFA1 and TDF9897, Rockchip RK817, Qualcomm Quinary MI2S, Texas Instruments TAS2505
2021-07-01 | Merge branch 'for-next' into for-linus | Takashi Iwai | 6 files | -97/+88
2021-06-30 | mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables | David Hildenbrand | 1 file | -0/+3
I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. On preallcoation/prefaulting from user space Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach 1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem. 2) Can result in mmap_sem contention when prefaulting via multiple threads. 3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to e.g, handle hardware errors. "Simply" doing preallocation concurrently from other thread is not that easy. III. On MADV_WILLNEED Extending MADV_WILLNEED is not an option because 1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/ preallocate all memory and definitely fail on errors.". 2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent. IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics: 1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. 
Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes. 2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated. 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL. 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated. 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range. 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process. While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. while not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting. MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explciit fallocate() upfront to achieve preallcoation+prepopulation. Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed. Looking at the history, MADV_POPULATE was already proposed in 2013 [1], however, the main motivation back than was performance improvements -- which should also still be the case. V. Single-threaded performance comparison I did a short experiment, prefaulting page tables on completely *empty mappings/files* and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings. V.1: Private mappings POPULATE_READ and POPULATE_WRITE is fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which result in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE. V.2: Shared mappings fallocate() is fastest, however, doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without a fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE. 
With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty. v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms 
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 
661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [[email protected]: coding style fixes] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Max Filippov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Ram Pai <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
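To make the intended usage concrete, a hedged userspace sketch that prefaults a sparse anonymous mapping with MADV_POPULATE_WRITE instead of touching every page by hand; the fallback constants mirror the uapi values and are an assumption if your headers predate this series.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ      22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE     23
#endif

int main(void)
{
        size_t len = 256UL << 20;       /* 256 MiB sparse region */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /*
         * Preallocate backend memory and prefault page tables up front;
         * on failure we get an errno here instead of a SIGBUS later.
         */
        if (madvise(p, len, MADV_POPULATE_WRITE))
                perror("madvise(MADV_POPULATE_WRITE)");

        munmap(p, len);
        return 0;
}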
2021-06-30 | mm: rmap: make try_to_unmap() void function | Yang Shi | 1 file | -1/+1
Currently try_to_unmap() returns a bool value determined by checking page_mapcount(), however this may return a false positive since page_mapcount() doesn't check all subpages of a compound page. The total_mapcount() could be used instead, but its cost is higher since it traverses all subpages. Actually most callers of try_to_unmap() don't care about the return value at all. So we just need to check whether the page is still mapped via page_mapped() when necessary. And page_mapped() does bail out early when it finds a mapped subpage. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Hugh Dickins <[email protected]> Signed-off-by: Yang Shi <[email protected]> Acked-by: Minchan Kim <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jue Wang <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Wang Yugui <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm: thp: refactor NUMA fault handling | Yang Shi | 1 file | -23/+0
When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. This patch reworks the NUMA fault handling to use the generic migration implementation to migrate the misplaced page. There is no functional change. After the refactor the flow of NUMA fault handling looks just like its PTE counterpart:

  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP
  If migration fails just restore the old normal PMD

In the old code anon_vma lock was needed to serialize THP migration against THP split, but since then the THP code has been reworked a lot, and it seems the anon_vma lock is not required anymore to avoid the race. The page refcount elevation when holding ptl should prevent THP split. Use migrate_misplaced_page() for both base page and THP NUMA hinting faults and remove all the dead and duplicate code. [[email protected]: fix a double unlock bug] Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm: memory: add orig_pmd to struct vm_fault | Yang Shi | 2 files | -6/+10
Pach series "mm: thp: use generic THP migration for NUMA hinting fault", v3. When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. It is definitely a maintenance burden to keep two THP migration implementation for different code paths and it is more error prone. Using the generic THP migration implementation allows us remove the duplicate code and some hacks needed by the old ad hoc implementation. A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP and NUMA balancing. The most of them support THP migration except for S390. Zi Yan tried to add THP migration support for S390 before but it was not accepted due to the design of S390 PMD. For the discussion, please see: https://lkml.org/lkml/2018/4/27/953. Per the discussion with Gerald Schaefer in v1 it is acceptible to skip huge PMD for S390 for now. I saw there were some hacks about gup from git history, but I didn't figure out if they have been removed or not since I just found FOLL_NUMA code in the current gup implementation and they seems useful. Patch #1 ~ #2 are preparation patches. Patch #3 is the real meat. Patch #4 ~ #6 keep consistent counters and behaviors with before. Patch #7 skips change huge PMD to prot_none if thp migration is not supported. Test ---- Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM has 80 vcpus and 64G memory. The test would create 2 processes to consume 128G memory together which would incur memory pressure to cause THP splits. And it also creates 80 processes to hog cpu, and the memory consumer processes are bound to different nodes periodically in order to increase NUMA faults. The below test script is used: echo 3 > /proc/sys/vm/drop_caches # Run stress-ng for 24 hours ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h & PID=$! ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h & # Wait for vm stressors forked sleep 5 PID_1=`pgrep -P $PID | awk 'NR == 1'` PID_2=`pgrep -P $PID | awk 'NR == 2'` JOB1=`pgrep -P $PID_1` JOB2=`pgrep -P $PID_2` # Bind load jobs to different nodes periodically to force generate # cross node memory access while [ -d "/proc/$PID" ] do taskset -apc 8 $JOB1 taskset -apc 8 $JOB2 sleep 300 taskset -apc 58 $JOB1 taskset -apc 58 $JOB2 sleep 300 done With the above test the histogram of latency of do_huge_pmd_numa_page is as shown below. Since the number of do_huge_pmd_numa_page varies drastically for each run (should be due to scheduler), so I converted the raw number to percentage. patched base @us[stress-ng]: [0] 3.57% 0.16% [1] 55.68% 18.36% [2, 4) 10.46% 40.44% [4, 8) 7.26% 17.82% [8, 16) 21.12% 13.41% [16, 32) 1.06% 4.27% [32, 64) 0.56% 4.07% [64, 128) 0.16% 0.35% [128, 256) < 0.1% < 0.1% [256, 512) < 0.1% < 0.1% [512, 1K) < 0.1% < 0.1% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) < 0.1% < 0.1% [16K, 32K) < 0.1% < 0.1% [32K, 64K) < 0.1% < 0.1% Per the result, patched kernel is even slightly better than the base kernel. I think this is because the lock contention against THP split is less than base kernel due to the refactor. To exclude the affect from THP split, I also did test w/o memory pressure. No obvious regression is spotted. The below is the test result *w/o* memory pressure. 
patched base @us[stress-ng]: [0] 7.97% 18.4% [1] 69.63% 58.24% [2, 4) 4.18% 2.63% [4, 8) 0.22% 0.17% [8, 16) 1.03% 0.92% [16, 32) 0.14% < 0.1% [32, 64) < 0.1% < 0.1% [64, 128) < 0.1% < 0.1% [128, 256) < 0.1% < 0.1% [256, 512) 0.45% 1.19% [512, 1K) 15.45% 17.27% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) 0.86% 0.88% [16K, 32K) < 0.1% 0.15% [32K, 64K) < 0.1% < 0.1% [64K, 128K) < 0.1% < 0.1% [128K, 256K) < 0.1% < 0.1% The series also survived a series of tests that exercise NUMA balancing migrations by Mel. This patch (of 7): Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge page fault could be removed, just like its PTE counterpart does. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Zi Yan <[email protected]> Cc: Huang Ying <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm: migrate: fix missing update page_private to hugetlb_page_subpool | Muchun Song | 1 file | -0/+5
Since commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") converts page.private for hugetlb specific page flags. We should use hugetlb_page_subpool() to get the subpool pointer instead of page_private(). This 'could' prevent the migration of hugetlb pages. page_private(hpage) is now used for hugetlb page specific flags. At migration time, the only flag which could be set is HPageVmemmapOptimized. This flag will only be set if the new vmemmap reduction feature is enabled. In addition, !page_mapping() implies an anonymous mapping. So, this will prevent migration of hugetb pages in anonymous mappings if the vmemmap reduction feature is enabled. In addition, that if statement checked for the rare race condition of a page being migrated while in the process of being freed. Since that check is now wrong, we could leak hugetlb subpool usage counts. The commit forgot to update it in the page migration routine. So fix it. [[email protected]: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Muchun Song <[email protected]> Reported-by: Anshuman Khandual <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Tested-by: Anshuman Khandual <[email protected]> [arm64] Cc: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | arm64/mm: drop HAVE_ARCH_PFN_VALID | Anshuman Khandual | 1 file | -0/+9
CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | memblock: update initialization of reserved pages | Mike Rapoport | 1 file | -1/+3
The struct pages representing a reserved memory region are initialized using reserve_bootmem_range() function. This function is called for each reserved region just before the memory is freed from memblock to the buddy page allocator. The struct pages for MEMBLOCK_NOMAP regions are kept with the default values set by the memory map initialization which makes it necessary to have a special treatment for such pages in pfn_valid() and pfn_valid_within(). Split out initialization of the reserved pages to a function with a meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the reserved regions and mark struct pages for the NOMAP regions as PageReserved. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
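A hedged sketch of the split-out initializer described above, built only from existing memblock iterators; the function name and exact placement are assumptions rather than a verbatim copy.

static void __init memmap_init_reserved_pages(void)
{
        struct memblock_region *region;
        phys_addr_t start, end;
        u64 i;

        /* struct pages for explicitly reserved regions */
        for_each_reserved_mem_range(i, &start, &end)
                reserve_bootmem_region(start, end);

        /* treat NOMAP regions the same way and mark them PageReserved */
        for_each_mem_region(region) {
                if (memblock_is_nomap(region)) {
                        start = region->base;
                        end = start + region->size;
                        reserve_bootmem_region(start, end);
                }
        }
}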
2021-06-30 | include/linux/mmzone.h: add documentation for pfn_valid() | Mike Rapoport | 1 file | -0/+11
Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Suggested-by: Anshuman Khandual <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies | Ben Widawsky | 1 file | -5/+2
Current structure 'mempolicy' uses a union to store the node info for bind/interleave/prefer policies:

  union {
          short preferred_node;   /* preferred */
          nodemask_t nodes;       /* interleave/bind */
          /* undefined for default */
  } v;

Since a preferred node can also be represented by a nodemask_t with only one bit set, unify these policies by using one nodemask_t 'nodes', which can remove the union, simplify the code and make it easier to support node info for new policies in the future. Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Co-developed-by: Feng Tang <[email protected]> Signed-off-by: Ben Widawsky <[email protected]> Signed-off-by: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy | Feng Tang | 1 file | -1/+0
MPOL_LOCAL policy has been setup as a real policy, but it is still handled like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit set, and there are many places having to judge the real 'prefer' or the 'local' policy, which are quite confusing. In current code, there are 4 cases that MPOL_LOCAL are used: 1. user specifies 'local' policy 2. user specifies 'prefer' policy, but with empty nodemask 3. system 'default' policy is used 4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES flag set, and when it is 'rebind' to a nodemask which doesn't contains the 'preferred' node, it will perform as 'local' policy So make 'local' a real policy instead of a fake 'prefer' one, and kill MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading. For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal Hocko pointed out: : I do believe that rebinding preferred policy is just bogus and it should : be dropped altogether on the ground that a preference is a mere hint from : userspace where to start the allocation. Unless I am missing something : cpusets will be always authoritative for the final placement. The : preferred node just acts as a starting point and it should be really : preserved when cpusets changes. Otherwise we have a very subtle behavior : corner cases. So dump all the tricky transformation between 'prefer' and 'local', and just record the new nodemask of rebinding. [[email protected]: fix a problem in mpol_set_nodemask(), per Michal Hocko] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: refine code and comments of mpol_set_nodemask(), per Michal] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Huang Ying <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm/mempolicy: cleanup nodemask intersection check for oom | Feng Tang | 1 file | -1/+1
Patch series "mm/mempolicy: some fix and semantics cleanup", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [[email protected]: changelog edits] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feng Tang <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Huang Ying <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30 | mm/zbud: don't export any zbud API | Miaohe Lin | 1 file | -23/+0
The zbud doesn't need to export any API and it is meant to be used via zpool API since the commit 12d79d64bfd3 ("mm/zpool: update zswap to use zpool"). So we can remove the unneeded zbud.h and move down zpool API to avoid any forward declaration. [[email protected]: fix unused function warnings when CONFIG_ZPOOL is disabled] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Nathan Chancellor <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
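Since zbud is now reached only through zpool, a hedged sketch of that path; zpool_create_pool() and friends are the existing zpool API, while the demo function, pool name, allocation size and the NULL ops argument are illustrative assumptions.

static int zbud_via_zpool_demo(void)
{
        struct zpool *pool;
        unsigned long handle;
        void *obj;

        /* NULL ops assumed acceptable when no eviction callback is needed */
        pool = zpool_create_pool("zbud", "demo", GFP_KERNEL, NULL);
        if (!pool)
                return -ENOMEM;

        if (zpool_malloc(pool, 128, GFP_KERNEL, &handle) == 0) {
                obj = zpool_map_handle(pool, handle, ZPOOL_MM_RW);
                memset(obj, 0, 128);
                zpool_unmap_handle(pool, handle);
                zpool_free(pool, handle);
        }

        zpool_destroy_pool(pool);
        return 0;
}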
2021-06-30 | mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() | David Hildenbrand | 1 file | -0/+10
A driver might set a page logically offline -- PageOffline() -- and turn the page inaccessible in the hypervisor; after that, access to page content can be fatal. One example is virtio-mem; while unplugged memory -- marked as PageOffline() -- can currently be read in the hypervisor, this will no longer be the case in the future; for example, when having a virtio-mem device backed by huge pages in the hypervisor.

Some special PFN walkers -- i.e., /proc/kcore -- read content of random pages after checking PageOffline(); however, these PFN walkers can race with drivers that set PageOffline(). Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing. page_offline_freeze()/page_offline_thaw() allows for a subsystem to synchronize with such drivers, achieving that a page cannot be set PageOffline() while frozen. page_offline_begin()/page_offline_end() is used by drivers that care about such races when setting a page PageOffline(). For simplicity, use a rwsem for now; neither drivers nor users are performance sensitive.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
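A hedged sketch of how the two halves of the new API pair up; the page_offline_*() names are the ones introduced here, while the surrounding functions are hypothetical illustrations.

/* PFN-walker side (e.g. /proc/kcore): PageOffline() cannot be set while frozen. */
static bool read_page_if_online(struct page *page, void *buf)
{
        bool ok = false;

        page_offline_freeze();
        if (!PageOffline(page)) {
                memcpy(buf, page_to_virt(page), PAGE_SIZE);
                ok = true;
        }
        page_offline_thaw();

        return ok;
}

/* Driver side (e.g. virtio-mem): set PageOffline() under the same rwsem. */
static void mark_page_offline(struct page *page)
{
        page_offline_begin();
        __SetPageOffline(page);
        page_offline_end();
}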
2021-06-30 | fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages | David Hildenbrand | 1 file | -0/+12
Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is stale as the memory is effectively unused by the kernel. On s390x with standby memory, offline memory sections (belonging to offline storage increments) are not accessible. With virtio-mem and the hyper-v balloon, we can have unavailable memory chunks that should not be accessed inside offline memory sections. Last but not least, offline memory sections might contain hwpoisoned pages which we can no longer identify because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as "The content of these pages is effectively stale. Such pages should not be touched (read/write/dump/save) except by their owner.". Examples include pages inflated in a balloon or unavailable memory ranges inside hotplugged memory sections with virtio-mem or the hyper-v balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. As documented: "Accessing is not safe since it may cause another machine check. Don't touch!"

Introduce is_page_hwpoison(), adding a comment that it is inherently racy but best we can really do. Reading /proc/kcore now performs similar checks as when reading /proc/vmcore for kdump via makedumpfile: problematic pages are excluded. It's also similar to hibernation code, however, we don't skip hwpoisoned pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.

Note 1: we can race against memory offlining code, especially memory going offline and getting unplugged: however, we will properly tear down the identity mapping and handle faults gracefully when accessing this memory from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning memory inaccessible in the hypervisor. We'll handle this in a follow-up patch.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Aili Yao <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Alex Shi <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Steven Price <[email protected]> Cc: Wei Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
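A hedged sketch of the is_page_hwpoison() check mentioned above; as the commit notes, it is inherently racy, since the flag can change under a lockless PFN walker.

static inline bool is_page_hwpoison(struct page *page)
{
        if (PageHWPoison(page))
                return true;
        /* hugetlb pages keep the flag on the head page only */
        return PageHuge(page) && PageHWPoison(compound_head(page));
}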
2021-06-30fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHERDavid Hildenbrand1-3/+0
Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3. Looking for places where the kernel might unconditionally read PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore needs some more love to not touch some other pages we really don't want to read -- i.e., hwpoisoned ones. Examples for PageOffline() pages are pages inflated in a balloon, memory unplugged via virtio-mem, and partially-present sections in memory added by the Hyper-V balloon. When reading pages inflated in a balloon, we essentially produce unnecessary load in the hypervisor; holes in partially present sections in case of Hyper-V are not accessible and already were a problem for /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In the future, virtio-mem might disallow reading unplugged memory -- marked as PageOffline() -- in some environments, resulting in undefined behavior when accessed; therefore, I'm trying to identify and rework all these (corner) cases. With this series, there is really only access via /dev/mem, /proc/vmcore and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case use case -- we won't care for now if someone explicitly tries to do nasty things by reading from/writing to physical addresses we better not touch. /dev/mem is a use case we won't support for virtio-mem, at least for now, so we'll simply disallow mapping any virtio-mem memory via /dev/mem next. /proc/vmcore is really only a problem when dumping the old kernel via something that's not makedumpfile (read: basically never), however, we'll try sanitizing that as well in the second kernel in the future. Tested via kcore_dump: https://github.com/schlafwandler/kcore_dump This patch (of 6): Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()") removed the last user of KCORE_REMAP. Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page") removed the last user of KCORE_OTHER. Let's drop both types. While at it, also drop vaddr in "struct kcore_list", used by KCORE_REMAP only. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Alex Shi <[email protected]> Cc: Steven Price <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Aili Yao <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Wei Liu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_lowYu Zhao1-41/+0
mm_vmscan_inactive_list_is_low has no users after commit b91ac374346b ("mm: vmscan: enforce inactive:active ratio at the reclaim root"). Remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()Axel Rasmussen1-0/+5
In a previous commit, we added the mfill_atomic_install_pte() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases. In other words, shmem_mfill_atomic_pte() duplicates a case it already handles. So, expose it, and let shmem_mfill_atomic_pte() use it directly, to reduce code duplication. This requires that we refactor shmem_mfill_atomic_pte() a bit: Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it afterward. This frees up mfill_atomic_install_pte() from having to care about this accounting, and means we don't need to e.g. shmem_uncharge() in the error path. A side effect is this switches shmem_mfill_atomic_pte() to use lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add(). This wrapper does some extra accounting in an exceptional case, if appropriate, so it's actually the more correct thing to use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30userfaultfd/shmem: advertise shmem minor fault supportAxel Rasmussen1-1/+6
Now that the feature is fully implemented (the faulting path hooks exist so userspace is notified, and the ioctl to resolve such faults is available), advertise this as a supported feature. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
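From userspace, the new feature bit can be probed with the usual UFFDIO_API handshake. A small example in C (assumes uapi headers that define UFFD_FEATURE_MINOR_SHMEM, i.e. 5.14 or later):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            /* requesting features = 0 makes the kernel report what it supports */
            struct uffdio_api api = { .api = UFFD_API, .features = 0 };
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api)) {
                    perror("userfaultfd/UFFDIO_API");
                    return 1;
            }
            printf("shmem minor faults %ssupported\n",
                   api.features & UFFD_FEATURE_MINOR_SHMEM ? "" : "not ");
            return 0;
    }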
2021-06-30userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pteAxel Rasmussen1-11/+8
Patch series "userfaultfd: add minor fault handling for shmem", v6. Overview ======== See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas. This series is structured as follows: - Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commit 5 advertises that the feature is now available since at this point it's fully implemented. - Commit 6 is a final cleanup, modifying an existing code path to re-use a new helper we've introduced. - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature. Use Case ======== In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series. Additionally, Android folks (Lokesh Gidra <[email protected]>) hope to optimize the Android Runtime garbage collector using this feature: "The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required." [1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t This patch (of 9): Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte). This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc. While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Reviewed-by: Peter Xu <[email protected]> Acked-by: Hugh Dickins <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/userfaultfd: fix uffd-wp special cases for fork()Peter Xu2-1/+3
We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") previously, but it didn't get it all right. A few fixes around the code path: 1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather than the new vma. That was overlooked in b569a1760782, so it won't work as expected. Thanks to the recent rework on fork code (7a4830c380f3a8b3), we can easily get the new vma now, so switch the checks to that. 2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd is a migration huge pmd. When that happens, instead of using pmd_uffd_wp(), we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately. 3. We forgot to carry over the uffd-wp bit for a write migration huge pmd entry. This also happens in copy_huge_pmd(), where we converted a write huge migration entry into a read one. 4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes. 5. In copy_present_page(), when COW is enforced during fork(), we also need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma, and when the pte to be copied has the uffd-wp bit set. Remove the comment in copy_present_pte() about this. It won't help a lot to only comment there, and commenting everywhere would be overkill. Let's assume the commit messages will help. [[email protected]: fix a few thp pmd missing uffd-wp bit] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") Signed-off-by: Peter Xu <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: "Dr . David Alan Gilbert" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Wang Qing <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: sparsemem: use huge PMD mapping for vmemmap pagesMuchun Song1-19/+6
The preparation of splitting huge PMD mapping of vmemmap pages is ready, so switch the mapping from PTE to PMD. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: sparsemem: split the huge PMD mapping of vmemmap pagesMuchun Song1-2/+2
Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series[1]. We disable huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt coalasce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/[email protected]/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmmemap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPYMina Almasry1-0/+4
On UFFDIO_COPY, if we fail to copy the page contents while holding the hugetlb_fault_mutex, we will drop the mutex and return to the caller after allocating a page that consumed a reservation. In this case there may be a fault that double consumes the reservation. To handle this, we free the allocated page, fix the reservations, and allocate a temporary hugetlb page and return that to the caller. When the caller does the copy outside of the lock, we again check the cache, and allocate a page consuming the reservation, and copy over the contents. Test: Hacked the code locally such that resv_huge_pages underflows produce a warning and the copy_huge_page_from_user() always fails, then: ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success Both tests succeed and produce no warnings. After the test runs number of free/resv hugepages is correct. [[email protected]: remove set but not used variable 'vm_alloc_shared'] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix allocation error check and copy func name] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mina Almasry <[email protected]> Signed-off-by: YueHaibing <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Peter Xu <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/vmalloc: enable mapping of huge pages at pte level in vmallocChristophe Leroy1-0/+7
On some architectures like powerpc, there are huge pages that are mapped at pte level. Enable it in vmalloc. For that, architectures can provide arch_vmap_pte_supported_shift() that returns the shift for pages to map at pte level. Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
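A hypothetical 8xx-style implementation of the new hook might look like this (a sketch only; the generic fallback returning PAGE_SHIFT exists per the changelog, and the concrete shift values for 16k/512k pages are the illustrative part):

    #include <linux/sizes.h>

    /* pick the largest pte-level page size that fits the requested area */
    #define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
    static inline int arch_vmap_pte_supported_shift(unsigned long size)
    {
            if (size >= SZ_512K)
                    return 19;          /* 512k pages */
            else if (size >= SZ_16K)
                    return 14;          /* 16k pages */
            return PAGE_SHIFT;          /* fall back to base pages */
    }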
2021-06-30mm/vmalloc: enable mapping of huge pages at pte level in vmapChristophe Leroy1-0/+8
On some architectures like powerpc, there are huge pages that are mapped at pte level. Enable it in vmap. For that, architectures can provide arch_vmap_pte_range_map_size() that returns the size of pages to map at pte level. Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
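As described, the generic stub simply keeps the existing one-base-page-per-pte behaviour; a sketch (the signature is my reading of the changelog, so treat it as an assumption):

    #ifndef arch_vmap_pte_range_map_size
    static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
                            unsigned long end, u64 pfn,
                            unsigned int max_page_shift)
    {
            /* default: map one base page per pte, as before */
            return PAGE_SIZE;
    }
    #endif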
2021-06-30mm/pgtable: add stubs for {pmd/pub}_{set/clear}_hugeChristophe Leroy1-1/+25
For architectures with no PMD and/or no PUD, add stubs similar to what we have for architectures without P4D. [[email protected]: arm64: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu [[email protected]: x86: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Naresh Kamboju <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
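The stub pattern is the same one already used for missing P4D: real prototypes when the level exists, trivial inline stubs when it is folded. A sketch for the PMD case (the PUD case is analogous; the exact config guards are an assumption):

    #if CONFIG_PGTABLE_LEVELS > 2
    int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
    int pmd_clear_huge(pmd_t *pmd);
    #else
    static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
    {
            return 0;   /* no PMD level: nothing can be mapped huge here */
    }
    static inline int pmd_clear_huge(pmd_t *pmd)
    {
            return 0;   /* nothing to clear either */
    }
    #endif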
2021-06-30mm/hugetlb: change parameters of arch_make_huge_pte()Christophe Leroy1-2/+2
Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2. This series implements huge VMAP and VMALLOC on powerpc 8xx. Powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires use of hugepd tables. To allow this, the architecture provides two functions: - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what page size to use. A stub returning PAGE_SIZE is provided when the architecture doesn't provide this function. - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range() what page shift to use for a given area size. A stub returning PAGE_SHIFT is provided when the architecture doesn't provide this function. This patch (of 5): At the time being, arch_make_huge_pte() has the following prototype: pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, struct page *page, int writable); vma is used to get the pages shift or size. vma is also used on Sparc to get vm_flags. page is not used. writable is not used. In order to use this function without a vma, replace vma by shift and flags. Also remove the used parameters. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()Miaohe Lin1-22/+35
Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS"), read-only THP file mapping is supported. But that commit forgot to add a check for it in transparent_hugepage_enabled(). To fix it, we add a check for read-only THP file mappings and also introduce the helper transhuge_vma_enabled() to check whether THP is enabled for a specified vma, reducing duplicated code. We rename transparent_hugepage_enabled to transparent_hugepage_active to make the code easier to follow, as suggested by David Hildenbrand. [[email protected]: define transhuge_vma_enabled next to transhuge_vma_suitable] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Song Liu <[email protected]> Cc: William Kucharski <[email protected]> Cc: Zi Yan <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
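A sketch of such a helper, based on the description above (the precise conditions chosen by the patch may differ):

    static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                             unsigned long vm_flags)
    {
            /* THP explicitly disabled through madvise or prctl */
            if ((vm_flags & VM_NOHUGEPAGE) ||
                test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                    return false;
            return true;
    }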
2021-06-30mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASKMiaohe Lin1-4/+2
Patch series "Cleanup and fixup for huge_memory:, v3. This series contains cleanups to remove dedicated macro and remove unnecessary tlb_remove_page_size() for huge zero pmd. Also this adds missing read-only THP checking for transparent_hugepage_enabled() and avoids discarding hugepage if other processes are mapping it. More details can be found in the respective changelogs. Thi patch (of 5): Rewrite the pgoff checking logic to remove macro HPAGE_CACHE_INDEX_MASK which is only used here to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Zi Yan <[email protected]> Cc: William Kucharski <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Song Liu <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstateMuchun Song1-0/+3
All the infrastructure is ready, so we introduce nr_free_vmemmap_pages field in the hstate to indicate how many vmemmap pages associated with a HugeTLB page that can be freed to buddy allocator. And initialize it in the hugetlb_vmemmap_init(). This patch is actual enablement of the feature. There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP, so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Barry Song <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: hugetlb: add a kernel parameter hugetlb_free_vmemmapMuchun Song1-0/+19
Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Barry Song <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
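The boot parameter plumbing follows the standard early_param() pattern; a sketch (variable and function names are illustrative, not necessarily the ones used in the patch):

    #include <linux/init.h>
    #include <linux/string.h>
    #include <linux/types.h>

    bool hugetlb_free_vmemmap_enabled __ro_after_init;

    static int __init early_hugetlb_free_vmemmap_param(char *buf)
    {
            if (!buf)
                    return -EINVAL;

            if (!strcmp(buf, "on"))
                    hugetlb_free_vmemmap_enabled = true;
            else if (strcmp(buf, "off"))
                    return -EINVAL;     /* anything else is rejected */

            return 0;
    }
    early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);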
2021-06-30mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB pageMuchun Song2-0/+5
When we free a HugeTLB page to the buddy allocator, we need to allocate the vmemmap pages associated with it. However, we may not be able to allocate the vmemmap pages when the system is under memory pressure. In this case, we just refuse to free the HugeTLB page. This changes behavior in some corner cases as listed below: 1) Failing to free a huge page triggered by the user (decrease nr_pages). User needs to try again later. 2) Failing to free a surplus huge page when freed by the application. Try again later when freeing a huge page next time. 3) Failing to dissolve a free huge page on ZONE_MOVABLE via offline_pages(). This can happen when we have plenty of ZONE_MOVABLE memory, but not enough kernel memory to allocate vmemmmap pages. We may even be able to migrate huge page contents, but will not be able to dissolve the source huge page. This will prevent an offline operation and is unfortunate as memory offlining is expected to succeed on movable zones. Users that depend on memory hotplug to succeed for movable zones should carefully consider whether the memory savings gained from this feature are worth the risk of possibly not being able to offline memory in certain situations. 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via alloc_contig_range() - once we have that handling in place. Mainly affects CMA and virtio-mem. Similar to 3). virito-mem will handle migration errors gracefully. CMA might be able to fallback on other free areas within the CMA region. Vmemmap pages are allocated from the page freeing context. In order for those allocations to be not disruptive (e.g. trigger oom killer) __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because a non sleeping allocation would be too fragile and it could fail too easily under memory pressure. GFP_ATOMIC or other modes to access memory reserves is not used because we want to prevent consuming reserves under heavy hugetlb freeing. [[email protected]: fix dissolve_free_huge_page use of tail/head page] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix alloc_vmemmap_page_list documentation warning] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chen Huang <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Paul E. 
McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
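The allocation constraints described above translate into a GFP mask roughly like the following (a sketch; the helper name is made up for illustration and only the flags mentioned in the changelog are assumed):

    /* Allocate one vmemmap page on the HugeTLB page's node without being
     * disruptive: __GFP_NORETRY gives up early instead of invoking the OOM
     * killer, and no memory reserves are tapped (no GFP_ATOMIC), so heavy
     * hugetlb freeing cannot drain the reserves. */
    static struct page *alloc_one_vmemmap_page(int nid)
    {
            return alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
    }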
2021-06-30mm: hugetlb: free the vmemmap pages associated with each HugeTLB pageMuchun Song2-1/+30
Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [[email protected]: fix documentation warning] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Barry Song <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30mm: hugetlb: gather discrete indexes of tail pageMuchun Song2-10/+30
For HugeTLB page, there are more metadata to save in the struct page. But the head struct page cannot meet our needs, so we have to abuse other tail struct page to store the metadata. In order to avoid conflicts caused by subsequent use of more tail struct pages, we can gather these discrete indexes of tail struct page. In this case, it will be easier to add a new tail page index later. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Barry Song <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
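Gathering the indexes into one enum might look like this (a sketch; the actual enumerators depend on which metadata hugetlb currently stores in tail pages):

    enum {
            /* tail page 1: hugetlb subpool pointer (reuses page->private) */
            SUBPAGE_INDEX_SUBPOOL = 1,
            /* tail pages 2-3: hugetlb cgroup pointers (reuse page->private) */
            SUBPAGE_INDEX_CGROUP,
            SUBPAGE_INDEX_CGROUP_RSVD,
            /* first tail page index not used for hugetlb metadata */
            __NR_USED_SUBPAGE,
    };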
2021-06-30mm: memory_hotplug: factor out bootmem core functions to bootmem_info.cMuchun Song2-27/+40
Patch series "Free some vmemmap pages of HugeTLB page", v23. This patch series will free some vmemmap pages(struct page structures) associated with each HugeTLB page when preallocated to save memory. In order to reduce the difficulty of the first version of code review. In this version, we disable PMD/huge page mapping of vmemmap if this feature was enabled. This acutely eliminates a bunch of the complex code doing page table manipulation. When this patch series is solid, we cam add the code of vmemmap page table manipulation in the future. The struct page structures (page structs) are used to describe a physical page frame. By default, there is an one-to-one mapping from a page frame to it's corresponding page struct. The HugeTLB pages consist of multiple base page size pages and is supported by many architectures. See hugetlbpage.rst in the Documentation directory for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages. For each base page, there is a corresponding page struct. Within the HugeTLB subsystem, only the first 4 page structs are used to contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER provides this upper limit. The only 'useful' information in the remaining page structs is the compound_head field, and this field is the same for all tail pages. By removing redundant page structs for HugeTLB pages, memory can returned to the buddy allocator for other uses. When the system boot up, every 2M HugeTLB has 512 struct page structs which size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE). HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | -------------> | 2 | | | +-----------+ +-----------+ | | | 3 | -------------> | 3 | | | +-----------+ +-----------+ | | | 4 | -------------> | 4 | | 2MB | +-----------+ +-----------+ | | | 5 | -------------> | 5 | | | +-----------+ +-----------+ | | | 6 | -------------> | 6 | | | +-----------+ +-----------+ | | | 7 | -------------> | 7 | | | +-----------+ +-----------+ | | | | | | +-----------+ The value of page->compound_head is the same for all tail pages. The first page of page structs (page 0) associated with the HugeTLB page contains the 4 page structs necessary to describe the HugeTLB. The only use of the remaining pages of page structs (page 1 to page 7) is to point to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs will be used for each HugeTLB page. This will allow us to free the remaining 6 pages to the buddy allocator. Here is how things look after remapping. 
    HugeTLB                  struct pages(8 pages)          page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
    |           |                     +-----------+                   | | | | |
    |           |                     |     3     | ------------------+ | | | |
    |           |                     +-----------+                     | | | |
    |           |                     |     4     | --------------------+ | | |
    |    2MB    |                     +-----------+                       | | |
    |           |                     |     5     | ----------------------+ | |
    |           |                     +-----------+                         | |
    |           |                     |     6     | ------------------------+ |
    |           |                     +-----------+                           |
    |           |                     |     7     | --------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+

When a HugeTLB is freed to the buddy system, we should allocate 6 pages for vmemmap pages and restore the previous mapping relationship.

Apart from 2MB HugeTLB pages, we also have 1GB HugeTLB pages. The situation is similar to the 2MB HugeTLB page, and we can also use this approach to free the vmemmap pages. In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a very substantial gain. On our server, we run some SPDK/QEMU applications which will use 1024GB of HugeTLB pages. With this feature enabled, we can save ~16GB (1G hugepage)/~12GB (2MB hugepage) of memory.

Because vmemmap page tables are reconstructed on the freeing/allocating path, this adds some overhead. Here is some overhead analysis.

1) Allocating 10240 2MB HugeTLB pages.

   a) With this patch series applied:

   # time echo 10240 > /proc/sys/vm/nr_hugepages
   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...
   @latency:
   [8K, 16K)    5476
   [16K, 32K)   4760
   [32K, 64K)      4

   b) Without this patch series:

   # time echo 10240 > /proc/sys/vm/nr_hugepages
   real     0m0.067s
   user     0m0.000s
   sys      0m0.067s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...
   @latency:
   [4K, 8K)    10147
   [8K, 16K)      93

   Summary: this feature is about ~2x slower than before.

2) Freeing 10240 2MB HugeTLB pages.

   a) With this patch series applied:

   # time echo 0 > /proc/sys/vm/nr_hugepages
   real     0m0.213s
   user     0m0.000s
   sys      0m0.213s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...
   @latency:
   [8K, 16K)       6
   [16K, 32K)  10227
   [32K, 64K)      7

   b) Without this patch series:

   # time echo 0 > /proc/sys/vm/nr_hugepages
   real     0m0.081s
   user     0m0.000s
   sys      0m0.081s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...
   @latency:
   [4K, 8K)     6805
   [8K, 16K)    3427
   [16K, 32K)      8

   Summary: The overhead of __free_hugepage is about ~2-3x slower than before.

Although the overhead has increased, it is not significant. Like Mike said, "However, remember that the majority of use cases create HugeTLB pages at or shortly after boot time and add them to the pool.
So, additional overhead is at pool creation time. There is no change to 'normal run time' operations of getting a page from or returning a page to the pool (think page fault/unmap)".

Despite the overhead, and in addition to the memory gains from this series, the following data was obtained by Joao Martins. Many thanks for his effort.

There's an additional benefit: page (un)pinners will see an improvement, presumably because there are fewer memmap pages and thus the tail/head pages stay in cache more often. Out of the box, Joao saw (when comparing linux-next against linux-next + this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):

   get_user_pages(): ~32k -> ~9k
   unpin_user_pages(): ~75k -> ~70k

Usually any tight loop fetching compound_head(), or reading tail page data (e.g. compound_head), benefits a lot. There are some unpinning inefficiencies Joao was fixing [2]; with those added in, it shows even more:

   unpin_user_pages(): ~27k -> ~3.8k

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

This patch (of 9): Move the bootmem info registration common API into its own file, bootmem_info.c. We will use {get,put}_page_bootmem() to initialize the page for the vmemmap pages, or free the vmemmap pages to the buddy allocator, in a later patch. So move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any functional change. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Tested-by: Chen Huang <[email protected]> Tested-by: Bodeddula Balasubramaniam <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pawan Gupta <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Oliver Neukum <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Mina Almasry <[email protected]> Cc: David Rientjes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Barry Song <[email protected]> Cc: HORIGUCHI NAOYA <[email protected]> Cc: Joao Martins <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-06-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds3-2/+143
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "In addition to bug fixes and cleanups, there are two new features for ext4 in 5.14:
  - Allow applications to poll on changes to /sys/fs/ext4/*/errors_count
  - Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be checkpointed, truncated and discarded or zero'ed"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
  jbd2: export jbd2_journal_[un]register_shrinker()
  ext4: notify sysfs on errors_count value change
  fs: remove bdev_try_to_free_page callback
  ext4: remove bdev_try_to_free_page() callback
  jbd2: simplify journal_clean_one_cp_list()
  jbd2,ext4: add a shrinker to release checkpointed buffers
  jbd2: remove redundant buffer io error checks
  jbd2: don't abort the journal when freeing buffers
  jbd2: ensure abort the journal if detect IO error when writing original buffer back
  jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
  ext4: no need to verify new add extent block
  jbd2: clean up misleading comments for jbd2_fc_release_bufs
  ext4: add check to prevent attempting to resize an fs with sparse_super2
  ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
  ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
  ext4: fsmap: fix the block/inode bitmap comment
  ext4: fix comment for s_hash_unsigned
  ext4: use local variable ei instead of EXT4_I() macro
  ext4: fix avefreec in find_group_orlov
  ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
  ...
2021-07-01Merge branch 'cpufreq/cppc-fie' into cpufreq/arm/linux-nextViresh Kumar202-6124/+3272