path: root/Documentation/admin-guide/mm
Age | Commit message | Author | Files | Lines
2024-11-07 | mm: fix docs for the kernel parameter ``thp_anon=`` | Maíra Canal | 1 | -1/+1
If we add ``thp_anon=32,64K:always`` to the kernel command line, we will see the following error: [ 0.000000] huge_memory: thp_anon=32,64K:always: error parsing string, ignoring setting This happens because the correct format isn't ``thp_anon=<size>,<size>[KMG]:<state>``, as [KMG] must follow each number to specify its unit. So, the correct format is ``thp_anon=<size>[KMG],<size>[KMG]:<state>``. Therefore, adjust the documentation to reflect the correct format of the parameter ``thp_anon=``. Link: https://lkml.kernel.org/r/[email protected] Fixes: dd4d30d1cdbe ("mm: override mTHP "enabled" defaults at kernel cmdline") Signed-off-by: Maíra Canal <[email protected]> Acked-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
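The unit rule described above can be illustrated with a small check. This is only a sketch, not the kernel's parser: the function name `is_valid_thp_anon_list` is hypothetical, the state keywords are taken from the ``thp_anon=`` example elsewhere in this log, and ranges and ``;``-separated groups are deliberately left out.

```python
import re

# One size token: a decimal number followed by a mandatory K, M or G suffix.
_SIZE = r"\d+[KMG]"

# A comma-separated list of sizes, a colon, then a state keyword.
# Illustrative only: the kernel does not use a regex for this, and this
# pattern ignores ranges ("16K-64K") and ';'-separated groups.
_THP_ANON_LIST = re.compile(
    rf"^{_SIZE}(?:,{_SIZE})*:(?:always|inherit|madvise|never)$"
)


def is_valid_thp_anon_list(value: str) -> bool:
    """Return True if every size in the list carries its own unit suffix."""
    return _THP_ANON_LIST.match(value) is not None
```

With this sketch, `is_valid_thp_anon_list("32,64K:always")` is rejected because `32` has no unit, while `is_valid_thp_anon_list("32K,64K:always")` is accepted.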
2024-09-21 | Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 1 | -2/+3
Pull non-MM updates from Andrew Morton: "Many singleton patches - please see the various changelogs for details. Quite a lot of nilfs2 work this time around. Notable patch series in this pull request are: - "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64() to provide (much) more accurate results. The current implementation was causing Uwe some issues in the PWM drivers. - "xz: Updates to license, filters, and compression options" from Lasse Collin. Miscellaneous maintenance and minor feature work to the xz decompressor. - "Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee. Fixes and enhancements to the gdb scripts. - "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this. - "nilfs2: add support for some common ioctls" from Ryusuke Konishi. Adds various commonly-available ioctls to nilfs2. - "This series fixes a number of formatting issues in kernel doc comments" from Ryusuke Konishi does that. - "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi. Fix issues where -ENOENT was being unintentionally and inappropriately returned to userspace. - "nilfs2: assorted cleanups" from Huang Xiaojia. - "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke Konishi fixes some issues which can occur on corrupted nilfs2 filesystems. 
- "scripts/decode_stacktrace.sh: improve error reporting and usability" from Luca Ceresoli does those things" * tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits) list: test: increase coverage of list_test_list_replace*() list: test: fix tests for list_cut_position() proc: use __auto_type more treewide: correct the typo 'retun' ocfs2: cleanup return value and mlog in ocfs2_global_read_info() nilfs2: remove duplicate 'unlikely()' usage nilfs2: fix potential oob read in nilfs_btree_check_delete() nilfs2: determine empty node blocks as corrupted nilfs2: fix potential null-ptr-deref in nilfs_btree_insert() user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation tools/mm: rm thp_swap_allocator_test when make clean squashfs: fix percpu address space issues in decompressor_multi_percpu.c lib: glob.c: added null check for character class nilfs2: refactor nilfs_segctor_thread() nilfs2: use kthread_create and kthread_stop for the log writer thread nilfs2: remove sc_timer_task nilfs2: do not repair reserved inode bitmap in nilfs_new_inode() nilfs2: eliminate the shared counter and spinlock for i_generation nilfs2: separate inode type information from i_state field nilfs2: use the BITS_PER_LONG macro ...
2024-09-09 | mm: add sysfs entry to disable splitting underused THPs | Usama Arif | 1 | -0/+10
If disabled, THPs faulted in or collapsed will not be added to _deferred_list, and therefore won't be considered for splitting under memory pressure if underused. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Cc: Alexander Zhu <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nico Pache <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-09 | mm: split underused THPs | Usama Arif | 1 | -0/+6
This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list. Whenever memory reclaim happens in linux, the kernel runs the deferred_split shrinker which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Suggested-by: Rik van Riel <[email protected]> Co-authored-by: Johannes Weiner <[email protected]> Cc: Alexander Zhu <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nico Pache <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Shuang Zhai <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
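The underused check described above can be sketched in a few lines. This is a user-space illustration under stated assumptions, not the shrinker's code: the helper names are hypothetical, a THP is modelled as a plain `bytes` buffer, and the base page size is taken to be the 4K mentioned in the message.

```python
PAGE_SIZE = 4096  # base page size assumed from the message ("base 4K pages")


def count_zero_filled_pages(thp: bytes, page_size: int = PAGE_SIZE) -> int:
    """Count how many base pages of the buffer contain only zero bytes."""
    zero_page = bytes(page_size)
    return sum(
        thp[offset:offset + page_size] == zero_page
        for offset in range(0, len(thp), page_size)
    )


def is_underused(thp: bytes, max_ptes_none: int) -> bool:
    """A THP counts as underused when its number of zero-filled base pages
    goes above the max_ptes_none threshold."""
    return count_zero_filled_pages(thp) > max_ptes_none
```

For a 2M buffer (512 base pages) in which 500 pages are zero-filled, the sketch reports the THP as underused for a threshold of 409 and as not underused for a threshold of 511.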
2024-09-09 | Docs/damon: use damonitor GitHub organization instead of awslabs | SeongJae Park | 2 | -6/+6
Patch series "Docs/damon: update GitHub repo URLs and maintainer-profile". Replace GitHub URLs in DAMON documents for the non-kernel parts of DAMON repos with new ones[1] via the first patch. With the following two patches, wordsmith maintainer-profile for better readability, and document the Google calendar for bi-weekly meetups, respectively. [1] https://lore.kernel.org/[email protected] This patch (of 3): GitHub repos for non-kernel parts of the DAMON project, including 'damo', 'damon-tests' and 'damoos', will be moved[1] from the 'awslabs' org to 'damonitor' by 2024-09-05. Update related URLs in kernel tree. [1] https://lore.kernel.org/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-09 | mm: count the number of partially mapped anonymous THPs per size | Barry Song | 1 | -0/+7
When a THP is added to the deferred_list due to partially mapped, its partial pages are unused, leading to wasted memory and potentially increasing memory reclamation pressure. Detailing the specifics of how unmapping occurs is quite difficult and not that useful, so we adopt a simple approach: each time a THP enters the deferred_list, we increment the count by 1; whenever it leaves for any reason, we decrement the count by 1. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: Chuanhua Han <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuai Yuan <[email protected]> Cc: Usama Arif <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-09 | mm: count the number of anonymous THPs per size | Barry Song | 1 | -0/+5
Patch series "mm: count the number of anonymous THPs per size", v4. Knowing the number of transparent anon THPs in the system is crucial for performance analysis. It helps in understanding the ratio and distribution of THPs versus small folios throughout the system. Additionally, partial unmapping by userspace can lead to significant waste of THPs over time and increase memory reclamation pressure. We need this information for comprehensive system tuning. This patch (of 2): Let's track for each anonymous THP size, how many of them are currently allocated. We'll track the complete lifespan of an anon THP, starting when it becomes an anon THP ("large anon folio") (->mapping gets set), until it gets freed (->mapping gets cleared). Introduce a new "nr_anon" counter per THP size and adjust the corresponding counter in the following cases: * We allocate a new THP and call folio_add_new_anon_rmap() to map it the first time and turn it into an anon THP. * We split an anon THP into multiple smaller ones. * We migrate an anon THP, when we prepare the destination. * We free an anon THP back to the buddy. Note that AnonPages in /proc/meminfo currently tracks the total number of *mapped* anonymous *pages*, and therefore has slightly different semantics. In the future, we might also want to track "nr_anon_mapped" for each THP size, which might be helpful when comparing it to the number of allocated anon THPs (long-term pinning, stuck in swapcache, memory leaks, ...). Further note that for now, we only track anon THPs after they got their ->mapping set, for example via folio_add_new_anon_rmap(). If we would allocate some in the swapcache, they will only show up in the statistics for now after they have been mapped to user space the first time, where we call folio_add_new_anon_rmap(). 
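A per-size counter such as "nr_anon" can be collected from user space with a short script. The following is a sketch under assumptions: the function name `read_nr_anon` is hypothetical, and the layout `hugepages-<size>kB/stats/nr_anon` under `/sys/kernel/mm/transparent_hugepage` is assumed rather than stated in the message above, so the root directory is a parameter.

```python
from pathlib import Path


def read_nr_anon(root: str = "/sys/kernel/mm/transparent_hugepage") -> dict:
    """Map each per-size directory name to its nr_anon counter value.

    The directory layout (hugepages-<size>kB/stats/nr_anon) is an
    assumption of this sketch; sizes without the file are skipped.
    """
    counts = {}
    for stat in sorted(Path(root).glob("hugepages-*kB/stats/nr_anon")):
        size_dir = stat.parent.parent.name  # e.g. "hugepages-64kB"
        counts[size_dir] = int(stat.read_text().strip())
    return counts
```

Comparing these values across sizes gives the distribution of allocated anon THPs that the series aims to expose.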
[[email protected]: documentation fixups, per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: Chuanhua Han <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuai Yuan <[email protected]> Cc: Usama Arif <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | Document/kexec: generalize crash hotplug description | Sourabh Jain | 1 | -2/+3
Commit 79365026f869 ("crash: add a new kexec flag for hotplug support") generalizes the crash hotplug support to allow architectures to update multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr. Therefore, update the relevant kernel documentation to reflect the same. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sourabh Jain <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Petr Tesarik <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | mm: override mTHP "enabled" defaults at kernel cmdline | Ryan Roberts | 1 | -7/+31
Add thp_anon= cmdline parameter to allow specifying the default enablement of each supported anon THP size. The parameter accepts the following format and can be provided multiple times to configure each size: thp_anon=<size>,<size>[KMG]:<value>;<size>-<size>[KMG]:<value> An example: thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never See Documentation/admin-guide/mm/transhuge.rst for more details. Configuring the defaults at boot time is useful to allow early user space to take advantage of mTHP before it's been configured through sysfs. [[email protected]: use get_order() and check size is is_power_of_2] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: some minor cleanup according to David's comments] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Co-developed-by: Barry Song <[email protected]> Signed-off-by: Barry Song <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Tested-by: Baolin Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lance Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
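To make the range and list syntax concrete, the example string above can be expanded into a per-size policy table. This is a user-space sketch and not the kernel's parser: `parse_thp_anon` and `_parse_size` are hypothetical names, every size is required to carry its own unit suffix (as the later 2024-11-07 documentation fix in this log states), and a range is assumed to cover every power-of-two size between its bounds.

```python
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_STATES = ("always", "inherit", "madvise", "never")


def _parse_size(token: str) -> int:
    """Convert a token such as '64K' to bytes; reject missing units."""
    number, unit = token[:-1], token[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"bad size {token!r}")
    size = int(number) * _UNITS[unit]
    if size == 0 or size & (size - 1):
        raise ValueError(f"{token!r} is not a power of two")
    return size


def parse_thp_anon(value: str) -> dict[int, str]:
    """Expand a thp_anon= value into a {size_in_bytes: state} mapping."""
    policy = {}
    for group in value.split(";"):
        sizes, _, state = group.partition(":")
        if state not in _STATES:
            raise ValueError(f"bad state in {group!r}")
        for item in sizes.split(","):
            start, dash, end = item.partition("-")
            low = _parse_size(start)
            high = _parse_size(end) if dash else low
            if low > high:
                raise ValueError(f"bad range {item!r}")
            size = low
            while size <= high:  # assumed: a range covers each power of two
                policy[size] = state
                size *= 2
    return policy
```

Applied to the example from the message, `16K-64K:always` expands to the 16K, 32K and 64K sizes, and `1M-2M:never` to 1M and 2M. Which sizes the kernel actually supports depends on the architecture and is not modelled here.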
2024-07-21 | Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 4 | -47/+119
Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targeted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainability thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. 
- Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in via sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to another CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". 
This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplification work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. 
The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors, in order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicking. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments, so these pages can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. 
- Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
2024-07-18 | Merge tag 'docs-6.11' of git://git.lwn.net/linux | Linus Torvalds | 1 | -1/+1
Pull documentation updates from Jonathan Corbet: "Nothing hugely exciting happening in the documentation tree this time around, mostly more of the usual: - More Spanish, Italian, and Chinese translations - A new script, scripts/checktransupdate.py, can be used to see which commits have touched an (English) document since a given translation was last updated. - A couple of "best practices" suggestions (on Link: tags and off-list discussions) that were not entirely at consensus level, but I concluded they were close enough to accept. - Some nice cleanups removing documentation for kernel parameters that have not been recognized for ... a long time. ...along with the usual updates, typo fixes, and such" * tag 'docs-6.11' of git://git.lwn.net/linux: (57 commits) Documentation: Document user_events ioctl code docs/pinctrl: fix typo in mapping example docs: maintainer: discourage taking conversations off-list docs: driver-model: platform: update the definition of platform_driver docs/sp_SP: Add translation for scheduler/sched-design-CFS.rst writing_musb_glue_layer.rst: Fix broken URL zh_CN/admin-guide: one typo fix docs/zh_CN/virt: Update the translation of guest-halt-polling.rst Documentation: add reference from dynamic debug to loglevel kernel params Documentation: best practices for using Link trailers Documentation: fix links to mailing list services Documentation: exception-tables.rst: Fix the wrong steps referenced docs/zh_CN: add process/researcher-guidelines Chinese translation Documentation/tools/rv: fix document header docs/sp_SP: Add translation of process/maintainer-kvm-x86.rst docs/admin-guide/mm: correct typo 'quired' to 'queried' Add libps2 to the input section of driver-api Docs/mm/index: move allocation profiling document to unsorted documents chapter Docs/mm/index: rename 'Legacy Documentation' to 'Unsorted Documentation' Docs/mm/index: Remove 'Memory Management Guide' chapter marker ...
2024-07-12 | mm: shmem: rename mTHP shmem counters | Ryan Roberts | 1 | -13/+16
The legacy PMD-sized THP counters at /proc/vmstat include thp_file_alloc, thp_file_fallback and thp_file_fallback_charge, which rather confusingly refer to shmem THP and do not include any other types of file pages. This is inconsistent since in most other places in the kernel, THP counters are explicitly separated for anon, shmem and file flavours. However, we are stuck with it since it constitutes a user ABI. Recently, commit 66f44583f9b6 ("mm: shmem: add mTHP counters for anonymous shmem") added equivalent mTHP stats for shmem, keeping the same "file_" prefix in the names. But in future, we may want to add extra stats to cover actual file pages, at which point, it would all become very confusing. So let's take the opportunity to rename these new counters "shmem_" before the change makes it upstream and the ABI becomes immutable. While we are at it, let's improve the documentation for the legacy counters to make it clear that they count shmem pages only. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12 | kpageflags: detect isolated KPF_THP folios | Ran Xiaokai | 1 | -2/+2
When folio is isolated, the PG_lru bit is cleared. So the PG_lru check in stable_page_flags() will miss this kind of isolated folios. Use folio_test_large_rmappable() instead to also include isolated folios. Since pagecache supports large folios and the introduction of mTHP, the semantics of KPF_THP have been expanded, now it indicates not only PMD-sized THP. Update related documentation to clearly state that KPF_THP indicates multiple order THPs. [[email protected]: directly use is_zero_folio(), per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ran Xiaokai <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Svetly Todorov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12 | mm: fix khugepaged activation policy | Ryan Roberts | 1 | -6/+5
Since the introduction of mTHP, the documentation has stated that khugepaged would be enabled when any mTHP size is enabled, and disabled when all mTHP sizes are disabled. There are 2 problems with this; 1. this is not what was implemented by the code and 2. this is not the desirable behavior. Desirable behavior is for khugepaged to be enabled when any PMD-sized THP is enabled, anon or file. (Note that file THP is still controlled by the top-level control so we must always consider that, as well as the PMD-size mTHP control for anon). khugepaged only supports collapsing to PMD-sized THP so there is no value in enabling it when PMD-sized THP is disabled. So let's change the code and documentation to reflect this policy. Further, per-size enabled control modification events were not previously forwarded to khugepaged to give it an opportunity to start or stop. Consequently the following was resulting in khugepaged erroneously not being activated: echo never > /sys/kernel/mm/transparent_hugepage/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface") Closes: https://lore.kernel.org/linux-mm/[email protected]/ Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lance Yang <[email protected]> Cc: Yang Shi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
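The activation policy described above can be modelled as a small decision function. This is a sketch of the stated policy only, not the kernel code: `khugepaged_should_run` is a hypothetical name, and treating "always" and "madvise" as the enabling values of both controls is an assumption of this example.

```python
_ENABLING = ("always", "madvise")  # assumed set of values that enable THP


def khugepaged_should_run(top_level: str, pmd_size: str) -> bool:
    """Decide whether khugepaged should be active.

    top_level: value of .../transparent_hugepage/enabled, which per the
               message still controls file THP.
    pmd_size:  value of the PMD-size (e.g. hugepages-2048kB) 'enabled'
               control for anon THP.
    """
    file_thp_enabled = top_level in _ENABLING
    anon_pmd_enabled = pmd_size in _ENABLING or (
        pmd_size == "inherit" and top_level in _ENABLING
    )
    return file_thp_enabled or anon_pmd_enabled
```

The sequence from the message (top level set to `never`, PMD size set to `always`) corresponds to `khugepaged_should_run("never", "always")`, which this model treats as active.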
2024-07-12 | mm: add docs for per-order mTHP split counters | Lance Yang | 1 | -4/+15
This commit introduces documentation for mTHP split counters in transhuge.rst. [[email protected]: improve the doc as suggested by Ryan] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: tweak Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mingzhe Yang <[email protected]> Signed-off-by: Lance Yang <[email protected]> Reviewed-by: Barry Song <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Bang Li <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-10 | Docs/admin-guide/mm/damon/start: add access pattern snapshot example | SeongJae Park | 1 | -4/+42
DAMON user-space tool (damo) provides an access pattern snapshot feature, which is expected to be frequently used for real-time access pattern analysis. The snapshot output also shows what DAMON provides on its own, including the 'age' information. In contrast, the recorded access patterns, which are shown as an example usage in the quick start section, show what users can make from what DAMON provides. They include information that is generated outside of DAMON and make the 'age' concept a bit unclear. Hence the snapshot output makes it easier to understand the raw real-time output of DAMON. Add the snapshot usage example to the quick start section. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03 | Docs/damon: document damos_migrate_{hot,cold} | Honggyu Kim | 1 | -3/+7
This patch adds DAMON descriptions for the "migrate_hot" and "migrate_cold" actions to both the usage and design documents, as well as a new "target_nid" knob to set the migration target node. [[email protected]: trivial fixups for DAMOS_MIGRATE_{HOT,COLD} documentation] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Honggyu Kim <[email protected]> Signed-off-by: SeongJae Park <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Gregory Price <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Hyeongtak Ji <[email protected]> Cc: Masami Hiramatsu (Google) <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Rakie Kim <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03Documentation/admin-guide/mm/pagemap.rst: drop "Using pagemap to do ↵David Hildenbrand1-21/+0
something useful" That example was added in 2008. In 2015, we restricted access to the PFNs in the pagemap to CAP_SYS_ADMIN, making that approach considerably less usable. It's 2024 now, and using that racy and low-level mechanism to calculate the USS should not be considered a good example anymore. /proc/$pid/smaps and /proc/$pid/smaps_rollup can do a much better job without any of that low-level handling. Let's just drop that example. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lance Yang <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: shmem: add mTHP counters for anonymous shmemBaolin Wang1-0/+13
Add mTHP counters for anonymous shmem. [[email protected]: update Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/4fd9e467d49ae4a747e428bcd821c7d13125ae67.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Lance Yang <[email protected]> Cc: Barry Song <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: shmem: add multi-size THP sysfs interface for anonymous shmemBaolin Wang1-0/+25
To support the use of mTHP with anonymous shmem, add a new sysfs interface 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/' directory for each mTHP to control whether shmem is enabled for that mTHP, with a value similar to the top level 'shmem_enabled', which can be set to: "always", "inherit (to inherit the top level setting)", "within_size", "advise", "never". An 'inherit' option is added to ensure compatibility with these global settings, and the options 'force' and 'deny' are dropped, which are rather testing artifacts from the old ages. By default, PMD-sized hugepages have enabled="inherit" and all other hugepage sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'. In addition, if the top level value is 'force', then only PMD-sized hugepages have enabled="inherit", otherwise the configuration will fail, and vice versa. That means now we will avoid using non-PMD sized THP to override the global huge allocation. [[email protected]: fix transhuge.rst indentation] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: reflow transhuge.rst addition to 80 cols] [[email protected]: move huge_shmem_orders_lock under CONFIG_SYSFS] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: huge_memory.c needs mm_types.h] Link: https://lkml.kernel.org/r/ffddfa8b3cb4266ff963099ab78cfd7184c57ac7.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
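The per-size 'shmem_enabled' knob described in the commit above could be driven from user space roughly as follows. This is a minimal sketch, not part of the commit: the helper name `set_shmem_enabled`, its `root` parameter, and the assumption that the directory is named `hugepages-<size>kB` with the size in kilobytes are illustrative; only the list of accepted values is taken from the commit message.

```python
from pathlib import Path

# Top-level THP sysfs directory, as named in the commit message.
THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")

# Values the commit message lists for the per-size knob
# ('force' and 'deny' are not accepted at this level).
PER_SIZE_VALUES = {"always", "inherit", "within_size", "advise", "never"}


def set_shmem_enabled(size_kb: int, value: str, root: Path = THP_ROOT) -> Path:
    """Write `value` to hugepages-<size_kb>kB/shmem_enabled under `root`.

    Hypothetical helper: `root` exists only so the function can be pointed
    at a scratch directory instead of the real sysfs tree.
    """
    if value not in PER_SIZE_VALUES:
        raise ValueError(f"unsupported per-size shmem_enabled value: {value!r}")
    knob = Path(root) / f"hugepages-{size_kb}kB" / "shmem_enabled"
    knob.write_text(value)
    return knob
```

On a real system the call would need root privileges, and the kernel may still reject a write that conflicts with the top-level setting, as the commit message notes for 'force'.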
2024-07-03docs/admin-guide/mm: correct typo 'quired' to 'queried'Daniel Watson1-1/+1
Convert the word "quired" to the word "queried" which makes more sense in this context. Signed-off-by: Daniel Watson <[email protected]> Signed-off-by: Jonathan Corbet <[email protected]> Link: https://lore.kernel.org/r/878qymrjrg.fsf@trent-reznor
2024-06-05mm: drop the 'anon_' prefix for swap-out mTHP countersBaolin Wang1-2/+2
The mTHP swap related counters: 'anon_swpout' and 'anon_swpout_fallback' are confusing with an 'anon_' prefix, since the shmem can swap out non-anonymous pages. So drop the 'anon_' prefix to keep consistent with the old swap counter names. This is needed in 6.10-rcX to avoid having an inconsistent ABI out in the field. Link: https://lkml.kernel.org/r/7a8989c13299920d7589007a30065c3e2c19f0e0.1716431702.git.baolin.wang@linux.alibaba.com Fixes: d0f048ac39f6 ("mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters") Fixes: 42248b9d34ea ("mm: add docs for per-order mTHP counters and transhuge_page ABI") Signed-off-by: Baolin Wang <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Acked-by: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds4-48/+55
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initialization code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". 
- More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". 
- selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. 
- SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests in the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: 
check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-11Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update ↵SeongJae Park1-2/+2
command To update effective size quota of DAMOS schemes on DAMON sysfs file interface, user should write 'update_schemes_effective_quotas' to the kdamond 'state' file. But the document is mistakenly saying the input string as 'update_schemes_effective_bytes'. Fix it (s/bytes/quotas/). Link: https://lkml.kernel.org/r/[email protected] Fixes: a6068d6dfa2f ("Docs/admin-guide/mm/damon/usage: document effective_bytes file") Signed-off-by: SeongJae Park <[email protected]> Cc: <[email protected]> [6.9.x] Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
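The corrected command from the commit above amounts to writing one keyword into the kdamond 'state' file. A minimal sketch follows, assuming the DAMON sysfs tree is mounted at `/sys/kernel/mm/damon/admin` and that kdamonds are numbered directories below `kdamonds/`; the helper name and its `root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Assumed location of the DAMON sysfs interface (not stated in the commit).
DAMON_ADMIN = Path("/sys/kernel/mm/damon/admin")

# The keyword the commit says must be written; the older documentation
# wrongly gave 'update_schemes_effective_bytes'.
UPDATE_KEYWORD = "update_schemes_effective_quotas"


def request_effective_quota_update(kdamond: int = 0,
                                   root: Path = DAMON_ADMIN) -> Path:
    """Write the update keyword to kdamonds/<kdamond>/state under `root`."""
    state = Path(root) / "kdamonds" / str(kdamond) / "state"
    state.write_text(UPDATE_KEYWORD)
    return state
```

Passing a scratch directory as `root` lets the helper be exercised without touching a live kernel interface.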
2024-05-11Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching ↵SeongJae Park1-1/+1
sysfs file The example usage of DAMOS filter sysfs files, specifically the part of 'matching' file writing for memcg type filter, is wrong. The intention is to exclude pages of a memcg that is already getting enough care from a given scheme, but the example is setting the filter to apply the scheme to only the pages of the memcg. Fix it. Link: https://lkml.kernel.org/r/[email protected] Fixes: 9b7f9322a530 ("Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs") Closes: https://lore.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: <[email protected]> [6.3.x] Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-05Docs/admin-guide/mm/damon/usage: update for young page type DAMOS filterSeongJae Park1-13/+13
Update DAMON usage document for the newly added DAMOS filter type, 'young page'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Honggyu Kim <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-05mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()David Hildenbrand1-1/+2
We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios. khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs. In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting. khugepaged does currently not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared". Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived subprocesses either quit or exec(). There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste. Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable. 
After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range. So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()? For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change. We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes. There are two cases to consider: (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes. folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping little pages ("mostly shared") it won't make a big impact on the end result. So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases. (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives. If we detect a THP as shared, at least one page of the THP is mapped by another process. It could well be that some pages are actually exclusive. 
For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to our process, which we now would treat as still-shared. Examples: (1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive. (2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive. In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP. For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP. Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent. For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later. For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it. To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit. Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely. Can we improve (B)? 
We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc. We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Zi Yan <[email protected]> Cc: Yang Shi <[email protected]> Cc: John Hubbard <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-05mm: zswap: remove same_filled module paramsYosry Ahmed1-29/+0
These knobs offer more fine-grained control to userspace than needed and directly expose/influence kernel implementation; remove them. For disabling same_filled handling, there is no logical reason to refuse storing same-filled pages more efficiently and opt for compression. Scanning pages for patterns may be an argument, but the page contents will be read into the CPU cache anyway during compression. Also, removing the same_filled handling code does not move the needle significantly in terms of performance anyway [1]. For disabling non_same_filled handling, it was added when the compressed pages in zswap were not being properly charged to memcgs, as workloads could escape the accounting with compression [2]. This is no longer the case after commit f4840ccfca25 ("zswap: memcg accounting"), and using zswap without compression does not make much sense. [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/ [2]https://lore.kernel.org/lkml/[email protected]/ [[email protected]: remove same_filled_pages from docs] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Cc: "Maciej S. Szmigiero" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-05mm: correct the docs for thp_fault_alloc and thp_fault_fallbackBarry Song1-2/+2
The documentation does not align with the code. In __do_huge_pmd_anonymous_page(), THP_FAULT_FALLBACK is incremented when mem_cgroup_charge() fails, despite the allocation succeeding, whereas THP_FAULT_ALLOC is only incremented after a successful charge. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Chris Li <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
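The counter behaviour that the commit above documents can be summarised in a few lines. This is an illustrative model only, not kernel code: the function name and the dictionary of counters are hypothetical, and the case where the allocation itself fails is included as an assumption that it also counts as a fallback.

```python
from collections import Counter


def account_thp_fault(alloc_ok: bool, charge_ok: bool, stats: Counter) -> bool:
    """Model which counter a huge-page fault bumps.

    Returns True when the fault is served with a huge page.
    """
    if not alloc_ok:
        # Assumption: an allocation failure is counted as a fallback.
        stats["thp_fault_fallback"] += 1
        return False
    if not charge_ok:
        # Per the commit: the allocation succeeded, but the memcg charge
        # failed, so this is still counted as a fallback.
        stats["thp_fault_fallback"] += 1
        return False
    # Per the commit: thp_fault_alloc is bumped only after a successful charge.
    stats["thp_fault_alloc"] += 1
    return True
```

Under this model, thp_fault_alloc can be lower than the number of huge pages the allocator actually handed out, which is the mismatch the documentation fix addresses.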
2024-05-05mm: add docs for per-order mTHP counters and transhuge_page ABIBarry Song1-0/+28
This patch includes documentation for mTHP counters and an ABI file for sys-kernel-mm-transparent-hugepage, which appears to have been missing for some time. [[email protected]: fix the name and unexpected indentation] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Chris Li <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-02Docs: typos/spellingRemington Brasga1-1/+1
Fix spelling and grammar in Docs descriptions Signed-off-by: Remington Brasga <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Jonathan Corbet <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-25docs: hugetlbpage.rst: add hugetlb migration descriptionBaolin Wang1-0/+7
Add some description of the hugetlb migration strategy. Link: https://lkml.kernel.org/r/63fb16e7a4ebc5cb69ce655af86e29b2d8e9ba34.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-03-29docs: zswap: fix shell command formatWeiji Wang1-2/+2
Format the shell commands as code blocks to keep the documentation in a consistent style. Signed-off-by: Weiji Wang <[email protected]> Signed-off-by: Jonathan Corbet <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-23Docs/admin-guide/mm/damon/reclaim: document auto-tuning parametersSeongJae Park1-0/+27
Update DAMON_RECLAIM usage document for the user/self feedback based auto-tuning of the quota. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23Docs/admin-guide/mm/damon/usage: document quota goal metric fileSeongJae Park1-6/+6
Update DAMON usage document for the quota goal target_metric file. [[email protected]: fix a typo on the auto-tuning design reference link] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23Docs/admin-guide/mm/damon/usage: document effective_bytes fileSeongJae Park1-3/+16
Update DAMON usage document for the effective quota file of the DAMON sysfs interface. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/admin-guide/mm/damon/usage: fix wrong quotas diabling conditionSeongJae Park1-1/+2
After the introduction of DAMOS quotas, DAMOS quotas is not disabled if both size and time quotas are zero but the quota goal is set. The new rule is also applied to DAMON sysfs interface, but the usage doc is not updated. Update it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/mm/damon: move monitoring target regions setup detail from the usage to ↵SeongJae Park1-11/+5
the design document The design doc is meant to hold all concept-level details, while the usage doc focuses only on how the features can be used. Some details about monitoring target regions construction are in the usage doc. Move the details about the monitoring target regions construction differences for DAMON operations sets from the usage doc to the design doc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/mm/damon: move DAMON operation sets list from the usage to the design ↵SeongJae Park1-12/+7
document The list of DAMON operation sets and their explanation, which would be better placed in the design document, is written in the usage document. Move the details to the design document and make the usage document only reference the design document. [[email protected]: fix a typo on a reference link] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/mm/damon: move the list of DAMOS actions to design docSeongJae Park1-34/+13
DAMOS operation actions are explained nearly twice in the DAMON usage document, once for the sysfs interface, and then again for the debugfs interface. Duplication is bad. Also, it would be better to keep this kind of concept-level detail in the design document and keep the usage document small and focused only on usage. Move the list to the design document and update the usage document to reference it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleavingGregory Price1-0/+9
When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be a wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. 
weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actual application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). 
Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Hasan Al Maruf <[email protected]> Signed-off-by: Gregory Price <[email protected]> Co-developed-by: Rakie Kim <[email protected]> Signed-off-by: Rakie Kim <[email protected]> Co-developed-by: Honggyu Kim <[email protected]> Signed-off-by: Honggyu Kim <[email protected]> Co-developed-by: Hyeongtak Ji <[email protected]> Signed-off-by: Hyeongtak Ji <[email protected]> Co-developed-by: Srinivasulu Thanneeru <[email protected]> Signed-off-by: Srinivasulu Thanneeru <[email protected]> Co-developed-by: Ravi Jonnalagadda <[email protected]> Signed-off-by: Ravi Jonnalagadda <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
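The index-based node selection and the "interleave rounds plus partial round" split described in this commit can be illustrated with a small user-space sketch. This is not the kernel code: the helper names total_weight(), pick_node() and split_pages() are hypothetical, weights are passed as a plain array instead of a nodemask, and the carry-over of a partially consumed current weight is ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative helpers only; these are NOT the kernel's
 * weighted_interleave_nid() or its bulk allocator.
 */

/* Sum of all node weights, i.e. the length of one full interleave round. */
static unsigned long total_weight(const unsigned int *weights, size_t nr_nodes)
{
	unsigned long total = 0;
	size_t nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];
	return total;
}

/* Map an interleave index (ilx) to a node id. */
static int pick_node(const unsigned int *weights, size_t nr_nodes,
		     unsigned long ilx)
{
	unsigned long total = total_weight(weights, nr_nodes);
	unsigned long target;
	size_t nid;

	assert(total > 0);
	target = ilx % total;	/* position inside the current round */
	for (nid = 0; nid < nr_nodes; nid++) {
		if (target < weights[nid])
			return (int)nid;
		target -= weights[nid];
	}
	return -1;	/* not reached while total > 0 */
}

/* Split nr_pages into per-node counts: full rounds plus one partial round. */
static void split_pages(const unsigned int *weights, size_t nr_nodes,
			unsigned long nr_pages, unsigned long *per_node)
{
	unsigned long total = total_weight(weights, nr_nodes);
	unsigned long rounds, delta;
	size_t nid;

	assert(total > 0);
	rounds = nr_pages / total;
	delta = nr_pages % total;
	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned long extra = delta < weights[nid] ? delta : weights[nid];

		per_node[nid] = rounds * weights[nid] + extra;
		delta -= extra;
	}
}
```

With weights (2,1), indices 0 and 1 map to node0 and index 2 maps to node1, matching the "2 pages on node0 for every 1 page on node1" example. A 10-page bulk request splits into 3 full rounds plus a 1-page partial round.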
2024-02-22Docs/admin-guide/mm/damon/usage: update for monitor_on renamingSeongJae Park1-14/+15
Update DAMON debugfs interface sections on the usage document to reflect the fact that 'monitor_on' file has been renamed to 'monitor_on_DEPRECATED'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON debugfs ↵SeongJae Park1-3/+10
interface Document the newly added DAMON debugfs interface deprecation notice file on the usage document. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints exampleSeongJae Park1-2/+2
Patch series "mm/damon: make DAMON debugfs interface deprecation unignorable". DAMON debugfs interface was deprecated in February 2023, by commit 5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs interface deprecation notice"). Make the fact hard to ignore by removing an example usage from the document (patch 1), renaming the config (patch 2), adding a deprecation notice file to the debugfs directory (patches 3-5), and renaming the debugfs file that is essential for real use of DAMON (patches 6-9). This patch (of 9): DAMON tracepoints example on the DAMON usage document is using DAMON debugfs interface, which is deprecated. Use its alternative, DAMON sysfs interface. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29zswap: memcontrol: implement zswap writeback disablingNhat Pham1-0/+10
During our experiment with zswap, we sometimes observe swap IOs due to occasional zswap store failures and writebacks-to-swap. These swapping IOs prevent many users who cannot tolerate swapping from adopting zswap to save memory and improve performance where possible. This patch adds the option to disable this behavior entirely: do not write back to the backing swap device when a zswap store attempt fails, and do not write pages in the zswap pool back to the backing swap device (both when the pool is full, and when the new zswap shrinker is called). This new behavior can be opted-in/out on a per-cgroup basis via a new cgroup file. By default, writeback to the swap device is enabled, which is the previous behavior. Initially, writeback is enabled for the root cgroup, and a newly created cgroup will inherit the current setting of its parent. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be stored in the zswap pool (which itself consumes swap space in its current form). This patch should be applied on top of the zswap shrinker series: https://lore.kernel.org/linux-mm/[email protected]/ as it also disables the zswap shrinker, a major source of zswap writebacks. For the most part, this feature is motivated by internal parties who have already established their opinions regarding swapping - the workloads that are highly sensitive to IO, and especially those who are using servers with really slow disk performance (for instance, massive but slow HDDs). For these folks, it's impossible to convince them to even entertain zswap if swapping also comes as a packaged deal. Writeback disabling is quite a useful feature in these situations - on a mixed workloads deployment, they can disable writeback for the more IO-sensitive workloads, and enable writeback for other background workloads. 
For instance, on a server with HDD, I allocate memories and populate them with random values (so that zswap store will always fail), and specify memory.high low enough to trigger reclaim. The time it takes to allocate the memories and just read through it a couple of times (doing silly things like computing the values' average etc.): zswap.writeback disabled: real 0m30.537s user 0m23.687s sys 0m6.637s 0 pages swapped in 0 pages swapped out zswap.writeback enabled: real 0m45.061s user 0m24.310s sys 0m8.892s 712686 pages swapped in 461093 pages swapped out (the last two lines are from vmstat -s). [[email protected]: add a comment about recurring zswap store failures leading to reclaim inefficiency] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Acked-by: Chris Li <[email protected]> Cc: Dan Streetman <[email protected]> Cc: David Heidelberg <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29mm/ksm: document ksm advisor and its sysfs knobsStefan Roesch1-0/+55
This documents the KSM advisor and its new knobs in /sys/kernel/mm/ksm. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29userfaultfd: UFFDIO_MOVE uABIAndrea Arcangeli1-0/+3
Implement the uABI of UFFDIO_MOVE ioctl. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [2]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [2]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/ Update for the ioctl_userfaultfd(2) manpage: UFFDIO_MOVE (Since Linux xxx) Move a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. 
The source and destination addresses and the number of bytes to move are specified by the src, dst, and len fields of the uffdio_move structure pointed to by argp: struct uffdio_move { __u64 dst; /* Destination of move */ __u64 src; /* Source of move */ __u64 len; /* Number of bytes to move */ __u64 mode; /* Flags controlling behavior of move */ __s64 move; /* Number of bytes moved, or negated error */ }; The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_MOVE operation: UFFDIO_MOVE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES Allow holes in the source virtual range that is being moved. When not specified, the holes will result in ENOENT error. When specified, the holes will be accounted as successfully moved memory. This is mostly useful to move hugepage aligned virtual regions without knowing if there are transparent hugepages in the regions or not, but preventing the risk of having to split the hugepage during the operation. The move field is used by the kernel to return the number of bytes that were actually moved, or an error (a negated errno-style value). If the value returned in move doesn't match the value that was specified in len, the operation fails with the error EAGAIN. The move field is output-only; it is not read by the UFFDIO_MOVE operation. The operation may fail for various reasons. Usually, remapping of pages that are not exclusive to the given process fails; once KSM deduplicates pages or fork() COW-shares pages with child processes, they are no longer exclusive. Further, the kernel might only perform lightweight checks for detecting whether the pages are exclusive, and return -EBUSY in case that check fails. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source VMA before fork(). This ioctl(2) operation returns 0 on success. 
In this case, the entire area was moved. On error, -1 is returned and errno is set to indicate the error. Possible errors include: EAGAIN The number of bytes moved (i.e., the value returned in the move field) does not equal the value that was specified in the len field. EINVAL Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid. EINVAL An invalid bit was specified in the mode field. ENOENT The source virtual memory range has unmapped holes and UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. EEXIST The destination virtual memory range is fully or partially mapped. EBUSY The pages in the source virtual memory range are either pinned or not exclusive to the process. The kernel might only perform lightweight checks for detecting whether the pages are exclusive. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source virtual memory area before fork(). ENOMEM Allocating memory needed for the operation failed. ESRCH The target process has exited at the time of a UFFDIO_MOVE operation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Al Viro <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
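The return-value rules in the manpage text above (0 means the whole area moved, EAGAIN means the move field differs from len, anything else is a failure) could be handled by a caller roughly as follows. This is a sketch under assumptions: classify_move() and enum move_outcome are hypothetical names, not part of the uABI, and the caller is assumed to pass in the ioctl return value, the saved errno, and the move and len fields of struct uffdio_move. Treating a non-negative move value under EAGAIN as a resumable partial move is a design choice of this sketch, not something the text mandates.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical outcome categories, introduced here for illustration. */
enum move_outcome {
	MOVE_COMPLETE,	/* ioctl returned 0: the entire area was moved */
	MOVE_PARTIAL,	/* EAGAIN: 'move' reports fewer bytes than 'len' */
	MOVE_ERROR	/* any other failure */
};

/*
 * ret:   return value of ioctl(uffd, UFFDIO_MOVE, &args)
 * err:   errno saved right after the call
 * moved: the 'move' field written back by the kernel
 * len:   the 'len' field that was requested
 * bytes_done receives the number of bytes known to have been moved.
 */
static enum move_outcome classify_move(int ret, int err, int64_t moved,
				       uint64_t len, uint64_t *bytes_done)
{
	assert(bytes_done);

	if (ret == 0) {
		*bytes_done = len;
		return MOVE_COMPLETE;
	}
	if (err == EAGAIN && moved >= 0 && (uint64_t)moved < len) {
		*bytes_done = (uint64_t)moved;
		return MOVE_PARTIAL;
	}
	*bytes_done = 0;
	return MOVE_ERROR;
}
```

On MOVE_PARTIAL a caller might retry with src, dst and len advanced by bytes_done. On MOVE_ERROR with EBUSY, the advice in the text applies: disable KSM or set MADV_DONTFORK on the source VMA before fork().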
2023-12-20Docs/admin-guide/mm/damon/usage: use a list for 'state' sysfs file input ↵SeongJae Park1-24/+23
commands There are eight command inputs for 'state' DAMON sysfs file, and those are verbosely explained in multiple paragraphs. It is not easy to find the explanation of a specific command or to get the whole picture of the supported commands. Replace the paragraphs with a list. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-20Docs/admin-guide/mm/damon/usage: add links to sysfs files hierarchySeongJae Park1-21/+49
'Sysfs Files Hierarchy' section of DAMON usage document shows the whole picture of the interface. Then sections for detailed explanation of the files follow. Due to the number of files, navigating between the whole picture and the section for specific files sometimes requires a considerable amount of scrolling. Add links from the whole picture to the dedicated sections to make the navigation easier. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-20Docs/admin-guide/mm/damon/usage: update context directory section labelSeongJae Park1-3/+3
The label for the context DAMON sysfs directory section is named sysfs_contexts. That name would be better used for the contexts directory. Rename it to represent a single context. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>