path: root/mm
Age | Commit message | Author | Files | Lines
2022-11-30  mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped  (Hugh Dickins; 2 files changed, -97/+26)
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit; but if we slightly abuse subpages_mapcount by additionally demanding that one bit be set there when the compound page is PMD-mapped, then a cascade of two atomic ops is able to maintain the stats without bit_spin_lock. This is harder to reason about than when bit_spin_locked, but I believe safe; and no drift in stats detected when testing. When there are racing removes and adds, of course the sequence of operations is less well- defined; but each operation on subpages_mapcount is atomically good. What might be disastrous, is if subpages_mapcount could ever fleetingly appear negative: but the pte lock (or pmd lock) these rmap functions are called under, ensures that a last remove cannot race ahead of a first add. Continue to make an exception for hugetlb (PageHuge) pages, though that exception can be easily removed by a further commit if necessary: leave subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just carry on checking compound_mapcount too in folio_mapped(), page_mapped(). Evidence is that this way goes slightly faster than the previous implementation in all cases (pmds after ptes now taking around 103ms); and relieves us of worrying about contention on the bit_spin_lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
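To make the "cascade of two atomic ops" above concrete, here is a minimal sketch, not the applied patch: the field names follow the commit message and the COMPOUND_MAPPED bit value is illustrative, chosen only to sit above any possible count of PTE-mapped subpages.
#define COMPOUND_MAPPED         (1 << 24)      /* illustrative bit value */

static void page_add_pmd_mapping(atomic_t *compound_mapcount,
                                 atomic_t *subpages_mapcount)
{
        atomic_inc(compound_mapcount);
        /* Second atomic op marks the page as PMD-mapped, no bit_spin_lock. */
        atomic_add(COMPOUND_MAPPED, subpages_mapcount);
}

static void page_remove_pmd_mapping(atomic_t *compound_mapcount,
                                    atomic_t *subpages_mapcount)
{
        atomic_dec(compound_mapcount);
        atomic_sub(COMPOUND_MAPPED, subpages_mapcount);
}
Because the pte/pmd lock prevents a last remove racing ahead of a first add, the low bits of subpages_mapcount stay a valid count of PTE-mapped subpages even while these two operations race.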
2022-11-30  mm,thp,rmap: subpages_mapcount of PTE-mapped subpages  (Hugh Dickins; 2 files changed, -93/+69)
Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2. This patch (of 3): Following suggestion from Linus, instead of counting every PTE map of a compound page in subpages_mapcount, just count how many of its subpages are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and requires updating the count less often. This does then revert total_mapcount() and folio_mapcount() to needing a scan of subpages; but they are inherently racy, and need no locking, so Linus is right that the scans are much better done there. Plus (unlike in 6.1 and previous) subpages_mapcount lets us avoid the scan in the common case of no PTE maps. And page_mapped() and folio_mapped() remain scanless and just as efficient with the new meaning of subpages_mapcount: those are the functions which I most wanted to remove the scan from. The updated page_dup_compound_rmap() is no longer suitable for use by anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted. Evidence is that this way goes slightly faster than the previous implementation for most cases; but significantly faster in the (now scanless) pmds after ptes case, which started out at 870ms and was brought down to 495ms by the previous series, now takes around 105ms. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_vmemmap: remap head page to newly allocated page  (Joao Martins; 1 file changed, -7/+34)
Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to the page allocator is as follows: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that is freed back to the page allocator cannot be used for hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size: Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node having 64G): * Before allocation: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Movable 340 100 32 15 1 2 0 0 0 1 15558 $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages 31987 * After: Node 0, zone Normal, type Movable 30893 32006 31515 7 0 0 0 0 0 0 0 Notice how the memory freed back is put back into the 4K / 8K / 16K page pools. And it allocates a total of 31987 pages (63974M). To fix this behaviour, rather than remapping the second vmemmap page (thus breaking the contiguous block of memory backing the struct pages), repopulate the first vmemmap page with a new one. We allocate and copy from the currently mapped vmemmap page, and then remap it later on. The same algorithm works if there's a pre-initialized walk::reuse_page and the head page doesn't need to be skipped; instead we remap it when the @addr being changed is the @reuse_addr. The new head page is allocated in vmemmap_remap_free() given that on restore there's no need for functional change. Note that, because right now one hugepage is remapped at a time, only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped, just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory, while needing in total 128M of new 4K pages (for 2M hugetlb) or 256K (for 1G hugetlb). After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node guest, each node with 64G): * Before allocation: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Movable 1 1 1 0 0 1 0 0 1 1 15564 $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages 32394 * After: Node 0, zone Normal, type Movable 0 50 97 108 96 81 70 46 18 0 0 In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out of the 32394 (64788M) allocated. So the memory freed back is indeed being used back in hugetlb, and there are no massive amounts of order-0..order-2 pages accumulating unused. [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: add smp_wmb() to ensure page contents are visible prior to PTE write] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
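A rough sketch of the copy-then-remap idea described above, with an invented helper name; the real patch threads this through the vmemmap remap walk and its walk::reuse_page, so treat this as an illustration of the ordering only.
/* Sketch: allocate a fresh page, copy the mapped head vmemmap page into it,
 * publish the contents, and only then switch the PTE, so the original
 * contiguous block can be handed back whole. */
static int vmemmap_copy_head_page(unsigned long addr, pte_t *ptep)
{
        struct page *copy = alloc_pages(GFP_KERNEL, 0);

        if (!copy)
                return -ENOMEM;         /* fall back to reusing the mapped page */

        copy_page(page_to_virt(copy), page_to_virt(pte_page(*ptep)));
        smp_wmb();                      /* contents visible before the PTE write */
        set_pte_at(&init_mm, addr, ptep, mk_pte(copy, PAGE_KERNEL));
        return 0;
}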
2022-11-30  mm,thp,rmap: handle the normal !PageCompound case first  (Hugh Dickins; 1 file changed, -27/+27)
Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts") propagated the "if (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern throughout; but Linus hated the way that gives primacy to the uncommon case: switch to "if (!PageCompound) {atomic} else if (compound) {lock} else {lock}" throughout. Linus has a bigger idea for how to improve it all, but here just make that rearrangement. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts  (Hugh Dickins; 2 files changed, -147/+189)
Fix the races in maintaining compound_mapcount, subpages_mapcount and subpage _mapcount by using PG_locked in the first tail of any compound page for a bit_spin_lock() on such modifications; skipping the usual atomic operations on those fields in this case. Bring page_remove_file_rmap() and page_remove_anon_compound_rmap() back into page_remove_rmap() itself. Rearrange page_add_anon_rmap() and page_add_file_rmap() and page_remove_rmap() to follow the same "if (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern (with a PageTransHuge in the compound test, like before, to avoid BUG_ONs and optimize away that block when THP is not configured). Move all the stats updates outside, after the bit_spin_locked section, so that it is sure to be a leaf lock. Add page_dup_compound_rmap() to manage compound locking versus atomics in sync with the rest. In particular, hugetlb pages are still using the atomics: to avoid unnecessary interference there, and because they never have subpage mappings; but this exception can easily be changed. Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's __split_huge_pmd_locked() too. bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure is to the non-hugetlb non-THP pte-mapped compound pages (with large folios being currently dependent on TRANSPARENT_HUGEPAGE). There is never any scan of subpages in this case; but we have chosen to use PageCompound tests rather than PageTransCompound tests to gate the use of lock_compound_mapcounts(), so that page_mapped() is correct on all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled: could that be a problem for PREEMPT_RT, when there is contention on the lock - under heavy concurrent forking for example? If so, then it can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT. A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 115 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 86ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. Mapping huge pages by ptes is largely unaffected but variable: between 5% faster and 5% slower in what I've recorded. Contention on the lock is likely to behave worse than contention on the atomics behaved. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
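For orientation, the locking pattern described above reduces to roughly the following; this is an illustrative sketch, not the applied code, with field access simplified.
/* PG_locked of the first tail page used as a bit spinlock serialising
 * compound_mapcount, subpages_mapcount and subpage _mapcount updates. */
static void compound_mapcount_update_sketch(struct page *head)
{
        bit_spin_lock(PG_locked, &head[1].flags);
        /* ... modify compound_mapcount / subpages_mapcount here ... */
        bit_spin_unlock(PG_locked, &head[1].flags);
        /* NR_ANON_MAPPED / NR_FILE_MAPPED stats are updated after the unlock,
         * keeping the bit spinlock a leaf lock. */
}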
2022-11-30  mm,thp,rmap: simplify compound page mapcount handling  (Hugh Dickins; 8 files changed, -196/+112)
Compound page (folio) mapcount calculations have been different for anon and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. And each huge mapping and unmapping of a file (or shmem) THP involved atomically incrementing and decrementing the mapcount of every subpage of that huge page, dirtying many struct page cachelines. Add subpages_mapcount field to the struct folio and first tail page, so that the total of subpage mapcounts is available in one place near the head: then page_mapcount() and total_mapcount() and page_mapped(), and their folio equivalents, are so quick that anon and file and hugetlb don't need to be optimized differently. Delete the unloved PageDoubleMap. page_add and page_remove rmap functions must now maintain the subpages_mapcount as well as the subpage _mapcount, when dealing with pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and NR_FILE_MAPPED statistics still needs reading through the subpages, using nr_subpages_unmapped() - but only when first or last pmd mapping finds subpages_mapcount raised (double-map case, not the common case). But are those counts (used to decide when to split an anon THP, and in vmscan's pagecache_reclaimable heuristic) correctly maintained? Not quite: since page_remove_rmap() (and also split_huge_pmd()) is often called without page lock, there can be races when a subpage pte mapcount 0<->1 while compound pmd mapcount 0<->1 is scanning - races which the previous implementation had prevented. The statistics might become inaccurate, and even drift down until they underflow through 0. That is not good enough, but is better dealt with in a followup patch. Update a few comments on first and second tail page overlaid fields. hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but subpages_mapcount and compound_pincount are already correctly at 0, so delete its reinitialization of compound_pincount. A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took 18 seconds on small pages, and used to take 1 second on huge pages, but now takes 119 milliseconds on huge pages. Mapping by pmds a second time used to take 860ms and now takes 92ms; mapping by pmds after mapping by ptes (when the scan is needed) used to take 870ms and now takes 495ms. But there might be some benchmarks which would show a slowdown, because tail struct pages now fall out of cache until final freeing checks them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm,hugetlb: use folio fields in second tail page  (Hugh Dickins; 2 files changed, -4/+3)
Patch series "mm,huge,rmap: unify and speed up compound mapcounts". This patch (of 3): We want to declare one more int in the first tail of a compound page: that first tail page being valuable property, since every compound page has a first tail, but perhaps no more than that. No problem on 64-bit: there is already space for it. No problem with 32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always available") kindly cleared the space for it, apparently not realizing that only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail page->private might conflict) - but make sure of that in its Kconfig. But hugetlb pages use tail page->private of the first tail page for a subpool pointer, which will conflict; and they also use page->private of the 2nd, 3rd and 4th tails. Undo "mm: add private field of first tail to struct page and struct folio"'s recent addition of private_1 to the folio tail: instead add hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to a second tail page of the folio: THP has long been using several fields of that tail, so make better use of it for hugetlb too. This is not how a generic folio should be declared in future, but it is an effective transitional way to make use of it. Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3. [[email protected]: prefix folio's page_1 and page_2 with double underscore, give folio's _flags_2 and _head_2 a line documentation each] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: John Hubbard <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  zsmalloc: replace IS_ERR() with IS_ERR_VALUE()  (Deming Wang; 1 file changed, -1/+1)
Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Deming Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
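The change is mechanical; a minimal illustration follows (a hypothetical caller, not the zsmalloc code itself), showing why the cast disappears.
#include <linux/err.h>

/* handle is an unsigned long that may encode an errno, as in zsmalloc. */
static bool handle_is_error(unsigned long handle)
{
        /* Before: IS_ERR((void *)handle) -- needs a cast to a pointer type.
         * After:  IS_ERR_VALUE() takes the unsigned long directly.          */
        return IS_ERR_VALUE(handle);
}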
2022-11-30  mm: use pte markers for swap errors  (Peter Xu; 3 files changed, -4/+6)
PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR. Using a whole swap entry type for this purpose can be overkill, especially if we already have PTE markers. Define a new bit for the swapin error and replace it with pte markers. Then we can safely drop SWP_SWAPIN_ERROR and give one device slot back to swap. We used to have SWP_SWAPIN_ERROR taking the page pfn as part of the swap entry, but it's never used. Neither do I see how it can be useful, because normally a swapin failure should not be caused by a bad page but by a bad swap device. Drop it as well. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Huang Ying <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm: always compile in pte markers  (Peter Xu; 2 files changed, -11/+3)
Patch series "mm: Use pte marker for swapin errors". This series uses the pte marker to replace the swapin error swap entry, then we save one more swap entry slot for swap devices. A new pte marker bit is defined. This patch (of 2): The PTE markers code is tiny and now it's enabled for most of the distributions. It's fine to keep it as-is, but to make a broader use of it (e.g. replacing read error swap entry) it needs to be there always otherwise we need special code path to take care of !PTE_MARKER case. It'll be easier just make pte marker always exist. Use this chance to extend its usage to anonymous too by simply touching up some of the old comments, because it'll be used for anonymous pages in the follow up patches. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Huang Ying <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm: use kstrtobool() instead of strtobool()  (Christophe JAILLET; 2 files changed, -2/+4)
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Link: https://lkml.kernel.org/r/03f9401a6c8b87a1c786a2138d16b048f8d0eb53.1667336095.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
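The two functions share a signature, so the conversion is a rename plus the extra include; a minimal illustration (a hypothetical sysfs-style parser, not the patched code):
#include <linux/kstrtox.h>

static int parse_enable(const char *buf, bool *enabled)
{
        /* Was: return strtobool(buf, enabled); */
        return kstrtobool(buf, enabled);        /* 0 on success, -EINVAL otherwise */
}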
2022-11-30  mm/damon: use kstrtobool() instead of strtobool()  (Christophe JAILLET; 2 files changed, -2/+4)
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Link: https://lkml.kernel.org/r/ed2b46489a513988688decb53850339cc228940c.1667336095.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs-schemes: implement DAMOS-tried regions clear command  (SeongJae Park; 3 files changed, -1/+37)
When there is a huge number of DAMON regions that specific scheme actions are tried to be applied to, the directories and files under the 'tried_regions' scheme directory could waste some memory. Add another special input keyword ('clear_schemes_tried_regions') for the 'state' file of each kdamond sysfs directory that can be used to clean up the 'tried_regions' sub-directories. [[email protected]: skip regions clearing if the scheme directory was removed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: implement DAMOS tried regions update command  (SeongJae Park; 3 files changed, -2/+141)
Implement the code for filling the data of 'tried_regions' DAMON sysfs directory. With this commit, DAMON sysfs interface users can write a special keyword, 'update_schemes_tried_regions' to the corresponding 'state' file of the kdamond. Then, DAMON sysfs interface will collect the tried regions information using the 'before_damos_apply()' callback for one aggregation interval and populate scheme region directories with the values. [[email protected]: skip tried regions update if the scheme directory was removed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs-schemes: implement scheme region directory  (SeongJae Park; 1 file changed, -1/+122)
Implement region directories under the 'tried_regions' directory of each scheme DAMON sysfs directory. These directories will provide the address range, the monitored access frequency ('nr_accesses'), and the age of each DAMON region that the corresponding DAMON-based operation scheme has tried to be applied to. Note that this commit implements only the sysfs directory, not the code for filling the data. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs-schemes: implement schemes/tried_regions directory  (SeongJae Park; 1 file changed, -0/+57)
For efficient and simple query-like readings of DAMON monitoring results and deep-level investigations of DAMOS, DAMON kernel API (include/linux/damon.h) users can use the 'before_damos_apply' DAMON callback. However, DAMON sysfs interface users don't have such an option. Add a directory, namely 'tried_regions', under each scheme directory to use it as the interface for the purpose. Note that this commit implements only the directory, not the data filling. After the data filling change is made, users will be able to signal DAMON to fill the directory with the regions that the corresponding scheme has tried to be applied to. By setting the access pattern of the scheme, users can do efficient query-like monitoring. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/core: add a callback for scheme target regions check  (SeongJae Park; 1 file changed, -1/+5)
Patch series "efficiently expose damos action tried regions information". DAMON users can retrieve the monitoring results via 'after_aggregation' callbacks if the user is using the kernel API, or 'damon_aggregated' tracepoint if the user is in the user space. Those are useful if full monitoring results are necessary. However, if the user has interest in only a snapshot of the results for some regions having specific access pattern, the interfaces could be inefficient. For example, some users only want to know which memory regions are not accessed for more than a specific time at the moment. Also, some DAMOS users would want to know exactly to what memory regions the schemes' actions tried to be applied, for a debugging or a tuning. As DAMOS has its internal mechanism for quota and regions prioritization, the users would need to simulate DAMOS' mechanism against the monitoring results. That's unnecessarily complex. This patchset implements DAMON kernel API callbacks and sysfs directory for efficient exposure of the information for the use cases. The new callback will be called for each region when a DAMOS action is gonna tried to be applied to it. The sysfs directory will be called 'tried_regions' and placed under each scheme sysfs directory. Users can write a special keyworkd, 'update_schemes_regions', to the 'state' file of a kdamond sysfs directory. Then, DAMON sysfs interface will fill the directory with the information of regions that corresponding scheme action was tried to be applied for next one aggregation interval. Patches Sequence ---------------- The first one (patch 1) implements the callback for the kernel space users. Following two patches (patches 2 and 3) implements sysfs directories for the information and its sub directories. Two patches (patches 4 and 5) for implementing the special keywords for filling the data to and cleaning up the directories follow. Patch 6 adds a selftest for the new sysfs directory. Finally, two patches (patches 7 and 8) document the new feature in the administrator guide and the ABI document. This patch (of 8): Getting DAMON monitoring results of only specific access pattern (e.g., getting address ranges of memory that not accessed at all for two minutes) can be useful for efficient monitoring of the system. The information can also be helpful for deep level investigation of DAMON-based operation schemes. For that, users need to record (in case of the user space users) or iterate (in case of the kernel space users) full monitoring results and filter it out for the specific access pattern. In case of the DAMOS investigation, users will even need to simulate DAMOS' quota and prioritization mechanisms. It's inefficient and complex. Add a new DAMON callback that will be called before each scheme is applied to each region. DAMON kernel API users will be able to do the query-like monitoring results collection, or DAMOS investigation in an efficient and simple way using it. Commits for providing the capability to the user space users will follow. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb: convert move_hugetlb_state() to folios  (Sidhartha Kumar; 2 files changed, -12/+14)
Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to take in folios. [[email protected]: fix CONFIG_HUGETLB_PAGE=n build] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugeltb_cgroup: convert hugetlb_cgroup_commit_charge*() to folios  (Sidhartha Kumar; 1 file changed, -6/+10)
Convert hugetlb_cgroup_commit_charge*() to internally use folios to clean up the code after __set_hugetlb_cgroup() was changed to take a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios  (Sidhartha Kumar; 2 files changed, -17/+19)
Continue to use a folio inside free_huge_page() by converting hugetlb_cgroup_uncharge_page*() to folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb: convert free_huge_page to folios  (Sidhartha Kumar; 1 file changed, -13/+14)
Use folios inside free_huge_page(), this is in preparation for converting hugetlb_cgroup_uncharge_page() to take in a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb: convert isolate_or_dissolve_huge_page to folios  (Sidhartha Kumar; 1 file changed, -7/+6)
Removes a call to compound_head() by using a folio when operating on the head page of a hugetlb compound page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate to folios  (Sidhartha Kumar; 2 files changed, -6/+4)
Cleans up intermediate page to folio conversion code in hugetlb_cgroup_migrate() by changing its arguments from pages to folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to folios  (Sidhartha Kumar; 2 files changed, -19/+25)
Allows __prep_new_huge_page() to operate on a folio by converting set_hugetlb_cgroup*() to take in a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to folios  (Sidhartha Kumar; 2 files changed, -7/+11)
Introduce folios in __remove_hugetlb_page() by converting hugetlb_cgroup_from_page() to use folios. Also gets rid of the unused hugetlb_cgroup_from_page_resv() function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/hugetlb_cgroup: convert __set_hugetlb_cgroup() to folios  (Sidhartha Kumar; 1 file changed, -2/+2)
Patch series "convert hugetlb_cgroup helper functions to folios", v2. This patch series continues the conversion of hugetlb code from being managed in pages to folios by converting many of the hugetlb_cgroup helper functions to use folios. This allows the core hugetlb functions to pass in a folio to these helper functions. This patch (of 9); Change __set_hugetlb_cgroup() to use folios so it is explicit that the function operates on a head page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mempool: do not use ksize() for poisoning  (Kees Cook; 1 file changed, -6/+12)
Nothing appears to be using ksize() within the kmalloc-backed mempools except the mempool poisoning logic. Use the actual pool size instead of ksize() to avoid needing any special handling of the memory, as would otherwise be required by KASAN, UBSAN_BOUNDS, or FORTIFY_SOURCE. [[email protected]: for slab mempools pool_data is not object size] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marco Elver <[email protected]> Cc: Vincenzo Frascino <[email protected]> Reported-by: Anders Roxell <[email protected]> Link: https://lore.kernel.org/all/20221031105514.GB69385@mutt/ Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm: vmscan: split khugepaged stats from direct reclaim stats  (Johannes Weiner; 4 files changed, -10/+38)
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. However, khugepaged currently distorts the picture: as a kernel thread it doesn't impose allocation latencies on userspace, and it explicitly opts out of kswapd reclaim. Its activity showing up in the direct reclaim stats is misleading. Counting it as kswapd reclaim could also cause confusion when trying to understand actual kswapd behavior. Break out khugepaged from the direct reclaim counters into new pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters. Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS): pgsteal_kswapd 1342185 pgsteal_direct 0 pgsteal_khugepaged 3623 pgscan_kswapd 1345025 pgscan_direct 0 pgscan_khugepaged 3623 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Eric Bergen <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm: hugetlb_vmemmap: remove redundant list_del()  (Muchun Song; 1 file changed, -3/+1)
The ->lru field will be assigned to a new value in __free_page(). So it is unnecessary to delete it from the @list. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
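The pattern in question, sketched (not the exact hugetlb_vmemmap code): the list is local and discarded after the loop, and __free_page() reuses page->lru anyway, so the list_del() before freeing buys nothing.
static void free_vmemmap_page_list_sketch(struct list_head *list)
{
        struct page *page, *next;

        list_for_each_entry_safe(page, next, list, lru) {
                /* list_del(&page->lru);  -- redundant: ->lru is reused by __free_page() */
                __free_page(page);
        }
}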
2022-11-30  mm, hwpoison: when copy-on-write hits poison, take page offline  (Tony Luck; 1 file changed, -1/+3)
memory_failure() cannot be called directly from the fault handler because mmap_lock (and others) are held. It is important, but not urgent, to mark the source page as h/w poisoned and unmap it from other tasks. Use memory_failure_queue() to request a call to memory_failure() for the page with the error. Also provide a stub version for CONFIG_MEMORY_FAILURE=n. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Shuai Xue <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
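The CONFIG_MEMORY_FAILURE=n stub mentioned above boils down to the usual pattern; this is sketched from the description rather than copied from the patch.
#ifdef CONFIG_MEMORY_FAILURE
extern void memory_failure_queue(unsigned long pfn, int flags);
#else
static inline void memory_failure_queue(unsigned long pfn, int flags)
{
        /* No recovery support configured: silently ignore the request. */
}
#endif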
2022-11-30  mm, hwpoison: try to recover from copy-on write faults  (Tony Luck; 1 file changed, -10/+20)
Patch series "Copy-on-write poison recovery", v3. Part 1 deals with the process that triggered the copy on write fault with a store to a shared read-only page. That process is send a SIGBUS with the usual machine check decoration to specify the virtual address of the lost page, together with the scope. Part 2 sets up to asynchronously take the page with the uncorrected error offline to prevent additional machine check faults. H/t to Miaohe Lin <[email protected]> and Shuai Xue <[email protected]> for pointing me to the existing function to queue a call to memory_failure(). On x86 there is some duplicate reporting (because the error is also signalled by the memory controller as well as by the core that triggered the machine check). Console logs look like this: This patch (of 2): If the kernel is copying a page as the result of a copy-on-write fault and runs into an uncorrectable error, Linux will crash because it does not have recovery code for this case where poison is consumed by the kernel. It is easy to set up a test case. Just inject an error into a private page, fork(2), and have the child process write to the page. I wrapped that neatly into a test at: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git just enable ACPI error injection and run: # ./einj_mem-uc -f copy-on-write Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() on architectures where that is available (currently x86 and powerpc). When an error is detected during the page copy, return VM_FAULT_HWPOISON to caller of wp_page_copy(). This propagates up the call stack. Both x86 and powerpc have code in their fault handler to deal with this code by sending a SIGBUS to the application. Note that this patch avoids a system crash and signals the process that triggered the copy-on-write action. It does not take any action for the memory error that is still in the shared page. To handle that a call to memory_failure() is needed. But this cannot be done from wp_page_copy() because it holds mmap_lock(). Perhaps the architecture fault handlers can deal with this loose end in a subsequent patch? On Intel/x86 this loose end will often be handled automatically because the memory controller provides an additional notification of the h/w poison in memory, the handler for this will call memory_failure(). This isn't a 100% solution. If there are multiple errors, not all may be logged in this way. [[email protected]: add call to kmsan_unpoison_memory(), per Miaohe Lin] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Tested-by: Shuai Xue <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm: convert mm's rss stats into percpu_counter  (Shakeel Butt; 1 file changed, -62/+11)
Currently mm_struct maintains rss_stats which are updated on the page fault and unmapping codepaths. For the page fault codepath the updates are cached per thread, with a batch of TASK_RSS_EVENTS_THRESH which is 64. The reason for caching is performance for multithreaded applications; otherwise the rss_stats updates may become a hotspot for such applications. However this optimization comes with the cost of an error margin in the rss stats. The rss_stats for applications with a large number of threads can be very skewed. At worst the error margin is (nr_threads * 64), and we have a lot of applications with 100s of threads, so the error margin can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. Recently we started seeing unbounded errors in rss_stats for specific applications which use TCP rx0cp. It seems like the vm_insert_pages() codepath does not sync rss_stats at all. This patch converts the rss_stats into percpu_counter, which reduces the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). This conversion also enables us to get accurate stats for situations where accuracy is more important than the cpu cost. This patch does not make such tradeoffs - we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are the procfs interface, the oom_killer and memory reclaim, which I think are not performance critical and should be ok with a slow read. However I think we can make that change in a separate patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
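In outline, the conversion swaps the per-thread event counters for per-mm percpu counters; a minimal sketch of the update/read split described above, with an illustrative struct rather than the real mm_struct layout (counters would still need percpu_counter_init() at mm setup).
#include <linux/percpu_counter.h>

struct mm_rss_sketch {
        struct percpu_counter rss_stat[NR_MM_COUNTERS];
};

static void add_mm_counter_sketch(struct mm_rss_sketch *mm, int member, long value)
{
        /* Fast, cpu-local update on the fault/unmap paths. */
        percpu_counter_add_local(&mm->rss_stat[member], value);
}

static s64 get_mm_counter_sketch(struct mm_rss_sketch *mm, int member)
{
        /* Slower but exact read, for procfs / oom-killer style consumers. */
        return percpu_counter_sum(&mm->rss_stat[member]);
}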
2022-11-30  mm/damon/lru_sort: enable and disable synchronously  (SeongJae Park; 1 file changed, -29/+22)
Writing a value to DAMON_LRU_SORT's 'enabled' parameter turns DAMON on or off in an asynchronous way. This means the parameter cannot be used to read the current status of DAMON_LRU_SORT; the 'kdamond_pid' parameter should be used instead for that purpose. The documentation can easily be read as if the parameter works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter work synchronously, rather than updating the document. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/reclaim: enable and disable synchronously  (SeongJae Park; 1 file changed, -30/+23)
Patch series "mm/damon/reclaim,lru_sort: enable/disable synchronously". Writing a value to DAMON_RECLAIM and DAMON_LRU_SORT's 'enabled' parameters turns on or off DAMON in an ansychronous way. This means the parameter cannot be used to read the current status of them. 'kdamond_pid' parameter should be used instead for the purpose. The documentation is easy to be read as it works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter works synchronously, rather than updating the document. The first and second patches changes the behavior of the 'enabled' parameter for DAMON_RECLAIM and adds a selftest for the changed behavior, respectively. Following two patches make the same changes for DAMON_LRU_SORT. This patch (of 4): Writing a value to DAMON_RECLAIM's 'enabled' parameter turns on or off DAMON in an ansychronous way. This means the parameter cannot be used to read the current status of DAMON_RECLAIM. 'kdamond_pid' parameter should be used instead for the purpose. The documentation is easy to be read as it works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter works synchronously, rather than updating the document. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/{reclaim,lru_sort}: remove unnecessarily included headers  (SeongJae Park; 2 files changed, -4/+0)
Some headers that 'reclaim.c' and 'lru_sort.c' are including are unnecessary now owing to previous cleanups and refactorings. Remove those. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/modules: deduplicate init steps for DAMON context setup  (SeongJae Park; 5 files changed, -30/+53)
DAMON_RECLAIM and DAMON_LRU_SORT has duplicated code for DAMON context and target initializations. Deduplicate the part by implementing a function for the initialization in 'modules-common.c' and using it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: split out schemes directory implementation to separate file  (SeongJae Park; 4 files changed, -1065/+1091)
DAMON sysfs interface for 'schemes' directory is implemented using about one thousand lines of code. It has no strong dependency with other parts of its file, so split it out to another file for better code management. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: split out kdamond-independent schemes stats update logic into a new function  (SeongJae Park; 1 file changed, -15/+22)
'damon_sysfs_schemes_update_stats()' is coupled with both damon_sysfs_kdamond and damon_sysfs_schemes. That is a dependency on a wide range of types, and it makes splitting the logic a little bit distracting. Split the function so that each resulting function is coupled with a smaller range of types. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: move unsigned long range directory to common module  (SeongJae Park; 3 files changed, -100/+109)
The implementation of unsigned long type range directories can be reused by multiple DAMON sysfs directories including those for DAMON-based Operation Schemes and the range of number of monitoring regions. Move the code into the files for DAMON sysfs common logics. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: move sysfs_lock to common module  (SeongJae Park; 4 files changed, -4/+24)
DAMON sysfs interface is implemented in a single file, sysfs.c, which has about 2,800 lines of code. As the interface is hierarchical and some of the code can be reused by different hierarchies, it would make more sense to split out the implementation into common parts and different parts in multiple files. As the beginning of the work, create files for common code and move the global mutex for directories modifications protection into the new file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/sysfs: remove parameters of damon_sysfs_region_alloc()  (SeongJae Park; 1 file changed, -11/+3)
'damon_sysfs_region_alloc()' is always called with a zero-filled 'struct damon_addr_range', because the start and end addresses should be set by users. Remove the unnecessary parameters of the function and simplify the body by using 'kzalloc()'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
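The simplification amounts to the following shape; this is a sketch with the struct layout abridged, since 'struct damon_sysfs_region' is local to the sysfs code.
/* Region directories start zeroed; users fill in start/end later. */
static struct damon_sysfs_region *damon_sysfs_region_alloc_sketch(void)
{
        /* kzalloc() zero-fills, so no zero-valued parameters are needed. */
        return kzalloc(sizeof(struct damon_sysfs_region), GFP_KERNEL);
}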
2022-11-30  mm/damon/sysfs: use damon_addr_range for region's start and end values  (SeongJae Park; 1 file changed, -14/+11)
DAMON has a struct for each address range, but the DAMON sysfs interface is using plain 'unsigned long' values for storing the start and end addresses of regions. Use the dedicated struct for better type safety. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/core: split out scheme quota adjustment logic into a new function  (SeongJae Park; 1 file changed, -43/+48)
DAMOS quota adjustment logic in 'kdamond_apply_schemes()', has some amount of code, and the logic is not so straightforward. Split it out to a new function for better readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/core: split out scheme stat update logic into a new function  (SeongJae Park; 1 file changed, -5/+11)
The function for applying a given DAMON scheme action to a given DAMON region, 'damos_apply_scheme()' is not quite short. Make it better to read by splitting out the stat update logic into a new function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/core: split damos application logic into a new function  (SeongJae Park; 1 file changed, -34/+39)
The DAMOS action applying function, 'damon_do_apply_schemes()', is still long and not easy to read. Split out the code for applying a single action to a single region into a new function for better readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/damon/core: split out DAMOS-charged region skip logic into a new function  (SeongJae Park; 1 file changed, -31/+65)
Patch series "mm/damon: cleanup and refactoring code", v2. This patchset cleans up and refactors a range of DAMON code including the core, DAMON sysfs interface, and DAMON modules, for better readability and convenient future feature implementations. In detail, this patchset splits unnecessarily long and complex functions in core into smaller functions (patches 1-4). Then, it cleans up the DAMON sysfs interface by using more type-safe code (patch 5) and removing unnecessary function parameters (patch 6). Further, it refactor the code by distributing the code into multiple files (patches 7-10). Last two patches (patches 11 and 12) deduplicates and remove unnecessary header inclusion in DAMON modules (reclaim and lru_sort). This patch (of 12): The DAMOS action applying function, 'damon_do_apply_schemes()', is quite long and not so simple. Split out the already quota-charged region skip code, which is not a small amount of simple code, into a new function with some comments for better readability. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  Merge branch 'mm-hotfixes-stable' into mm-stable  (Andrew Morton; 23 files changed, -115/+295)
2022-11-30  mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths  (Jann Horn; 1 file changed, -0/+5)
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
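The fix follows the standard pattern for zapping entries, sketched below; the exact mmu_notifier_range_init() argument list for this kernel era is an assumption, and the helper name is invented for illustration.
static void collapse_zap_with_notifiers(struct vm_area_struct *vma,
                                        struct mm_struct *mm,
                                        unsigned long addr, pmd_t *pmd)
{
        struct mmu_notifier_range range;

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
                                addr, addr + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);
        /* ... pmdp_collapse_flush() / page table removal goes here ... */
        mmu_notifier_invalidate_range_end(&range);
}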
2022-11-30  mm/khugepaged: fix GUP-fast interaction by sending IPI  (Jann Horn; 2 files changed, -3/+3)
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30  mm/khugepaged: take the right locks for page table retraction  (Jann Horn; 1 file changed, -4/+51)
pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>