aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-11-30mm/damon/core: add a callback for scheme target regions checkSeongJae Park2-1/+10
Patch series "efficiently expose damos action tried regions information". DAMON users can retrieve the monitoring results via 'after_aggregation' callbacks if the user is using the kernel API, or 'damon_aggregated' tracepoint if the user is in the user space. Those are useful if full monitoring results are necessary. However, if the user has interest in only a snapshot of the results for some regions having specific access pattern, the interfaces could be inefficient. For example, some users only want to know which memory regions are not accessed for more than a specific time at the moment. Also, some DAMOS users would want to know exactly to what memory regions the schemes' actions tried to be applied, for a debugging or a tuning. As DAMOS has its internal mechanism for quota and regions prioritization, the users would need to simulate DAMOS' mechanism against the monitoring results. That's unnecessarily complex. This patchset implements DAMON kernel API callbacks and sysfs directory for efficient exposure of the information for the use cases. The new callback will be called for each region when a DAMOS action is gonna tried to be applied to it. The sysfs directory will be called 'tried_regions' and placed under each scheme sysfs directory. Users can write a special keyworkd, 'update_schemes_regions', to the 'state' file of a kdamond sysfs directory. Then, DAMON sysfs interface will fill the directory with the information of regions that corresponding scheme action was tried to be applied for next one aggregation interval. Patches Sequence ---------------- The first one (patch 1) implements the callback for the kernel space users. Following two patches (patches 2 and 3) implements sysfs directories for the information and its sub directories. Two patches (patches 4 and 5) for implementing the special keywords for filling the data to and cleaning up the directories follow. Patch 6 adds a selftest for the new sysfs directory. Finally, two patches (patches 7 and 8) document the new feature in the administrator guide and the ABI document. This patch (of 8): Getting DAMON monitoring results of only specific access pattern (e.g., getting address ranges of memory that not accessed at all for two minutes) can be useful for efficient monitoring of the system. The information can also be helpful for deep level investigation of DAMON-based operation schemes. For that, users need to record (in case of the user space users) or iterate (in case of the kernel space users) full monitoring results and filter it out for the specific access pattern. In case of the DAMOS investigation, users will even need to simulate DAMOS' quota and prioritization mechanisms. It's inefficient and complex. Add a new DAMON callback that will be called before each scheme is applied to each region. DAMON kernel API users will be able to do the query-like monitoring results collection, or DAMOS investigation in an efficient and simple way using it. Commits for providing the capability to the user space users will follow. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb: convert move_hugetlb_state() to foliosSidhartha Kumar3-15/+22
Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to take in folios. [[email protected]: fix CONFIG_HUGETLB_PAGE=n build] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugeltb_cgroup: convert hugetlb_cgroup_commit_charge*() to foliosSidhartha Kumar1-6/+10
Convert hugetlb_cgroup_commit_charge*() to internally use folios to clean up the code after __set_hugetlb_cgroup() was changed to take a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to foliosSidhartha Kumar3-25/+27
Continue to use a folio inside free_huge_page() by converting hugetlb_cgroup_uncharge_page*() to folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb: convert free_huge_page to foliosSidhartha Kumar1-13/+14
Use folios inside free_huge_page(), this is in preparation for converting hugetlb_cgroup_uncharge_page() to take in a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb: convert isolate_or_dissolve_huge_page to foliosSidhartha Kumar1-7/+6
Removes a call to compound_head() by using a folio when operating on the head page of a hugetlb compound page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate to foliosSidhartha Kumar3-10/+8
Cleans up intermediate page to folio conversion code in hugetlb_cgroup_migrate() by changing its arguments from pages to folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to foliosSidhartha Kumar3-25/+31
Allows __prep_new_huge_page() to operate on a folio by converting set_hugetlb_cgroup*() to take in a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to foliosSidhartha Kumar3-26/+31
Introduce folios in __remove_hugetlb_page() by converting hugetlb_cgroup_from_page() to use folios. Also gets rid of unsed hugetlb_cgroup_from_page_resv() function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/hugetlb_cgroup: convert __set_hugetlb_cgroup() to foliosSidhartha Kumar2-9/+9
Patch series "convert hugetlb_cgroup helper functions to folios", v2. This patch series continues the conversion of hugetlb code from being managed in pages to folios by converting many of the hugetlb_cgroup helper functions to use folios. This allows the core hugetlb functions to pass in a folio to these helper functions. This patch (of 9); Change __set_hugetlb_cgroup() to use folios so it is explicit that the function operates on a head page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Bui Quang Minh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mina Almasry <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mempool: do not use ksize() for poisoningKees Cook1-6/+12
Nothing appears to be using ksize() within the kmalloc-backed mempools except the mempool poisoning logic. Use the actual pool size instead of the ksize() to avoid needing any special handling of the memory as needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE. [[email protected]: for slab mempools pool_data is not object size] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marco Elver <[email protected]> Cc: Vincenzo Frascino <[email protected]> Reported-by: Anders Roxell <[email protected]> Link: https://lore.kernel.org/all/20221031105514.GB69385@mutt/ Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30maple_tree: mte_set_full() and mte_clear_full() clang-analyzer clean upLiam Howlett1-4/+9
mte_set_full() and mte_clear_full() were incorrectly setting a pointer to a value without returning a result. Fix this by returning the modified pointer to be use as necessary. Also add a third function to return if the bit is set or not. Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Suggested-by: Lukas Bulwahn <[email protected]> Suggested-by: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: vmscan: split khugepaged stats from direct reclaim statsJohannes Weiner7-10/+53
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. However, khugepaged currently distorts the picture: as a kernel thread it doesn't impose allocation latencies on userspace, and it explicitly opts out of kswapd reclaim. Its activity showing up in the direct reclaim stats is misleading. Counting it as kswapd reclaim could also cause confusion when trying to understand actual kswapd behavior. Break out khugepaged from the direct reclaim counters into new pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters. Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS): pgsteal_kswapd 1342185 pgsteal_direct 0 pgsteal_khugepaged 3623 pgscan_kswapd 1345025 pgscan_direct 0 pgscan_khugepaged 3623 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Eric Bergen <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30Docs/admin-guide/mm/damon/usage: fix wrong usage example of init_regions fileSeongJae Park1-5/+6
DAMON debugfs interface assumes the users will write all inputs at once. However, redirecting a string of multiple lines sometimes end up writing line by line. Therefore, the example usage of 'init_regions' file, which writes input as a string of multiple lines can fail. Fix it to use a single line string instead. Also update the description of the usage to not assume users will write inputs in multiple lines. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Vinicius Petrucci <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30Docs/admin-guide/mm/damon/usage: describe the rules of sysfs region directoriesSeongJae Park1-0/+3
Patch series "Docs/admin-buide/mm/damon/usage: minor fixes". DAMON usage document contains an unclear description and a wrong usage example. This patchset fixes the two minor problems. This patch (of 2): Target region directories of DAMON sysfs interface should contain no overlap and sorted by the address, but not clearly documented. Actually, a user had an issue[1] due to the poor documentation. Add clear description of it on the usage document. [1] https://lore.kernel.org/damon/CAEZ6=UNUcH2BvJj++OrT=XQLdkidU79wmCO=tantSOB36pPNTg@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Reported-by: Vinicius Petrucci <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: hugetlb_vmemmap: remove redundant list_del()Muchun Song1-3/+1
The ->lru field will be assigned to a new value in __free_page(). So it is unnecessary to delete it from the @list. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm, hwpoison: when copy-on-write hits poison, take page offlineTony Luck2-2/+8
Cannot call memory_failure() directly from the fault handler because mmap_lock (and others) are held. It is important, but not urgent, to mark the source page as h/w poisoned and unmap it from other tasks. Use memory_failure_queue() to request a call to memory_failure() for the page with the error. Also provide a stub version for CONFIG_MEMORY_FAILURE=n Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Shuai Xue <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm, hwpoison: try to recover from copy-on write faultsTony Luck2-10/+46
Patch series "Copy-on-write poison recovery", v3. Part 1 deals with the process that triggered the copy on write fault with a store to a shared read-only page. That process is send a SIGBUS with the usual machine check decoration to specify the virtual address of the lost page, together with the scope. Part 2 sets up to asynchronously take the page with the uncorrected error offline to prevent additional machine check faults. H/t to Miaohe Lin <[email protected]> and Shuai Xue <[email protected]> for pointing me to the existing function to queue a call to memory_failure(). On x86 there is some duplicate reporting (because the error is also signalled by the memory controller as well as by the core that triggered the machine check). Console logs look like this: This patch (of 2): If the kernel is copying a page as the result of a copy-on-write fault and runs into an uncorrectable error, Linux will crash because it does not have recovery code for this case where poison is consumed by the kernel. It is easy to set up a test case. Just inject an error into a private page, fork(2), and have the child process write to the page. I wrapped that neatly into a test at: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git just enable ACPI error injection and run: # ./einj_mem-uc -f copy-on-write Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() on architectures where that is available (currently x86 and powerpc). When an error is detected during the page copy, return VM_FAULT_HWPOISON to caller of wp_page_copy(). This propagates up the call stack. Both x86 and powerpc have code in their fault handler to deal with this code by sending a SIGBUS to the application. Note that this patch avoids a system crash and signals the process that triggered the copy-on-write action. It does not take any action for the memory error that is still in the shared page. To handle that a call to memory_failure() is needed. But this cannot be done from wp_page_copy() because it holds mmap_lock(). Perhaps the architecture fault handlers can deal with this loose end in a subsequent patch? On Intel/x86 this loose end will often be handled automatically because the memory controller provides an additional notification of the h/w poison in memory, the handler for this will call memory_failure(). This isn't a 100% solution. If there are multiple errors, not all may be logged in this way. [[email protected]: add call to kmsan_unpoison_memory(), per Miaohe Lin] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Tested-by: Shuai Xue <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30percpu_counter: add percpu_counter_sum_all interfaceShakeel Butt3-6/+34
The percpu_counter is used for scenarios where performance is more important than the accuracy. For percpu_counter users, who want more accurate information in their slowpath, percpu_counter_sum is provided which traverses all the online CPUs to accumulate the data. The reason it only needs to traverse online CPUs is because percpu_counter does implement CPU offline callback which syncs the local data of the offlined CPU. However there is a small race window between the online CPUs traversal of percpu_counter_sum and the CPU offline callback. The offline callback has to traverse all the percpu_counters on the system to flush the CPU local data which can be a lot. During that time, the CPU which is going offline has already been published as offline to all the readers. So, as the offline callback is running, percpu_counter_sum can be called for one counter which has some state on the CPU going offline. Since percpu_counter_sum only traverses online CPUs, it will skip that specific CPU and the offline callback might not have flushed the state for that specific percpu_counter on that offlined CPU. Normally this is not an issue because percpu_counter users can deal with some inaccuracy for small time window. However a new user i.e. mm_struct on the cleanup path wants to check the exact state of the percpu_counter through check_mm(). For such users, this patch introduces percpu_counter_sum_all() which traverses all possible CPUs and it is used in fork.c:check_mm() to avoid the potential race. This issue is exposed by the later patch "mm: convert mm's rss stats into percpu_counter". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reported-by: Marek Szyprowski <[email protected]> Tested-by: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: convert mm's rss stats into percpu_counterShakeel Butt8-107/+40
Currently mm_struct maintains rss_stats which are updated on page fault and the unmapping codepaths. For page fault codepath the updates are cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. The reason for caching is performance for multithreaded applications otherwise the rss_stats updates may become hotspot for such applications. However this optimization comes with the cost of error margin in the rss stats. The rss_stats for applications with large number of threads can be very skewed. At worst the error margin is (nr_threads * 64) and we have a lot of applications with 100s of threads, so the error margin can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. Recently we started seeing the unbounded errors for rss_stats for specific applications which use TCP rx0cp. It seems like vm_insert_pages() codepath does not sync rss_stats at all. This patch converts the rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However this conversion enable us to get the accurate stats for situations where accuracy is more important than the cpu cost. This patch does not make such tradeoffs - we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are either procfs interface, oom_killer and memory reclaim which I think are not performance critical and should be ok with slow read. However I think we can make that change in a separate patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30selftests/damon: add tests for DAMON_LRU_SORT's enabled parameterSeongJae Park2-1/+42
Add simple test cases for DAMON_LRU_SORT's 'enabled' parameter. Those tests are focusing on the synchronous behavior of DAMON_RECLAIM enabling and disabling. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/lru_sort: enable and disable synchronouslySeongJae Park1-29/+22
Writing a value to DAMON_RECLAIM's 'enabled' parameter turns on or off DAMON in an ansychronous way. This means the parameter cannot be used to read the current status of DAMON_RECLAIM. 'kdamond_pid' parameter should be used instead for the purpose. The documentation is easy to be read as it works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter works synchronously, rather than updating the document. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30selftests/damon: add tests for DAMON_RECLAIM's enabled parameterSeongJae Park2-0/+43
Add simple test cases for DAMON_RECLAIM's 'enabled' parameter. Those tests are focusing on the synchronous behavior of DAMON_RECLAIM enabling and disabling. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/reclaim: enable and disable synchronouslySeongJae Park1-30/+23
Patch series "mm/damon/reclaim,lru_sort: enable/disable synchronously". Writing a value to DAMON_RECLAIM and DAMON_LRU_SORT's 'enabled' parameters turns on or off DAMON in an ansychronous way. This means the parameter cannot be used to read the current status of them. 'kdamond_pid' parameter should be used instead for the purpose. The documentation is easy to be read as it works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter works synchronously, rather than updating the document. The first and second patches changes the behavior of the 'enabled' parameter for DAMON_RECLAIM and adds a selftest for the changed behavior, respectively. Following two patches make the same changes for DAMON_LRU_SORT. This patch (of 4): Writing a value to DAMON_RECLAIM's 'enabled' parameter turns on or off DAMON in an ansychronous way. This means the parameter cannot be used to read the current status of DAMON_RECLAIM. 'kdamond_pid' parameter should be used instead for the purpose. The documentation is easy to be read as it works in a synchronous way, so it is a little bit confusing. It also makes the user space tooling dirty. There's no real reason to have the asynchronous behavior, though. Simply make the parameter works synchronously, rather than updating the document. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/{reclaim,lru_sort}: remove unnecessarily included headersSeongJae Park2-4/+0
Some headers that 'reclaim.c' and 'lru_sort.c' are including are unnecessary now owing to previous cleanups and refactorings. Remove those. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/modules: deduplicate init steps for DAMON context setupSeongJae Park5-30/+53
DAMON_RECLAIM and DAMON_LRU_SORT has duplicated code for DAMON context and target initializations. Deduplicate the part by implementing a function for the initialization in 'modules-common.c' and using it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: split out schemes directory implementation to separate fileSeongJae Park4-1065/+1091
DAMON sysfs interface for 'schemes' directory is implemented using about one thousand lines of code. It has no strong dependency with other parts of its file, so split it out to another file for better code management. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: split out kdamond-independent schemes stats update logic ↵SeongJae Park1-15/+22
into a new function 'damon_sysfs_schemes_update_stats()' is coupled with both damon_sysfs_kdamond and damon_sysfs_schemes. It's a wide range of types dependency. It makes splitting the logics a little bit distracting. Split the function so that each function is coupled with smaller range of types. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: move unsigned long range directory to common moduleSeongJae Park3-100/+109
The implementation of unsigned long type range directories can be reused by multiple DAMON sysfs directories including those for DAMON-based Operation Schemes and the range of number of monitoring regions. Move the code into the files for DAMON sysfs common logics. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: move sysfs_lock to common moduleSeongJae Park4-4/+24
DAMON sysfs interface is implemented in a single file, sysfs.c, which has about 2,800 lines of code. As the interface is hierarchical and some of the code can be reused by different hierarchies, it would make more sense to split out the implementation into common parts and different parts in multiple files. As the beginning of the work, create files for common code and move the global mutex for directories modifications protection into the new file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: remove parameters of damon_sysfs_region_alloc()SeongJae Park1-11/+3
'damon_sysfs_region_alloc()' is always called with zero-filled 'struct damon_addr_range', because the start and end addresses should set by users. Remove unnecessary parameters of the function and simplify the body by using 'kzalloc()'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: use damon_addr_range for region's start and end valuesSeongJae Park1-14/+11
DAMON has a struct for each address range but DAMON sysfs interface is using the low type (unsigned long) for storing the start and end addresses of regions. Use the dedicated struct for better type safety. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/core: split out scheme quota adjustment logic into a new functionSeongJae Park1-43/+48
DAMOS quota adjustment logic in 'kdamond_apply_schemes()', has some amount of code, and the logic is not so straightforward. Split it out to a new function for better readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/core: split out scheme stat update logic into a new functionSeongJae Park1-5/+11
The function for applying a given DAMON scheme action to a given DAMON region, 'damos_apply_scheme()' is not quite short. Make it better to read by splitting out the stat update logic into a new function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/core: split damos application logic into a new functionSeongJae Park1-34/+39
The DAMOS action applying function, 'damon_do_apply_schemes()', is still long and not easy to read. Split out the code for applying a single action to a single region into a new function for better readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/core: split out DAMOS-charged region skip logic into a new functionSeongJae Park1-31/+65
Patch series "mm/damon: cleanup and refactoring code", v2. This patchset cleans up and refactors a range of DAMON code including the core, DAMON sysfs interface, and DAMON modules, for better readability and convenient future feature implementations. In detail, this patchset splits unnecessarily long and complex functions in core into smaller functions (patches 1-4). Then, it cleans up the DAMON sysfs interface by using more type-safe code (patch 5) and removing unnecessary function parameters (patch 6). Further, it refactor the code by distributing the code into multiple files (patches 7-10). Last two patches (patches 11 and 12) deduplicates and remove unnecessary header inclusion in DAMON modules (reclaim and lru_sort). This patch (of 12): The DAMOS action applying function, 'damon_do_apply_schemes()', is quite long and not so simple. Split out the already quota-charged region skip code, which is not a small amount of simple code, into a new function with some comments for better readability. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton66-35938/+36531
2022-11-30revert "kbuild: fix -Wimplicit-function-declaration in ↵Andrew Morton1-2/+0
license_is_gpl_compatible" It causes build failures with unusual CC/HOSTCC combinations. Quoting https://lkml.kernel.org/r/[email protected]: HOSTCC scripts/mod/modpost.o - due to target missing In file included from include/linux/string.h:5, from scripts/mod/../../include/linux/license.h:5, from scripts/mod/modpost.c:24: include/linux/compiler.h:246:10: fatal error: asm/rwonce.h: No such file or directory 246 | #include <asm/rwonce.h> | ^~~~~~~~~~~~~~ compilation terminated. ... The problem is that HOSTCC is not necessarily the same compiler or even architecture as CC and pulling in <linux/compiler.h> or <asm/rwonce.h> files indirectly isn't a good idea then. My toolchain is providing HOSTCC = gcc (MacPorts) and CC = arm-linux-gnueabihf (built from gcc source) and all running on Darwin. If I change the include to <string.h> I can then "HOSTCC scripts/mod/modpost.c" but then it fails for "CC kernel/module/main.c" not finding <string.h>: CC kernel/module/main.o - due to target missing In file included from kernel/module/main.c:43:0: ./include/linux/license.h:5:20: fatal error: string.h: No such file or directory #include <string.h> ^ compilation terminated. Reported-by: "H. Nikolaus Schaller" <[email protected]> Cc: Sam James <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabledLee Jones1-0/+1
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few over the current threshold. This can mainly be seen on 32-bit architectures where the present limit (when !GCC) is a lowly 1024-Bytes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lee Jones <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Harry Wentland <[email protected]> Cc: Leo Li <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Maxime Ripard <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Cc: Rodrigo Siqueira <[email protected]> Cc: Thomas Zimmermann <[email protected]> Cc: Tom Rix <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frameLee Jones1-0/+7
Patch series "Fix a bunch of allmodconfig errors", v2. Since b339ec9c229aa ("kbuild: Only default to -Werror if COMPILE_TEST") WERROR now defaults to COMPILE_TEST meaning that it's enabled for allmodconfig builds. This leads to some interesting build failures when using Clang, each resolved in this set. With this set applied, I am able to obtain a successful allmodconfig Arm build. This patch (of 2): calculate_bandwidth() is presently broken on all !(X86_64 || SPARC64 || ARM64) architectures built with Clang (all released versions), whereby the stack frame gets blown up to well over 5k. This would cause an immediate kernel panic on most architectures. We'll revert this when the following bug report has been resolved: https://github.com/llvm/llvm-project/issues/41896. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lee Jones <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Harry Wentland <[email protected]> Cc: Lee Jones <[email protected]> Cc: Leo Li <[email protected]> Cc: Maarten Lankhorst <[email protected]> Cc: Maxime Ripard <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Cc: Rodrigo Siqueira <[email protected]> Cc: Thomas Zimmermann <[email protected]> Cc: Tom Rix <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/khugepaged: invoke MMU notifiers in shmem/file collapse pathsJann Horn1-0/+5
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/khugepaged: fix GUP-fast interaction by sending IPIJann Horn3-3/+7
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/khugepaged: take the right locks for page table retractionJann Horn1-4/+51
pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Yang Shi <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: migrate: fix THP's mapcount on isolationGavin Shan1-11/+11
The issue is reported when removing memory through virtio_mem device. The transparent huge page, experienced copy-on-write fault, is wrongly regarded as pinned. The transparent huge page is escaped from being isolated in isolate_migratepages_block(). The transparent huge page can't be migrated and the corresponding memory block can't be put into offline state. Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into offline state. Besides, The page's refcount is increased a bit earlier to avoid the page is released when the check is executed. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations") Signed-off-by: Gavin Shan <[email protected]> Reported-by: Zhenyu Zhang <[email protected]> Tested-by: Zhenyu Zhang <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: William Kucharski <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> [5.7+] Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: introduce arch_has_hw_nonleaf_pmd_young()Juergen Gross3-5/+24
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/[email protected] Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <[email protected]> Reported-by: Sander Eikelenboom <[email protected]> Acked-by: Yu Zhao <[email protected]> Tested-by: Sander Eikelenboom <[email protected]> Acked-by: David Hildenbrand <[email protected]> [core changes] Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: add dummy pmd_young() for architectures not having itJuergen Gross7-0/+13
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a fallback. This is required for the later patch "mm: introduce arch_has_hw_nonleaf_pmd_young()". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]> Acked-by: Yu Zhao <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Sander Eikelenboom <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in ↵SeongJae Park1-2/+44
damon_sysfs_set_schemes() Commit da87878010e5 ("mm/damon/sysfs: support online inputs update") made 'damon_sysfs_set_schemes()' to be called for running DAMON context, which could have schemes. In the case, DAMON sysfs interface is supposed to update, remove, or add schemes to reflect the sysfs files. However, the code is assuming the DAMON context wouldn't have schemes at all, and therefore creates and adds new schemes. As a result, the code doesn't work as intended for online schemes tuning and could have more than expected memory footprint. The schemes are all in the DAMON context, so it doesn't leak the memory, though. Remove the wrong asssumption (the DAMON context wouldn't have schemes) in 'damon_sysfs_set_schemes()' to fix the bug. Link: https://lkml.kernel.org/r/[email protected] Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update") Signed-off-by: SeongJae Park <[email protected]> Cc: <[email protected]> [5.19+] Signed-off-by: Andrew Morton <[email protected]>
2022-11-30tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"Tiezhu Yang1-2/+2
The latest version of grep claims the egrep is now obsolete so the build now contains warnings that look like: egrep: warning: egrep is obsolescent; using grep -E fix this up by moving the related file to use "grep -E" instead. sed -i "s/egrep/grep -E/g" `grep egrep -rwl tools/vm` Here are the steps to install the latest grep: wget http://ftp.gnu.org/gnu/grep/grep-3.8.tar.gz tar xf grep-3.8.tar.gz cd grep-3.8 && ./configure && make sudo make install export PATH=/usr/local/bin:$PATH Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tiezhu Yang <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()ZhangPeng1-0/+7
Syzbot reported a null-ptr-deref bug: NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 1 PID: 3603 Comm: segctord Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0 fs/nilfs2/alloc.c:608 Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7 RSP: 0018:ffffc90003dff830 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010 RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158 R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0 Call Trace: <TASK> nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline] nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193 nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236 nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940 nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline] nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline] nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018 nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline] nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045 nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline] nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> ... If DAT metadata file is corrupted on disk, there is a case where req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during a b-tree operation that cascadingly updates ancestor nodes of the b-tree, because nilfs_dat_commit_alloc() for a lower level block can initialize the blocknr on the same DAT entry between nilfs_dat_prepare_end() and nilfs_dat_commit_end(). If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free() without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and causes the NULL pointer dereference above in nilfs_palloc_commit_free_entry() function, which leads to a crash. Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free(). This also calls nilfs_error() in that case to notify that there is a fatal flaw in the filesystem metadata and prevent further operations. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Tested-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processingMike Kravetz3-12/+19
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races. This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG. Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas(). This is used to indicate the 'final' unmapping of a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and the vm_lock is not deleted. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <[email protected]> Reported-by: Wei Chen <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>