path: root/mm
Age | Commit message | Author | Files | Lines
2024-02-23 | mm/damon/reclaim: implement user-feedback driven quota auto-tuning | SeongJae Park | 1 | -0/+28
DAMOS supports user-feedback driven quota auto-tuning, but only the DAMON sysfs interface is using it. Add support for the feature to DAMON_RECLAIM by adding one more input parameter, namely 'quota_autotune_feedback', for providing the user feedback to DAMON_RECLAIM. It assumes the target value of the feedback is 10,000. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
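For illustration, a minimal sketch of how such a DAMON_RECLAIM module parameter could be wired up (the parameter name is taken from the description above; the plumbing that copies the value into the quota goal is assumed and not shown):

    /*
     * Sketch: user feedback for the quota auto-tuning, exposed as a
     * DAMON_RECLAIM module parameter.  A value of 10,000 is assumed to
     * mean "the goal is fulfilled".
     */
    static unsigned long quota_autotune_feedback __read_mostly;
    module_param(quota_autotune_feedback, ulong, 0600);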
2024-02-23 | mm/damon/sysfs-schemes: support PSI-based quota auto-tune | SeongJae Park | 1 | -2/+40
Extend DAMON sysfs interface to support the PSI-based quota auto-tuning by adding a new file, 'target_metric' under the quota goal directory. Old users don't get any behavioral changes since the default value of the metric is 'user input'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/core: implement PSI metric DAMOS quota goal | SeongJae Park | 1 | -0/+25
Extend the DAMOS quota goal metric with system-wide memory pressure stall time. Specifically, the system-level 'some' PSI for memory is used. The target value can be set in microseconds. DAMOS measures the increase of the PSI metric over the last quota_reset_interval and uses the ratio of that increase versus the user-specified target PSI value as the score for the auto-tuning feedback loop. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
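Conceptually, the score is the ratio of the PSI increase observed in the last interval to the user-set target; a rough sketch of that computation (helper and field names here are assumed, not the exact DAMON code):

    /* Sketch: compare the recent 'some' memory PSI increase with the target. */
    static unsigned long psi_goal_score(struct damos_quota_goal *goal)
    {
            u64 now_us = read_some_mem_psi_us();        /* assumed helper */
            u64 grew_us = now_us - goal->last_psi_us;   /* increase in the last interval */

            goal->last_psi_us = now_us;
            /* 10,000 means the observed stall equals the target. */
            return div64_u64(grew_us * 10000, goal->target_value);
    }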
2024-02-23 | mm/damon/core: support multiple metrics for quota goal | SeongJae Park | 2 | -5/+22
DAMOS quota auto-tuning asks users to assess the currently tuned quota and provide the feedback in a manual and repeated way. It allows users to generate the feedback from a source that the kernel cannot access, and writing a script or a function for the manual and repeated feeding is not a big deal. However, extra work is still extra work, and it could be more efficient if DAMOS could do the fetching itself, especially in the DAMON sysfs interface use case, since it can avoid the context switches between user space and kernel space, though the overhead would be only trivial in most cases. Also, in many cases the feedback could be made from kernel-accessible sources, such as PSI, CPU usage, etc. Make the quota goal support multiple types of metrics, including such kernel-accessible ones. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/core: let goal specified with only target and current values | SeongJae Park | 2 | -12/+7
The DAMOS quota auto-tuning feature lets users set the goal by providing a function that returns the current score of the tuned quota. It allows flexible goal setup, but only a simple user-set quota is currently being used. As a result, the only user of DAMOS quota auto-tuning passes the score value through an awkward void-pointer-casting callback. Simplify the interface and the user code by letting the user directly set the target and the current value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
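The simplified interface boils down to two numbers per goal; a hedged sketch of the shape described (field and function names are assumed):

    /* Sketch: a goal is now just a target and the latest measured value. */
    struct damos_quota_goal {
            unsigned long target_value;     /* value the user wants to reach */
            unsigned long current_value;    /* latest measurement, user-provided */
            struct list_head list;
    };

    /* Feedback score: 10,000 means the target has been reached. */
    static unsigned long damos_goal_score(struct damos_quota_goal *goal)
    {
            if (!goal->target_value)
                    return 0;
            return goal->current_value * 10000 / goal->target_value;
    }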
2024-02-23 | mm/damon/core: remove ->goal field of damos_quota | SeongJae Park | 1 | -12/+5
The DAMOS quota auto-tuning feature supports a static single goal and dynamic multiple goals via the DAMON kernel API, specifically via the ->goal and ->goals fields of the damos_quota struct, respectively. All in-tree DAMOS kernel API users are using only the dynamic multiple goals now. Remove the unused static single goal interface. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/sysfs: use only quota->goals | SeongJae Park | 3 | -19/+35
The DAMON sysfs interface implements multiple quota auto-tuning goals on its own level, because the DAMOS core logic used to support only a single goal. Now the core logic supports multiple goals on its level. Update the DAMON sysfs interface to reuse the core logic and drop the unnecessary duplicated multiple-goals implementation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/core: add multiple goals per damos_quota and helpers for those | SeongJae Park | 1 | -7/+71
The feedback-driven DAMOS quota auto-tuning feature allows only a single goal to DAMON kernel API users. The API users could implement multiple goals for end-users on their own level, and that's what the DAMON sysfs interface is doing. More DAMON kernel API users such as DAMON_RECLAIM would need to do similar work. To reduce unnecessary duplicated effort in the future, support multiple goals from the DAMOS core layer. To keep the change minimal and non-destructive, keep the old single-goal setup interface and add a multiple-goals setup. The single goal will be treated as one of the multiple goals, so old API users are not required to make any change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
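With multiple goals, the tuning conceptually follows the goal that is furthest from being met; a rough sketch of such a helper (names assumed, building on the single-goal score sketch earlier in this log):

    /* Sketch: feed the auto-tuning with the worst (highest) score of all goals. */
    static unsigned long damos_quota_goals_score(struct damos_quota *quota)
    {
            struct damos_quota_goal *goal;
            unsigned long score, max_score = 0;

            list_for_each_entry(goal, &quota->goals, list) {
                    score = damos_goal_score(goal);
                    max_score = max(max_score, score);
            }
            return max_score;
    }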
2024-02-23 | mm/damon/core: split out quota goal related fields to a struct | SeongJae Park | 2 | -11/+12
'struct damos_quota' is not small now. Split out fields for quota goal to a separate struct for easier reading. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/sysfs: implement a kdamond command for updating schemes' effective quotas | SeongJae Park | 3 | -0/+56
Implement yet another kdamond 'state' file input command, namely 'update_schemes_effective_quotas'. If it is written, the 'effective_bytes' files of the kdamond will be updated to provide the current effective size quota of each scheme in bytes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/sysfs-schemes: implement quota effective_bytes file | SeongJae Park | 1 | -0/+14
The DAMON sysfs interface allows users to set two types of quotas, namely a time quota and a size quota. DAMOS converts the time quota to a size quota and uses the smaller of the two resulting size quotas. The resulting effective size quota can be helpful for debugging and analysis, but it is not exposed to the user. The recently added feedback-driven quota auto-tuning makes it even more mysterious. Implement a read-only, initially empty DAMON sysfs interface file, namely 'effective_bytes', under the quota goal DAMON sysfs directory. It will be extended to expose the effective quota to the end user. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/damon/core: set damos_quota->esz as public field and document | SeongJae Park | 1 | -4/+4
Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself". The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning patchset[1] which has merged since commit 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") made the mechanism and the policy separated. That is, users can set a part of DAMOS control policies without a deep understanding of the mechanism but just their demands such as SLA. However, users are still required to do some additional work of manually collecting their target metric and feeding it to DAMOS. In the case of end-users who use DAMON sysfs interface, the context switches between user-space and kernel-space could also make it inefficient. The overhead is supposed to be only trivial in common cases, though. Meanwhile, in simple use cases, the target metric could be common system metrics that the kernel can efficiently self-retrieve, such as memory pressure stall time (PSI). Extend DAMOS quota auto-tuning to support multiple types of metrics including the DAMOS self-retrievable ones, and add support for memory pressure stall time metric. Different types of metrics can be supported in future. The auto-tuning capability is currently supported for only users of DAMOS kernel API and DAMON sysfs interface. Extend the support to DAMON_RECLAIM. Patches Sequence ================ First five patches are for helping debugging and fine-tuning existing quota control features. The first one (patch 1) exposes the effective quota that is made with given user inputs to DAMOS kernel API users and kernel-doc documents. Following four patches implement (patches 1, 2 and 3) and document (patches 4 and 5) a new DAMON sysfs file that exposes the value. Following six patches cleanup and simplify the existing DAMOS quota auto-tuning code by improving layout of comments and data structures (patches 6 and 7), supporting common use cases, namely multiple goals (patches 8, 9 and 10), and simplifying the interface (patch 11). Then six patches for the main purpose of this patchset follow. The first three changes extend the core logic for various target metrics (patch 12), implement memory pressure stall time-based target metric support (patch 13), and update DAMON sysfs interface to support the new target metric (patch 14). Then, documentation updates for the features on design (patch 15), ABI (patch 16), and usage (patch 17) follow. Last three patches add auto-tuning support on DAMON_RECLAIM. The patches implement DAMON_RECLAIM parameters for user-feedback driven quota auto-tuning (patch 18), memory pressure stall time-driven quota self-tuning (patch 19), and finally update the DAMON_RECLAIM usage document for the new parameters (patch 20). [1] https://lore.kernel.org/all/[email protected]/ This patch (of 20): DAMOS allow users to specify the quota as they want in multiple ways including time quota, size quota, and feedback-based auto-tuning. DAMOS makes one effective quota out of the inputs and use it at the end. Knowing the current effective quota helps understanding DAMOS' internal mechanism and fine-tuning quotas. DAMON kernel API users can get the information from ->esz field of damos_quota struct, but the field is marked as private purpose, and not kernel-doc documented. Make it public and document. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check | Lance Yang | 1 | -6/+12
khugepaged scans the entire address space in the background for each given mm, looking for opportunities to merge sequences of basic pages into huge pages. However, when an mm is inserted into the mm_slots list and the MMF_DISABLE_THP flag is set later, this scanning process becomes unnecessary for that mm and can be skipped to avoid redundant operations, especially in scenarios with a large address space. On an Intel Core i5 CPU, the time taken by khugepaged to scan the address space of a process that had the MMF_DISABLE_THP flag set after being added to the mm_slots list is as follows (shorter is better):

VMA Count |  Old |  New | Change
---------------------------------------
       50 | 23us |  9us | -60.9%
      100 | 32us |  9us | -71.9%
      200 | 44us |  9us | -79.5%
      400 | 75us |  9us | -88.0%
      800 | 98us |  9us | -90.8%

Once the count of VMAs for the process exceeds pages_to_scan, khugepaged needs to wait for scan_sleep_millisecs ms before scanning the next process. IMO, unnecessary scans could actually be skipped with a very inexpensive mm->flags check in this case. This commit introduces a check before each scanning process to test the MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning process is bypassed, thereby improving the efficiency of khugepaged. This optimization is not a correctness issue but rather an enhancement to save expensive checks on each VMA when userspace cannot prctl itself before spawning into the new process. On some servers within our company, we deploy a daemon responsible for monitoring and updating local applications. Some applications prefer not to use THP, so the daemon calls prctl to disable THP before fork/exec. Conversely, for other applications, the daemon calls prctl to enable THP before fork/exec. Ideally, the daemon should invoke prctl after the fork, but its current implementation follows the described approach. In the Go standard library, there is no direct encapsulation of the fork system call; instead, fork and execve are combined into one through syscall.ForkExec. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lance Yang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Peter Xu <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
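The check itself is tiny; a sketch close to what the description implies (helper name assumed):

    /* Sketch: skip mms whose owner disabled THP after the mm was added. */
    static bool hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
    {
            return hpage_collapse_test_exit(mm) ||
                   test_bit(MMF_DISABLE_THP, &mm->flags);
    }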
2024-02-23 | mm: vmalloc: refactor vmalloc_dump_obj() function | Uladzislau Rezki (Sony) | 1 | -16/+17
This patch simplifies the function in question by removing the extra stack "objp" variable and returning to an early-exit approach if spin_trylock() fails or the VA was not found. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: improve description of vmap node layer | Uladzislau Rezki (Sony) | 1 | -14/+46
This patch adds extra explanation of recently added vmap node layer based on community feedback. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: add a shrinker to drain vmap pools | Uladzislau Rezki (Sony) | 1 | -0/+39
The added shrinker is used to return back current cached VAs into a global vmap space, when a system enters into a low memory mode. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
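A hedged sketch of how such a drain shrinker could be registered with the current shrinker API (the callback bodies and helper names are assumptions, not the actual vmalloc.c code):

    /* Sketch: report cached VAs in the per-node pools and drain them on demand. */
    static unsigned long vmap_node_shrink_count(struct shrinker *sh,
                                                struct shrink_control *sc)
    {
            unsigned long count = nr_cached_vas_in_pools();   /* assumed helper */

            return count ? count : SHRINK_EMPTY;
    }

    static unsigned long vmap_node_shrink_scan(struct shrinker *sh,
                                               struct shrink_control *sc)
    {
            /* Return cached VAs back to the global vmap space. */
            return drain_va_pools();                          /* assumed helper */
    }

    static void vmap_node_shrinker_init(void)
    {
            struct shrinker *s = shrinker_alloc(0, "vmap-node");

            if (!s)
                    return;
            s->count_objects = vmap_node_shrink_count;
            s->scan_objects = vmap_node_shrink_scan;
            shrinker_register(s);
    }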
2024-02-23 | mm: vmalloc: set nr_nodes based on CPUs in a system | Uladzislau Rezki (Sony) | 1 | -6/+23
A number of nodes which are used in the alloc/free paths is set based on num_possible_cpus() in a system. Please note a high limit threshold though is fixed and corresponds to 128 nodes. For 32-bit or single core systems an access to a global vmap heap is not balanced. Such small systems do not suffer from lock contentions due to low number of CPUs. In such case the nr_nodes is equal to 1. Test on AMD Ryzen Threadripper 3970X 32-Core Processor: sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 <default perf> 94.41% 0.89% [kernel] [k] _raw_spin_lock 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath 76.13% 0.28% [kernel] [k] __vmalloc_node_range 72.96% 0.81% [kernel] [k] alloc_vmap_area 56.94% 0.00% [kernel] [k] __get_vm_area_node 41.95% 0.00% [kernel] [k] vmalloc 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test 35.17% 0.00% [kernel] [k] ret_from_fork_asm 35.17% 0.00% [kernel] [k] ret_from_fork 35.17% 0.00% [kernel] [k] kthread 35.08% 0.00% [test_vmalloc] [k] test_func 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test 23.53% 0.25% [kernel] [k] vfree.part.0 21.72% 0.00% [kernel] [k] remove_vm_area 20.08% 0.21% [kernel] [k] find_unlink_vmap_area 2.34% 0.61% [kernel] [k] free_vmap_area_noflush <default perf> vs <patch-series perf> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test 63.36% 0.02% [kernel] [k] vmalloc 63.34% 2.64% [kernel] [k] __vmalloc_node_range 30.42% 4.46% [kernel] [k] vfree.part.0 28.98% 2.51% [kernel] [k] __alloc_pages_bulk 27.28% 0.19% [kernel] [k] __get_vm_area_node 26.13% 1.50% [kernel] [k] alloc_vmap_area 21.72% 21.67% [kernel] [k] clear_page_rep 19.51% 2.43% [kernel] [k] _raw_spin_lock 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath 13.40% 2.07% [kernel] [k] free_unref_page 10.62% 0.01% [kernel] [k] remove_vm_area 9.02% 8.73% [kernel] [k] insert_vmap_area 8.94% 0.00% [kernel] [k] ret_from_fork_asm 8.94% 0.00% [kernel] [k] ret_from_fork 8.94% 0.00% [kernel] [k] kthread 8.29% 0.00% [test_vmalloc] [k] test_func 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test 5.30% 4.73% [kernel] [k] purge_vmap_node 4.47% 2.65% [kernel] [k] free_vmap_area_noflush <patch-series perf> confirms that a native_queued_spin_lock_slowpath goes down to 16.51% percent from 93.07%. The throughput is ~12x higher: urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 10m51.271s user 0m0.013s sys 0m0.187s urezki@pc638:~$ urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 0m51.301s user 0m0.015s sys 0m0.040s urezki@pc638:~$ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: support multiple nodes in vmallocinfo | Uladzislau Rezki (Sony) | 1 | -73/+47
Allocated areas are spread among nodes, it implies that the scanning has to be performed individually of each node in order to dump all existing VAs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: support multiple nodes in vread_iter | Uladzislau Rezki (Sony) | 1 | -14/+53
Extend the vread_iter() to be able to perform a sequential reading of VAs which are spread among multiple nodes. So a data read over the /dev/kmem correctly reflects a vmalloc memory layout. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: add a scan area of VA only once | Uladzislau Rezki (Sony) | 1 | -6/+6
Invoke a kmemleak_scan_area() function only for newly allocated objects to add a scan area within that object. There is no reason to add a same scan area(pointer to beginning or inside the object) several times. If a VA is obtained from the cache its scan area has already been associated. Link: https://lkml.kernel.org/r/[email protected] Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock") Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: offload free_vmap_area_lock lock | Uladzislau Rezki (Sony) | 1 | -45/+342
Concurrent access to a global vmap space is a bottle-neck. We can simulate a high contention by running a vmalloc test suite. To address it, introduce an effective vmap node logic. Each node behaves as independent entity. When a node is accessed it serves a request directly(if possible) from its pool. This model has a size based pool for requests, i.e. pools are serialized and populated based on object size and real demand. A maximum object size that pool can handle is set to 256 pages. This technique reduces a pressure on the global vmap lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: remove global purge_vmap_area_root rb-tree | Uladzislau Rezki (Sony) | 1 | -53/+82
Similar to busy VAs, a lazily-freed area is stored in the node it belongs to. Such an approach does not require any global locking primitive; instead, access becomes scalable, which mitigates contention. This patch removes the global purge lock, global purge tree and global purge list. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm/vmalloc: remove vmap_area_list | Baoquan He | 2 | -4/+0
Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile could get the base address of the vmalloc area. Now, vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list. [[email protected]: fix a warning in crash_save_vmcoreinfo_init()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: remove global vmap_area_root rb-tree | Uladzislau Rezki (Sony) | 1 | -68/+174
Store allocated objects in a separate nodes. A va->va_start address is converted into a correct node where it should be placed and resided. An addr_to_node() function is used to do a proper address conversion to determine a node that contains a VA. Such approach balances VAs across nodes as a result an access becomes scalable. Number of nodes in a system depends on number of CPUs. Please note: 1. As of now allocated VAs are bound to a node-0. It means the patch does not give any difference comparing with a current behavior; 2. The global vmap_area_lock, vmap_area_root are removed as there is no need in it anymore. The vmap_area_list is still kept and is _empty_. It is exported for a kexec only; 3. The vmallocinfo and vread() have to be reworked to be able to handle multiple nodes. [[email protected]: mark vmap_init_free_space() with __init tag] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix a wrong value passed to __find_vmap_area()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
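The address-to-node mapping is the core of the scheme; a hedged sketch of the per-node structure and the lookup (field and variable names are assumed; the segment size of 16 pages is described in the cover letter further down this log):

    /* Sketch: each node owns its own busy tree and lock. */
    struct vmap_node {
            spinlock_t busy_lock;
            struct rb_root_cached busy_root;
            struct list_head busy_head;
    };

    static struct vmap_node *vmap_nodes;
    static unsigned long vmap_zone_size;    /* segment size, e.g. 16 pages */
    static unsigned int nr_vmap_nodes;      /* based on num_possible_cpus(), capped at 128 */

    static struct vmap_node *addr_to_node(unsigned long addr)
    {
            return &vmap_nodes[(addr / vmap_zone_size) % nr_vmap_nodes];
    }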
2024-02-23 | mm: vmalloc: move vmap_init_free_space() down in vmalloc.c | Uladzislau Rezki (Sony) | 1 | -41/+41
vmap_init_free_space() is a function that sets up the vmap space and is considered part of the initialization phase. Since the main entry point, vmalloc_init(), has been moved down in vmalloc.c, it makes sense to follow the pattern. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: rename adjust_va_to_fit_type() function | Uladzislau Rezki (Sony) | 1 | -7/+6
This patch renames the adjust_va_to_fit_type() function to va_clip(), which is shorter and more expressive. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Kazuhito Hagio <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: vmalloc: add va_alloc() helper | Uladzislau Rezki (Sony) | 1 | -13/+28
Patch series "Mitigate a vmap lock contention", v3. 1. Motivation - Offload global vmap locks making it scaled to number of CPUS; - If possible and there is an agreement, we can remove the "Per cpu kva allocator" to make the vmap code to be more simple; - There were complaints from XFS folk that a vmalloc might be contented on their workloads. 2. Design(high level overview) We introduce an effective vmap node logic. A node behaves as independent entity to serve an allocation request directly(if possible) from its pool. That way it bypasses a global vmap space that is protected by its own lock. An access to pools are serialized by CPUs. Number of nodes are equal to number of CPUs in a system. Please note the high threshold is bound to 128 nodes. Pools are size segregated and populated based on system demand. The maximum alloc request that can be stored into a segregated storage is 256 pages. The lazily drain path decays a pool by 25% as a first step and as second populates it by fresh freed VAs for reuse instead of returning them into a global space. When a VA is obtained(alloc path), it is stored in separate nodes. A va->va_start address is converted into a correct node where it should be placed and resided. Doing so we balance VAs across the nodes as a result an access becomes scalable. The addr_to_node() function does a proper address conversion to a correct node. A vmap space is divided on segments with fixed size, it is 16 pages. That way any address can be associated with a segment number. Number of segments are equal to num_possible_cpus() but not grater then 128. The numeration starts from 0. See below how it is converted: static inline unsigned int addr_to_node_id(unsigned long addr) { return (addr / zone_size) % nr_nodes; } On a free path, a VA can be easily found by converting its "va_start" address to a certain node it resides. It is moved from "busy" data to "lazy" data structure. Later on, as noted earlier, the lazy kworker decays each node pool and populates it by fresh incoming VAs. Please note, a VA is returned to a node that did an alloc request. 3. 
Test on AMD Ryzen Threadripper 3970X 32-Core Processor sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 <default perf> 94.41% 0.89% [kernel] [k] _raw_spin_lock 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath 76.13% 0.28% [kernel] [k] __vmalloc_node_range 72.96% 0.81% [kernel] [k] alloc_vmap_area 56.94% 0.00% [kernel] [k] __get_vm_area_node 41.95% 0.00% [kernel] [k] vmalloc 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test 35.17% 0.00% [kernel] [k] ret_from_fork_asm 35.17% 0.00% [kernel] [k] ret_from_fork 35.17% 0.00% [kernel] [k] kthread 35.08% 0.00% [test_vmalloc] [k] test_func 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test 23.53% 0.25% [kernel] [k] vfree.part.0 21.72% 0.00% [kernel] [k] remove_vm_area 20.08% 0.21% [kernel] [k] find_unlink_vmap_area 2.34% 0.61% [kernel] [k] free_vmap_area_noflush <default perf> vs <patch-series perf> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test 63.36% 0.02% [kernel] [k] vmalloc 63.34% 2.64% [kernel] [k] __vmalloc_node_range 30.42% 4.46% [kernel] [k] vfree.part.0 28.98% 2.51% [kernel] [k] __alloc_pages_bulk 27.28% 0.19% [kernel] [k] __get_vm_area_node 26.13% 1.50% [kernel] [k] alloc_vmap_area 21.72% 21.67% [kernel] [k] clear_page_rep 19.51% 2.43% [kernel] [k] _raw_spin_lock 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath 13.40% 2.07% [kernel] [k] free_unref_page 10.62% 0.01% [kernel] [k] remove_vm_area 9.02% 8.73% [kernel] [k] insert_vmap_area 8.94% 0.00% [kernel] [k] ret_from_fork_asm 8.94% 0.00% [kernel] [k] ret_from_fork 8.94% 0.00% [kernel] [k] kthread 8.29% 0.00% [test_vmalloc] [k] test_func 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test 5.30% 4.73% [kernel] [k] purge_vmap_node 4.47% 2.65% [kernel] [k] free_vmap_area_noflush <patch-series perf> confirms that a native_queued_spin_lock_slowpath goes down to 16.51% percent from 93.07%. The throughput is ~12x higher: urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 10m51.271s user 0m0.013s sys 0m0.187s urezki@pc638:~$ urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64 Run the test with following parameters: run_test_mask=7 nr_threads=64 Done. Check the kernel ring buffer to see the summary. real 0m51.301s user 0m0.015s sys 0m0.040s urezki@pc638:~$ This patch (of 11): Currently __alloc_vmap_area() function contains an open codded logic that finds and adjusts a VA based on allocation request. Introduce a va_alloc() helper that adjusts found VA only. There is no a functional change as a result of this patch. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Kazuhito Hagio <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm,page_owner: filter out stacks by a threshold | Oscar Salvador | 1 | -1/+22
We want to be able to filter out the stacks based on a threshold we can tune. By writing to the 'count_threshold' file, we can adjust the threshold value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Marco Elver <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm,page_owner: display all stacks and their count | Oscar Salvador | 1 | -1/+92
This patch adds a new directory called 'page_owner_stacks' under /sys/kernel/debug/, with a file called 'show_stacks' in it. Reading from that file will show all stacks that were added by page_owner followed by their counting, giving us a clear overview of stack <-> count relationship. E.g: prep_new_page+0xa9/0x120 get_page_from_freelist+0x801/0x2210 __alloc_pages+0x18b/0x350 alloc_pages_mpol+0x91/0x1f0 folio_alloc+0x14/0x50 filemap_alloc_folio+0xb2/0x100 __filemap_get_folio+0x14a/0x490 ext4_write_begin+0xbd/0x4b0 [ext4] generic_perform_write+0xc1/0x1e0 ext4_buffered_write_iter+0x68/0xe0 [ext4] ext4_file_write_iter+0x70/0x740 [ext4] vfs_write+0x33d/0x420 ksys_write+0xa5/0xe0 do_syscall_64+0x80/0x160 entry_SYSCALL_64_after_hwframe+0x6e/0x76 stack_count: 4578 The seq stack_{start,next} functions will iterate through the list stack_list in order to print all stacks. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Marco Elver <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm,page_owner: implement the tracking of the stacks count | Oscar Salvador | 1 | -1/+72
Implement {inc,dec}_stack_record_count() which increments or decrements on respective allocation and free operations, via __reset_page_owner() (free operation) and __set_page_owner() (alloc operation). Newly allocated stack_record structs will be added to the list stack_list via add_stack_record_to_list(). Modifications on the list are protected via a spinlock with irqs disabled, since this code can also be reached from IRQ context. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Marco Elver <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
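A hedged sketch of the counting step (the stack_record layout lives in stackdepot and its field name is assumed here; __stack_depot_get_stack_record() is the lookup helper described in the entry just below):

    /* Sketch: bump/drop the per-stack counter on tracked alloc/free. */
    static void inc_stack_record_count(depot_stack_handle_t handle)
    {
            struct stack_record *rec = __stack_depot_get_stack_record(handle);

            if (rec)
                    refcount_inc(&rec->count);
    }

    static void dec_stack_record_count(depot_stack_handle_t handle)
    {
            struct stack_record *rec = __stack_depot_get_stack_record(handle);

            if (rec)
                    refcount_dec(&rec->count);
    }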
2024-02-23 | mm,page_owner: maintain own list of stack_records structs | Oscar Salvador | 1 | -0/+15
page_owner needs to increment a stack_record refcount when a new allocation occurs, and decrement it on a free operation. In order to do that, we need to have a way to get a stack_record from a handle. Implement __stack_depot_get_stack_record() which just does that, and make it public so page_owner can use it. Also, traversing all stackdepot buckets comes with its own complexity, plus we would have to implement a way to mark only those stack_records that were originated from page_owner, as those are the ones we are interested in. For that reason, page_owner maintains its own list of stack_records, because traversing that list is faster than traversing all buckets while keeping at the same time a low complexity. For now, add to stack_list only the stack_records of dummy_handle and failure_handle, and set their refcount of 1. Further patches will add code to increment or decrement stack_records count on allocation and free operation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Marco Elver <[email protected]> Acked-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | merge mm-hotfixes-stable into mm-nonmm-stable to pick up stackdepot changes | Andrew Morton | 7 | -102/+56
2024-02-23 | mm/debug_vm_pgtable: fix BUG_ON with pud advanced test | Aneesh Kumar K.V (IBM) | 1 | -0/+8
Architectures like powerpc add debug checks to ensure we find only devmap PUD pte entries. These debug checks are only done with CONFIG_DEBUG_VM. This patch marks the ptes used for PUD advanced test devmap pte entries so that we don't hit on debug checks on architecture like ppc64 as below. WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138 .... NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138 LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60 Call Trace: [c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable) [c000000004a2f980] [000d34c100000000] 0xd34c100000000 [c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334 [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48 [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388 Also kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202! .... NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174 LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334 Call Trace: [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable) [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334 [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48 [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388 Link: https://lkml.kernel.org/r/[email protected] Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-23 | mm: cachestat: fix folio read-after-free in cache walk | Nhat Pham | 1 | -25/+26
In cachestat, we access the folio from the page cache's xarray to compute its page offset, and check for its dirty and writeback flags. However, we do not hold a reference to the folio before performing these actions, which means the folio can concurrently be released and reused as another folio/page/slab. Get around this altogether by just using xarray's existing machinery for the folio page offsets and dirty/writeback states. This changes behavior for tmpfs files to now always report zeroes in their dirty and writeback counters. This is okay as tmpfs doesn't follow conventional writeback cache behavior: its pages get "cleaned" during swapout, after which they're no longer resident etc. Link: https://lkml.kernel.org/r/[email protected] Fixes: cf264e1329fb ("cachestat: implement cachestat syscall") Reported-by: Jann Horn <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Signed-off-by: Nhat Pham <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Tested-by: Jann Horn <[email protected]> Cc: <[email protected]> [6.4+] Signed-off-by: Andrew Morton <[email protected]>
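Conceptually, the walk now stays entirely inside the xarray and never takes a folio reference; a simplified sketch of that pattern (shadow/swap entry accounting omitted; first_index and last_index stand for the request's page-offset range):

    /* Sketch: derive the counters from the xarray walk itself, under RCU. */
    XA_STATE(xas, &mapping->i_pages, first_index);
    struct folio *folio;

    rcu_read_lock();
    xas_for_each(&xas, folio, last_index) {
            if (xas_retry(&xas, folio))
                    continue;
            if (xa_is_value(folio))
                    continue;       /* shadow/swap entry, handled separately */
            cs->nr_cache += folio_nr_pages(folio);
            if (folio_test_dirty(folio))
                    cs->nr_dirty += folio_nr_pages(folio);
            if (folio_test_writeback(folio))
                    cs->nr_writeback += folio_nr_pages(folio);
    }
    rcu_read_unlock();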
2024-02-23 | mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index | Byungchul Park | 1 | -0/+8
With numa balancing on, when a numa system is running where a numa node doesn't have its local memory so it has no managed zones, the following oops has been observed. It's because wakeup_kswapd() is called with a wrong zone index, -1. Fixed it by checking the index before calling wakeup_kswapd(). > BUG: unable to handle page fault for address: 00000000000033f3 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812) > Code: (omitted) > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286 > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480 > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003 > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940 > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > ? __die > ? page_fault_oops > ? __pte_offset_map_lock > ? exc_page_fault > ? asm_exc_page_fault > ? wakeup_kswapd > migrate_misplaced_page > __handle_mm_fault > handle_mm_fault > do_user_addr_fault > exc_page_fault > asm_exc_page_fault > RIP: 0033:0x55b897ba0808 > Code: (omitted) > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287 > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0 > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0 > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075 > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000 > </TASK> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Byungchul Park <[email protected]> Reported-by: Hyeongtak Ji <[email protected]> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system") Reviewed-by: Oscar Salvador <[email protected]> Cc: Baolin Wang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
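The fix amounts to a cheap guard before waking kswapd; a sketch of the pattern described, simplified from the NUMA migration path (surrounding function omitted):

    /* Sketch: find the highest managed zone of the target node; if the
     * node has none, there is nothing for kswapd to do, so bail out. */
    int z;

    for (z = pgdat->nr_zones - 1; z >= 0; z--) {
            if (managed_zone(pgdat->node_zones + z))
                    break;
    }
    if (z < 0)
            return 0;

    wakeup_kswapd(pgdat->node_zones + z, 0, folio_order(folio), ZONE_MOVABLE);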
2024-02-23 | kasan: revert eviction of stack traces in generic mode | Marco Elver | 4 | -77/+14
This partially reverts commits cc478e0b6bdf, 63b85ac56a64, 08d7c94d9635, a414d4286f34, and 773688a6cb24 to make use of variable-sized stack depot records, since eviction of stack entries from stack depot forces fixed- sized stack records. Care was taken to retain the code cleanups by the above commits. Eviction was added to generic KASAN as a response to alleviating the additional memory usage from fixed-sized stack records, but this still uses more memory than previously. With the re-introduction of variable-sized records for stack depot, we can just switch back to non-evictable stack records again, and return back to the previous performance and memory usage baseline. Before (observed after a KASAN kernel boot): pools: 597 refcounted_allocations: 17547 refcounted_frees: 6477 refcounted_in_use: 11070 freelist_size: 3497 persistent_count: 12163 persistent_bytes: 1717008 After: pools: 319 refcounted_allocations: 0 refcounted_frees: 0 refcounted_in_use: 0 freelist_size: 0 persistent_count: 29397 persistent_bytes: 5183536 As can be seen from the counters, with a generic KASAN config, refcounted allocations and evictions are no longer used. Due to using variable-sized records, I observe a reduction of 278 stack depot pools (saving 4448 KiB) with my test setup. Link: https://lkml.kernel.org/r/[email protected] Fixes: cc478e0b6bdf ("kasan: avoid resetting aux_lock") Fixes: 63b85ac56a64 ("kasan: stop leaking stack trace handles") Fixes: 08d7c94d9635 ("kasan: memset free track in qlink_free") Fixes: a414d4286f34 ("kasan: handle concurrent kasan_record_aux_stack calls") Fixes: 773688a6cb24 ("kasan: use stack_depot_put for Generic mode") Signed-off-by: Marco Elver <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22 | treewide: update LLVM Bugzilla links | Nathan Chancellor | 1 | -1/+1
LLVM moved their issue tracker from their own Bugzilla instance to GitHub issues. While all of the links are still valid, they may not necessarily show the most up to date information around the issues, as all updates will occur on GitHub, not Bugzilla. Another complication is that the Bugzilla issue number is not always the same as the GitHub issue number. Thankfully, LLVM maintains this mapping through two shortlinks: https://llvm.org/bz<num> -> https://bugs.llvm.org/show_bug.cgi?id=<num> https://llvm.org/pr<num> -> https://github.com/llvm/llvm-project/issues/<mapped_num> Switch all "https://bugs.llvm.org/show_bug.cgi?id=<num>" links to the "https://llvm.org/pr<num>" shortlink so that the links show the most up to date information. Each migrated issue links back to the Bugzilla entry, so there should be no loss of fidelity of information here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nathan Chancellor <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Fangrui Song <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Mykola Lysenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22 | userfaultfd: use per-vma locks in userfaultfd operations | Lokesh Gidra | 2 | -90/+295
All userfaultfd operations, except write-protect, opportunistically use per-vma locks to lock vmas. On failure, attempt again inside mmap_lock critical section. Write-protect operation requires mmap_lock as it iterates over multiple vmas. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lokesh Gidra <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
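The locking pattern is the usual try-per-vma-lock-then-fall-back approach; a hedged sketch (the real code also handles anon_vma preparation and retries, which are omitted here):

    /* Sketch: prefer the cheap per-vma lock, fall back to mmap_lock. */
    vma = lock_vma_under_rcu(mm, dst_start);
    if (vma) {
            /* ... perform the userfaultfd operation on this vma ... */
            vma_end_read(vma);
    } else {
            mmap_read_lock(mm);
            vma = vma_lookup(mm, dst_start);
            if (vma) {
                    /* ... same operation, under mmap_lock instead ... */
            }
            mmap_read_unlock(mm);
    }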
2024-02-22 | userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx | Lokesh Gidra | 1 | -27/+35
Increments and loads to mmap_changing are always in mmap_lock critical section. This ensures that if userspace requests event notification for non-cooperative operations (e.g. mremap), userfaultfd operations don't occur concurrently. This can be achieved by using a separate read-write semaphore in userfaultfd_ctx such that increments are done in write-mode and loads in read-mode, thereby eliminating the dependency on mmap_lock for this purpose. This is a preparatory step before we replace mmap_lock usage with per-vma locks in fill/move ioctls. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lokesh Gidra <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22 | kasan: increase the number of bits to shift when recording extra timestamps | Juntong Deng | 2 | -2/+2
In 5d4c6ac94694 ("kasan: record and report more information") I thought that printk only displays a maximum of 99999 seconds, but actually printk can display a larger number of seconds. So increase the number of bits to shift when recording the extra timestamp (44 bits), without affecting the precision, shift it right by 9 bits, discarding all bits that do not affect the microsecond part (nanoseconds will not be shown). Currently the maximum time that can be displayed is 9007199.254740s, because 11111111111111111111111111111111111111111111 (44 bits) << 9 = 11111111111111111111111111111111111111111111000000000 = 9007199.254740 Link: https://lkml.kernel.org/r/AM6PR03MB58481629F2F28CE007412139994D2@AM6PR03MB5848.eurprd03.prod.outlook.com Fixes: 5d4c6ac94694 ("kasan: record and report more information") Signed-off-by: Juntong Deng <[email protected]> Acked-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22 | rmap: replace two calls to compound_order with folio_order | Matthew Wilcox (Oracle) | 1 | -2/+2
Removes two unnecessary conversions from folio to page. Should be no difference in behaviour. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22 | Introduce cpu_dcache_is_aliasing() across all architectures | Mathieu Desnoyers | 1 | -0/+6
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Russell King <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michael Sclafani <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Mikulas Patocka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
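The generic fallback is essentially a compile-time constant; a sketch of the shape described (the exact header placement is an assumption):

    /* Sketch: architectures that don't opt in report "no dcache aliasing". */
    #ifdef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
    #include <asm/cachetype.h>      /* the architecture defines cpu_dcache_is_aliasing() */
    #else
    #define cpu_dcache_is_aliasing()        false
    #endif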
2024-02-22 | mm: add pte_batch_hint() to reduce scanning in folio_pte_batch() | Ryan Roberts | 1 | -7/+12
Some architectures (e.g. arm64) can tell from looking at a pte, if some follow-on ptes also map contiguous physical memory with the same pgprot. (for arm64, these are contpte mappings). Take advantage of this knowledge to optimize folio_pte_batch() so that it can skip these ptes when scanning to create a batch. By default, if an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so the changes are optimized out and the behaviour is as before. arm64 will opt-in to providing this hint in the next patch, which will greatly reduce the cost of ptep_get() when scanning a range of contptes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: John Hubbard <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Barry Song <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Morse <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
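The opt-out default is a compile-time constant, which is what lets folio_pte_batch() optimize the extra work away; a sketch matching the description (exact signature assumed):

    /* Sketch: generic fallback; architectures such as arm64 override this. */
    #ifndef pte_batch_hint
    static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
    {
            return 1;       /* no hint: inspect the ptes one by one */
    }
    #endif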
2024-02-22 | mm: thp: batch-collapse PMD with set_ptes() | Ryan Roberts | 1 | -25/+33
Refactor __split_huge_pmd_locked() so that a present PMD can be collapsed to PTEs in a single batch using set_ptes(). This should improve performance a little bit, but the real motivation is to remove the need for the arm64 backend to have to fold the contpte entries. Instead, since the ptes are set as a batch, the contpte blocks can be initially set up pre-folded (once the arm64 contpte support is added in the next few patches). This leads to noticeable performance improvement during split. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Barry Song <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Morse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: optimize unmap/zap with PTE-mapped THPDavid Hildenbrand1-26/+66
Similar to how we optimized fork(), let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio. Most infrastructure we need for batching (mmu gather, rmap) is already there. We only have to add get_and_clear_full_ptes() and clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to process a PTE range. We won't bother sanity-checking the mapcount of all subpages, but only check the mapcount of the first subpage we process. If there is a real problem hiding somewhere, we can trigger it simply by using small folios, or when we zap single pages of a large folio. Ideally, we would have that check in the rmap code (including for delayed rmap), but then we could not print the PTE. Let's keep it simple for now. If we ever have a cheap folio_mapcount(), we might just want to check for underflows there. To keep small folios as fast as possible, force inlining of a specialized variant using __always_inline with nr=1. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
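As a sketch of what a generic fallback for the new helper could look like, assuming the behaviour described above (clear a range of ptes belonging to one folio and fold the accessed/dirty state into the returned pte); the exact upstream definition may differ:

	static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep, unsigned int nr, int full)
	{
		pte_t pte, tmp_pte;

		pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		while (--nr) {
			ptep++;
			addr += PAGE_SIZE;
			tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
			/* Fold per-pte dirty/young into the folio-wide result. */
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
		return pte;
	}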
2024-02-22mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeingDavid Hildenbrand1-15/+43
In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now up to 256 folio fragments that span more than one page, before we conditionally reschedule. It's a pain that we have to handle cond_resched() in tlb_batch_pages_flush() manually and cannot simply handle it in release_pages() -- release_pages() can be called from atomic context. Well, in a perfect world we wouldn't have to make our code more complicated at all. With page poisoning and init_on_free, we might now run into soft lockups when we free a lot of rather large folio fragments, because page freeing time then depends on the actual memory size we are freeing instead of on the number of folios that are involved. In the absolute (unlikely) worst case, on arm64 with 64k pages we will be able to free up to 256 folio fragments that each span 512 MiB: zeroing out 128 GiB does sound like it might take a while. But instead of ignoring this unlikely case, let's just handle it. So, let's teach tlb_batch_pages_flush() that there are some configurations where page freeing is horribly slow, and let's reschedule more frequently -- similar to what we did before we had large folio fragments in there. Avoid yet another loop over all encoded pages in the common case by handling that separately. Note that with page poisoning/zeroing, we might now end up freeing only a single folio fragment at a time that might exceed the old 512-page limit: but if we cannot even free a single MAX_ORDER page on a system without running into soft lockups, something else is already completely bogus. Freeing a PMD-mapped THP would similarly cause trouble. In theory, we might even free 511 order-0 pages + a single MAX_ORDER page, effectively having to zero out 8703 pages on arm64 with 64k pages, translating to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups, 544 MiB is unlikely to result in soft lockups, so we won't care about that for the time being. In the future, we might want to detect if handling cond_resched() is required at all, and just not do any of that with full preemption enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
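A rough sketch of the rescheduling strategy described above (the constant and helper name are hypothetical; only the idea of freeing in chunks and calling cond_resched() in between when freeing can be slow comes from the text):

	#define SLOW_FREE_CHUNK	512	/* hypothetical chunk size */

	static void free_pages_slow_path(struct encoded_page **pages, unsigned int nr)
	{
		/*
		 * With page poisoning/init_on_free, freeing time scales with the
		 * amount of memory rather than the number of folios, so free in
		 * smaller chunks and reschedule between them.
		 */
		while (nr) {
			unsigned int chunk = min_t(unsigned int, nr, SLOW_FREE_CHUNK);

			free_pages_and_swap_cache(pages, chunk);
			pages += chunk;
			nr -= chunk;
			cond_resched();
		}
	}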
2024-02-22mm/mmu_gather: add __tlb_remove_folio_pages()David Hildenbrand3-14/+74
Add __tlb_remove_folio_pages(), which will remove multiple consecutive pages that belong to the same large folio, instead of only a single page. We'll be using this function when optimizing unmapping/zapping of large folios that are mapped by PTEs. We're using the remaining spare bit in an encoded_page to indicate that the next encoded page in the array actually contains a shifted "nr_pages" value. Teach swap/freeing code about putting multiple folio references, and delayed rmap handling to remove page ranges of a folio. This extension still allows gathering almost as many small folios as we used to (-1, because we have to prepare for a possibly bigger next entry), while also allowing us to gather consecutive pages that belong to the same large folio. Note that we don't pass the folio pointer, because it is not required for now. Further, we don't support page_size != PAGE_SIZE; it won't be required for simple PTE batching. We have to provide a separate s390 implementation, but it's fairly straightforward. Another, more invasive and likely more expensive, approach would be to use folio+range or a PFN range instead of page+nr_pages. But we should do that consistently for the whole mmu_gather. For now, let's keep it simple and add "nr_pages" only. Note that it is now possible to gather significantly more pages: In the past, we were able to gather ~10000 pages, now we can also gather ~5000 folio fragments that span multiple pages. A folio fragment on x86-64 can span up to 512 pages (2 MiB THP) and on arm64 with 64k pages in theory 8192 pages (512 MiB THP). Gathering more memory is not considered something we should worry about, especially because these are already corner cases. While we can gather more total memory, we won't free more folio fragments. As long as page freeing time primarily depends only on the number of involved folios, there is no effective change for !preempt configurations. However, we'll adjust tlb_batch_pages_flush() separately to handle corner cases where page freeing time grows proportionally with the actual memory size. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
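The array layout described above can be pictured roughly as follows (a sketch; the flag and accessor names are written as plausible examples of the scheme and are not guaranteed to match the tree):

	/* Walk an mmu_gather batch in which a flagged entry means "the next
	 * array slot carries a shifted nr_pages value, not a page pointer". */
	for (i = 0; i < batch->nr; i++) {
		struct encoded_page *enc = batch->encoded_pages[i];
		struct page *page = encoded_page_ptr(enc);
		unsigned int nr = 1;

		if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_NR_PAGES_NEXT)
			nr = encoded_nr_pages(batch->encoded_pages[++i]);

		/* Drop 'nr' references on the folio these pages belong to. */
		folio_put_refs(page_folio(page), nr);
	}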
2024-02-22mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAPDavid Hildenbrand1-2/+3
Nowadays, encoded pages are only used in mmu_gather handling. Let's update the documentation and define ENCODED_PAGE_BIT_DELAY_RMAP. While at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS. If encoded page pointers are ever used in another context again, we'd likely want to change the defines to reflect that context (e.g., ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple. This is a preparation for using the remaining spare bit to indicate that the next item in an array of encoded pages is a "nr_pages" argument and not an encoded page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
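For reference, packing such flags into the low, always-zero bits of an aligned page pointer typically looks like this (a sketch; only the ENCODED_PAGE_BITS and ENCODED_PAGE_BIT_DELAY_RMAP names come from the text):

	/* Low bits of a suitably aligned page pointer are reused as flags. */
	#define ENCODED_PAGE_BITS		3ul
	/* mmu_gather: defer rmap removal until after the TLB flush. */
	#define ENCODED_PAGE_BIT_DELAY_RMAP	1ul

	static __always_inline struct encoded_page *encode_page(struct page *page,
								 unsigned long flags)
	{
		BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
		return (struct encoded_page *)(flags | (unsigned long)page);
	}

	static inline unsigned long encoded_page_flags(struct encoded_page *page)
	{
		return ENCODED_PAGE_BITS & (unsigned long)page;
	}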
2024-02-22mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()David Hildenbrand1-3/+4
We have two bits available in the encoded page pointer to store additional information. Currently, we use one bit to request delay of the rmap removal until after a TLB flush. We want to make use of the remaining bit internally for batching of multiple pages of the same folio, specifying that the next encoded page pointer in the array actually holds "nr_pages". So pass the page + delay_rmap flag instead of an encoded page, and handle the encoding internally. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
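In terms of the interface, the change boils down to moving the encode step inside the mmu_gather code; roughly (a sketch of the before/after prototypes as described, not checked against the tree):

	/* Before: the caller encodes the delay-rmap flag into the pointer itself. */
	bool __tlb_remove_page_size(struct mmu_gather *tlb,
				    struct encoded_page *page, int page_size);

	/* After: the caller passes the flag; the encoding happens internally. */
	bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
				    bool delay_rmap, int page_size);

	/* ...and inside the implementation: */
	batch->encoded_pages[batch->nr++] =
		encode_page(page, delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0);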
2024-02-22mm/memory: factor out zapping folio pte into zap_present_folio_pte()David Hildenbrand1-21/+32
Let's prepare for further changes by factoring the zapping of a present folio's pte out into a separate function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>