blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2023-12-29	mm: migrate: fix getting incorrect page mapping during page migration	Baolin Wang	1	-17/+10
	When running stress-ng testing, we found below kernel crash after a few hours: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 pc : dentry_name+0xd8/0x224 lr : pointer+0x22c/0x370 sp : ffff800025f134c0 ...... Call trace: dentry_name+0xd8/0x224 pointer+0x22c/0x370 vsnprintf+0x1ec/0x730 vscnprintf+0x2c/0x60 vprintk_store+0x70/0x234 vprintk_emit+0xe0/0x24c vprintk_default+0x3c/0x44 vprintk_func+0x84/0x2d0 printk+0x64/0x88 __dump_page+0x52c/0x530 dump_page+0x14/0x20 set_migratetype_isolate+0x110/0x224 start_isolate_page_range+0xc4/0x20c offline_pages+0x124/0x474 memory_block_offline+0x44/0xf4 memory_subsys_offline+0x3c/0x70 device_offline+0xf0/0x120 ...... After analyzing the vmcore, I found this issue is caused by page migration. The scenario is that, one thread is doing page migration, and we will use the target page's ->mapping field to save 'anon_vma' pointer between page unmap and page move, and now the target page is locked and refcount is 1. Currently, there is another stress-ng thread performing memory hotplug, attempting to offline the target page that is being migrated. It discovers that the refcount of this target page is 1, preventing the offline operation, thus proceeding to dump the page. However, page_mapping() of the target page may return an incorrect file mapping to crash the system in dump_mapping(), since the target page->mapping only saves 'anon_vma' pointer without setting PAGE_MAPPING_ANON flag. There are seveval ways to fix this issue: (1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving 'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target page has not built mappings yet. (2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing the system, however, there are still some PFN walkers that call page_mapping() without holding the page lock, such as compaction. (3) Using target page->private field to save the 'anon_vma' pointer and 2 bits page state, just as page->mapping records an anonymous page, which can remove the page_mapping() impact for PFN walkers and also seems a simple way. So I choose option 3 to fix this issue, and this can also fix other potential issues for PFN walkers, such as compaction. Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Xu Yu <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert swap_cluster_readahead and swap_vma_readahead to return a folio	Matthew Wilcox (Oracle)	3	-19/+19
	shmem_swapin_cluster() immediately converts the page back to a folio, and swapin_readahead() may as well call folio_file_page() once instead of having each function call it. [[email protected]: avoid NULL pointer deref] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: return a folio from read_swap_cache_async()	Matthew Wilcox (Oracle)	3	-19/+18
	The only two callers simply call put_page() on the page returned, so they're happier calling folio_put(). Saves two calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove page_swap_info()	Matthew Wilcox (Oracle)	3	-10/+3
	It's more efficient to get the swap_info_struct by calling swp_swap_info() directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert swap_readpage() to swap_read_folio()	Matthew Wilcox (Oracle)	5	-20/+21
	All callers have a folio, so pass it in, saving two calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert swap_page_sector() to swap_folio_sector()	Matthew Wilcox (Oracle)	3	-8/+8
	All callers have a folio, so pass it in. Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_readpage_bdev_async()	Matthew Wilcox (Oracle)	1	-4/+4
	Make it plain that this takes the head page (which before this point was just an assumption, but is now enforced by the compiler). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_readpage_bdev_sync()	Matthew Wilcox (Oracle)	1	-4/+4
	Make it plain that this takes the head page (which before this point was just an assumption, but is now enforced by the compiler). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_readpage_fs()	Matthew Wilcox (Oracle)	1	-7/+6
	Saves a call to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_writepage_bdev_async()	Matthew Wilcox (Oracle)	1	-5/+4
	Saves a call to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_writepage_bdev_sync()	Matthew Wilcox (Oracle)	1	-5/+4
	Saves a call to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to swap_writepage_fs()	Matthew Wilcox (Oracle)	1	-9/+9
	Saves several calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: pass a folio to __swap_writepage()	Matthew Wilcox (Oracle)	3	-9/+9
	Both callers now have a folio, so pass that in instead of the page. Removes a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: return the folio from __read_swap_cache_async()	Matthew Wilcox (Oracle)	4	-75/+69
	Patch series "More swap folio conversions". These all seem like fairly straightforward conversions to me. A lot of compound_head() calls get removed. And page_swap_info(), which is nice. This patch (of 13): Move the folio->page conversion into the callers that actually want that. Most of the callers are happier with the folio anyway. If the page_allocated boolean is set, the folio allocated is of order-0, so it is safe to pass the page directly to swap_readpage(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx	Chengming Zhou	2	-72/+33
	First of all, we need to rename acomp_ctx->dstmem field to buffer, since we are now using for purposes other than compression. Then we change per-cpu mutex and buffer to per-acomp_ctx, since them belong to the acomp_ctx and are necessary parts when used in the compress/decompress contexts. So we can remove the old per-cpu mutex and dstmem. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Reviewed-by: Nhat Pham <[email protected]> Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/zswap: cleanup zswap_writeback_entry()	Chengming Zhou	1	-19/+10
	Also after the common decompress part goes to __zswap_load(), we can cleanup the zswap_writeback_entry() a little. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/zswap: cleanup zswap_load()	Chengming Zhou	1	-9/+5
	After the common decompress part goes to __zswap_load(), we can cleanup the zswap_load() a little. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Acked-by: Chis Li <[email protected]> (Google) Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/zswap: refactor out __zswap_load()	Chengming Zhou	1	-60/+32
	zswap_load() and zswap_writeback_entry() have the same part that decompress the data from zswap_entry to page, so refactor out the common part as __zswap_load(entry, page). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/zswap: reuse dstmem when decompress	Chengming Zhou	1	-32/+12
	Patch series "mm/zswap: dstmem reuse optimizations and cleanups", v5. The problem this series tries to optimize is that zswap_load() and zswap_writeback_entry() have to malloc a temporary memory to support !zpool_can_sleep_mapped(). We can avoid it by reusing the percpu crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex. This patch (of 5): In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first copy the entry->handle memory to a temporary memory, which is allocated using kmalloc. Obviously we can reuse the per-compressor dstmem to avoid allocating every time, since it's percpu-compressor and protected in percpu mutex. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/ksm: document ksm advisor and its sysfs knobs	Stefan Roesch	1	-0/+55
	This documents the KSM advisor and its new knobs in /sys/fs/kernel/mm. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/ksm: add tracepoint for ksm advisor	Stefan Roesch	2	-0/+34
	This adds a new tracepoint for the ksm advisor. It reports the last scan time, the new setting of the pages_to_scan parameter and the average cpu percent usage of the ksmd background thread for the last scan. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/ksm: add sysfs knobs for advisor	Stefan Roesch	1	-0/+148
	This adds four new knobs for the KSM advisor to influence its behaviour. The knobs are: - advisor_mode: none: no advisor (default) scan-time: scan time advisor - advisor_max_cpu: 70 (default, cpu usage percent) - advisor_min_pages_to_scan: 500 (default) - advisor_max_pages_to_scan: 30000 (default) - advisor_target_scan_time: 200 (default in seconds) The new values will take effect on the next scan round. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/ksm: add ksm advisor	Stefan Roesch	1	-1/+157
	Patch series "mm/ksm: Add ksm advisor", v5. What is the KSM advisor? ========================= The ksm advisor automatically manages the pages_to_scan setting to achieve a target scan time. The target scan time defines how many seconds it should take to scan all the candidate KSM pages. In other words the pages_to_scan rate is changed by the advisor to achieve the target scan time. Why do we need a KSM advisor? ============================== The number of candidate pages for KSM is dynamic. It can often be observed that during the startup of an application more candidate pages need to be processed. Without an advisor the pages_to_scan parameter needs to be sized for the maximum number of candidate pages. With the scan time advisor the pages_to_scan parameter based can be changed based on demand. Algorithm ========== The algorithm calculates the change value based on the target scan time and the previous scan time. To avoid pertubations an exponentially weighted moving average is applied. The algorithm has a max and min value to: - guarantee responsiveness to changes - to limit CPU resource consumption Parameters to influence the KSM scan advisor ============================================= The respective parameters are: - ksm_advisor_mode 0: None (default), 1: scan time advisor - ksm_advisor_target_scan_time how many seconds a scan should of all candidate pages take - ksm_advisor_max_cpu upper limit for the cpu usage in percent of the ksmd background thread The initial value and the max value for the pages_to_scan parameter can be limited with: - ksm_advisor_min_pages_to_scan minimum value for pages_to_scan per batch - ksm_advisor_max_pages_to_scan maximum value for pages_to_scan per batch The default settings for the above two parameters should be suitable for most workloads. The parameters are exposed as knobs in /sys/kernel/mm/ksm. By default the scan time advisor is disabled. Currently there are two advisors: - none and - scan-time. Resource savings ================= Tests with various workloads have shown considerable CPU savings. Most of the workloads I have investigated have more candidate pages during startup. Once the workload is stable in terms of memory, the number of candidate pages is reduced. Without the advisor, the pages_to_scan needs to be sized for the maximum number of candidate pages. So having this advisor definitely helps in reducing CPU consumption. For the instagram workload, the advisor achieves a 25% CPU reduction. Once the memory is stable, the pages_to_scan parameter gets reduced to about 40% of its max value. The new advisor works especially well if the smart scan feature is also enabled. How is defining a target scan time better? =========================================== For an administrator it is more logical to set a target scan time.. The administrator can determine how many pages are scanned on each scan. Therefore setting a target scan time makes more sense. In addition the administrator might have a good idea about the memory sizing of its respective workloads. Setting cpu limits is easier than setting The pages_to_scan parameter. The pages_to_scan parameter is per batch. For the administrator it is difficult to set the pages_to_scan parameter. Tracing ======= A new tracing event has been added for the scan time advisor. The new trace event is called ksm_advisor. It reports the scan time, the new pages_to_scan setting and the cpu usage of the ksmd background thread. Other approaches ================= Approach 1: Adapt pages_to_scan after processing each batch. If KSM merges pages, increase the scan rate, if less KSM pages, reduce the the pages_to_scan rate. This doesn't work too well. While it increases the pages_to_scan for a short period, but generally it ends up with a too low pages_to_scan rate. Approach 2: Adapt pages_to_scan after each scan. The problem with that approach is that the calculated scan rate tends to be high. The more aggressive KSM scans, the more pages it can de-duplicate. There have been earlier attempts at an advisor: propose auto-run mode of ksm and its tests (https://marc.info/?l=linux-mm&m=166029880214485&w=2) This patch (of 5): This adds the ksm advisor. The ksm advisor automatically manages the pages_to_scan setting to achieve a target scan time. The target scan time defines how many seconds it should take to scan all the candidate KSM pages. In other words the pages_to_scan rate is changed by the advisor to achieve the target scan time. The algorithm has a max and min value to: - guarantee responsiveness to changes - limit CPU resource consumption The respective parameters are: - ksm_advisor_target_scan_time (how many seconds a scan should take) - ksm_advisor_max_cpu (maximum value for cpu percent usage) - ksm_advisor_min_pages (minimum value for pages_to_scan per batch) - ksm_advisor_max_pages (maximum value for pages_to_scan per batch) The algorithm calculates the change value based on the target scan time and the previous scan time. To avoid pertubations an exponentially weighted moving average is applied. The advisor is managed by two main parameters: target scan time, cpu max time for the ksmd background thread. These parameters determine how aggresive ksmd scans. In addition there are min and max values for the pages_to_scan parameter to make sure that its initial and max values are not set too low or too high. This ensures that it is able to react to changes quickly enough. The default values are: - target scan time: 200 secs - max cpu: 70% - min pages: 500 - max pages: 30000 By default the advisor is disabled. Currently there are two advisors: none and scan-time. Tests with various workloads have shown considerable CPU savings. Most of the workloads I have investigated have more candidate pages during startup, once the workload is stable in terms of memory, the number of candidate pages is reduced. Without the advisor, the pages_to_scan needs to be sized for the maximum number of candidate pages. So having this advisor definitely helps in reducing CPU consumption. For the instagram workload, the advisor achieves a 25% CPU reduction. Once the memory is stable, the pages_to_scan parameter gets reduced to about 40% of its max value. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stefan Roesch <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictable	Matthew Wilcox (Oracle)	3	-21/+0
	All callers have now been converted to folio_add_new_anon_rmap() and folio_add_lru_vma() so we can remove the wrapper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert collapse_huge_page() to use a folio	Matthew Wilcox (Oracle)	1	-7/+8
	Replace three calls to compound_head() with one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert migrate_vma_insert_page() to use a folio	Matthew Wilcox (Oracle)	1	-11/+12
	Replaces five calls to compound_head() with one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove references to page_add_new_anon_rmap in comments	Matthew Wilcox (Oracle)	1	-2/+2
	Refer to folio_add_new_anon_rmap() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove stale example from comment	Matthew Wilcox (Oracle)	1	-14/+4
	folio_add_new_anon_rmap() no longer works this way, so just remove the entire example. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove some calls to page_add_new_anon_rmap()	Matthew Wilcox (Oracle)	3	-3/+3
	We already have the folio in these functions, we just need to use it. folio_add_new_anon_rmap() didn't exist at the time they were converted to folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert unuse_pte() to use a folio throughout	Matthew Wilcox (Oracle)	1	-22/+25
	Saves about eight calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: remove PageAnonExclusive assertions in unuse_pte()	Matthew Wilcox (Oracle)	1	-4/+0
	The page in question is either freshly allocated or known to be in the swap cache; these assertions are not particularly useful. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm: convert ksm_might_need_to_copy() to work on folios	Matthew Wilcox (Oracle)	4	-20/+26
	Patch series "Finish two folio conversions". Most callers of page_add_new_anon_rmap() and lru_cache_add_inactive_or_unevictable() have been converted to their folio equivalents, but there are still a few stragglers. There's a bit of preparatory work in ksm and unuse_pte(), but after that it's pretty mechanical. This patch (of 9): Accept a folio as an argument and return a folio result. Removes a call to compound_head() in do_swap_page(), and prevents folio & page from getting out of sync in unuse_pte(). Reviewed-by: David Hildenbrand <[email protected]> [[email protected]: fix smatch warning] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: only adjust the page if the folio changed] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	selftests/mm: add UFFDIO_MOVE ioctl test	Suren Baghdasaryan	3	-0/+214
	Add tests for new UFFDIO_MOVE ioctl which uses uffd to move source into destination buffer while checking the contents of both after the move. After the operation the content of the destination buffer should match the original source buffer's content while the source buffer should be zeroed. Separate tests are designed for PMD aligned and unaligned cases because they utilize different code paths in the kernel. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	selftests/mm: add uffd_test_case_ops to allow test case-specific operations	Suren Baghdasaryan	3	-0/+22
	Currently each test can specify unique operations using uffd_test_ops, however these operations are per-memory type and not per-test. Add uffd_test_case_ops which each test case can customize for its own needs regardless of the memory type being used. Pre- and post-allocation operations are added, some of which will be used in the next patch to implement test-specific operations like madvise after memory is allocated but before it is accessed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	selftests/mm: call uffd_test_ctx_clear at the end of the test	Suren Baghdasaryan	4	-4/+7
	uffd_test_ctx_clear() is being called from uffd_test_ctx_init() to unmap areas used in the previous test run. This approach is problematic because while unmapping areas uffd_test_ctx_clear() uses page_size and nr_pages which might differ from one test run to another. Fix this by calling uffd_test_ctx_clear() after each test is done. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Reviewed-by: Axel Rasmussen <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	userfaultfd: UFFDIO_MOVE uABI	Andrea Arcangeli	9	-1/+864
	Implement the uABI of UFFDIO_MOVE ioctl. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [2]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [2]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/ Update for the ioctl_userfaultfd(2) manpage: UFFDIO_MOVE (Since Linux xxx) Move a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to move are specified by the src, dst, and len fields of the uffdio_move structure pointed to by argp: struct uffdio_move { __u64 dst; /* Destination of move / __u64 src; / Source of move / __u64 len; / Number of bytes to move / __u64 mode; / Flags controlling behavior of move / __s64 move; / Number of bytes moved, or negated error */ }; The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_MOVE operation: UFFDIO_MOVE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES Allow holes in the source virtual range that is being moved. When not specified, the holes will result in ENOENT error. When specified, the holes will be accounted as successfully moved memory. This is mostly useful to move hugepage aligned virtual regions without knowing if there are transparent hugepages in the regions or not, but preventing the risk of having to split the hugepage during the operation. The move field is used by the kernel to return the number of bytes that was actually moved, or an error (a negated errno- style value). If the value returned in move doesn't match the value that was specified in len, the operation fails with the error EAGAIN. The move field is output-only; it is not read by the UFFDIO_MOVE operation. The operation may fail for various reasons. Usually, remapping of pages that are not exclusive to the given process fail; once KSM might deduplicate pages or fork() COW-shares pages during fork() with child processes, they are no longer exclusive. Further, the kernel might only perform lightweight checks for detecting whether the pages are exclusive, and return -EBUSY in case that check fails. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source VMA before fork(). This ioctl(2) operation returns 0 on success. In this case, the entire area was moved. On error, -1 is returned and errno is set to indicate the error. Possible errors include: EAGAIN The number of bytes moved (i.e., the value returned in the move field) does not equal the value that was specified in the len field. EINVAL Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid. EINVAL An invalid bit was specified in the mode field. ENOENT The source virtual memory range has unmapped holes and UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. EEXIST The destination virtual memory range is fully or partially mapped. EBUSY The pages in the source virtual memory range are either pinned or not exclusive to the process. The kernel might only perform lightweight checks for detecting whether the pages are exclusive. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source virtual memory area before fork(). ENOMEM Allocating memory needed for the operation failed. ESRCH The target process has exited at the time of a UFFDIO_MOVE operation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Al Viro <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	mm/rmap: support move to different root anon_vma in folio_move_anon_rmap()	Andrea Arcangeli	1	-0/+24
	Patch series "userfaultfd move option", v6. This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has long been implemented and maintained by Andrea in his local tree [1], but was not upstreamed due to lack of use cases where this approach would be better than allocating a new page and copying the contents. Previous upstraming attempts could be found at [6] and [7]. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [3]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. TODOs for follow-up improvements: - cross-mm support. Known differences from single-mm and missing pieces: - memcg recharging (might need to isolate pages in the process) - mm counters - cross-mm deposit table moves - cross-mm test - document the address space where src and dest reside in struct uffdio_move - TLB flush batching. Will require extensive changes to PTL locking in move_pages_pte(). OTOH that might let us reuse parts of mremap code. This patch (of 5): For now, folio_move_anon_rmap() was only used to move a folio to a different anon_vma after fork(), whereby the root anon_vma stayed unchanged. For that, it was sufficient to hold the folio lock when calling folio_move_anon_rmap(). However, we want to make use of folio_move_anon_rmap() to move folios between VMAs that have a different root anon_vma. As folio_referenced() performs an RMAP walk without holding the folio lock but only holding the anon_vma in read mode, holding the folio lock is insufficient. When moving to an anon_vma with a different root anon_vma, we'll have to hold both, the folio lock and the anon_vma lock in write mode. Consequently, whenever we succeeded in folio_lock_anon_vma_read() to read-lock the anon_vma, we have to re-check if the mapping was changed in the meantime. If that was the case, we have to retry. Note that folio_move_anon_rmap() must only be called if the anon page is exclusive to a process, and must not be called on KSM folios. This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the anon_vma lock in write mode, and the mmap_lock in read mode. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Al Viro <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: [email protected] Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: fix more functions for block size > PAGE_SIZE	Matthew Wilcox (Oracle)	1	-21/+6
	Both __block_write_full_folio() and block_read_full_folio() assumed that block size <= PAGE_SIZE. Replace the shift with a divide, which is probably cheaper than first calculating the shift. That lets us remove block_size_bits() as these were the last callers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: handle large folios in __block_write_begin_int()	Matthew Wilcox (Oracle)	1	-10/+7
	When __block_write_begin_int() was converted to support folios, we did not expect large folios to be passed to it. With the current work to support large block size storage devices, this will no longer be true so change the checks on 'from' and 'to' to be related to the size of the folio instead of PAGE_SIZE. Also remove an assumption that the block size is smaller than PAGE_SIZE. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reported-by: Ryusuke Konishi <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Pankaj Raghav <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: fix various functions for block size > PAGE_SIZE	Matthew Wilcox (Oracle)	1	-5/+5
	If i_blkbits is larger than PAGE_SHIFT, we shift by a negative number, which is undefined. It is safe to shift the block left as a block device must be smaller than MAX_LFS_FILESIZE, which is guaranteed to fit in loff_t. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: cast block to loff_t before shifting it	Matthew Wilcox (Oracle)	1	-1/+1
	While sector_t is always defined as a u64 today, that hasn't always been the case and it might not always be the same size as loff_t in the future. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: fix grow_buffers() for block size > PAGE_SIZE	Matthew Wilcox (Oracle)	1	-11/+6
	We must not shift by a negative number so work in terms of a byte offset to avoid the awkward shift left-or-right-depending-on-sign option. This means we need to use check_mul_overflow() to ensure that a large block number does not result in a wrap. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryusuke Konishi <[email protected]> [[email protected]: add cast in grow_buffers() to avoid a multiplication libcall] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: calculate block number inside folio_init_buffers()	Matthew Wilcox (Oracle)	1	-7/+6
	The calculation of block from index doesn't work for devices with a block size larger than PAGE_SIZE as we end up shifting by a negative number. Instead, calculate the number of the first block from the folio's position in the block device. We no longer need to pass sizebits to grow_dev_folio(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	buffer: return bool from grow_dev_folio()	Matthew Wilcox (Oracle)	1	-25/+25
	Patch series "More buffer_head cleanups", v2. The first patch is a left-over from last cycle. The rest fix "obvious" block size > PAGE_SIZE problems. I haven't tested with a large block size setup (but I have done an ext4 xfstests run). This patch (of 7): Rename grow_dev_page() to grow_dev_folio() and make it return a bool. Document what that bool means; it's more subtle than it first appears. Also rename the 'failed' label to 'unlock' beacuse it's not exactly 'failed'. It just hasn't succeeded. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Pankaj Raghav <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	Merge tag 'gpio-fixes-for-v6.7-rc8' of ↵	Linus Torvalds	1	-3/+11
	git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - Andy steps down as GPIO reviewer - Kent becomes a reviewer for GPIO uAPI - add missing intel file to the relevant MAINTAINERS section * tag 'gpio-fixes-for-v6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: MAINTAINERS: Add a missing file to the INTEL GPIO section MAINTAINERS: Remove Andy from GPIO maintainers MAINTAINERS: split out the uAPI into a new section
2023-12-29	Merge tag 'platform-drivers-x86-v6.7-6' of ↵	Linus Torvalds	7	-68/+176
	git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Ilpo Järvinen: - Intel PMC GBE LTR regression - P2SB / PCI deadlock fix * tag 'platform-drivers-x86-v6.7-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86/intel/pmc: Move GBE LTR ignore to suspend callback platform/x86/intel/pmc: Allow reenabling LTRs platform/x86/intel/pmc: Add suspend callback platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe
2023-12-29	Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux	Linus Torvalds	2	-3/+5
	Pull block fixes from Jens Axboe: "Fix for a badly numbered flag, and a regression fix for the badblocks updates from this merge window" * tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux: block: renumber QUEUE_FLAG_HW_WC badblocks: avoid checking invalid range in badblocks_check()
2023-12-29	mailmap: add entries for Mathieu Othacehe	Mathieu Othacehe	1	-1/+1
	Add my gnu.org mail address. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mathieu Othacehe <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Heiko Stuebner <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Konrad Dybcio <[email protected]> Cc: Matthieu Baerts <[email protected]> Cc: Matt Ranostay <[email protected]> Cc: Oleksij Rempel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	MAINTAINERS: change vmware.com addresses to broadcom.com	Zack Rusin	2	-5/+5
	Update the email addresses for vmwgfx and vmmouse to reflect the fact that VMware is now part of Broadcom. Add a .mailmap entry because the vmware.com address will start bouncing soon. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zack Rusin <[email protected]> Acked-by: Florian Fainelli <[email protected]> Cc: Ian Forbes <[email protected]> Cc: Martin Krastev <[email protected]> Cc: Maaz Mombasawala <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-29	arch/mm/fault: fix major fault accounting when retrying under per-VMA lock	Suren Baghdasaryan	5	-0/+11
	A test [1] in Android test suite started failing after [2] was merged. It turns out that after handling a major fault under per-VMA lock, the process major fault counter does not register that fault as major. Before [2] read faults would be done under mmap_lock, in which case FAULT_FLAG_TRIED flag is set before retrying. That in turn causes mm_account_fault() to account the fault as major once retry completes. With per-VMA locks we often retry because a fault can't be handled without locking the whole mm using mmap_lock. Therefore such retries do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can now handle read major faults under per-VMA lock and upon retry the fact there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally we would use an additional VM_FAULT bit to indicate the reason for the retry (could not handle under per-VMA lock vs other reason) but this simpler solution seems to work, so keeping it simple. [1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp [2] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>