blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2023-08-24	mm: convert do_set_pte() to set_pte_range()	Yin Fengwei	2	-15/+25
	set_pte_range() allows to setup page table entries for a specific range. It takes advantage of batched rmap update for large folio. It now takes care of calling update_mmu_cache_range(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yin Fengwei <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	rmap: add folio_add_file_rmap_range()	Yin Fengwei	1	-14/+46
	folio_add_file_rmap_range() allows to add pte mapping to a specific range of file folio. Comparing to page_add_file_rmap(), it batched updates __lruvec_stat for large folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yin Fengwei <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	filemap: add filemap_map_folio_range()	Yin Fengwei	1	-54/+55
	filemap_map_folio_range() maps partial/full folio. Comparing to original filemap_map_pages(), it updates refcount once per folio instead of per page and gets minor performance improvement for large folio. With a will-it-scale.page_fault3 like app (change file write fault testing to read fault testing. Trying to upstream it to will-it-scale at [1]), got 2% performance gain on a 48C/96T Cascade Lake test box with 96 processes running against xfs. [1]: https://github.com/antonblanchard/will-it-scale/pull/37 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yin Fengwei <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: use flush_icache_pages() in do_set_pmd()	Matthew Wilcox (Oracle)	1	-3/+1
	Push the iteration over each page down to the architectures (many can flush the entire THP without iteration). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO	Matthew Wilcox (Oracle)	1	-1/+1
	Current best practice is to reuse the name of the function as a define to indicate that the function is implemented by the architecture. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: convert page_table_check_pte_set() to page_table_check_ptes_set()	Matthew Wilcox (Oracle)	1	-7/+9
	Tell the page table check how many PTEs & PFNs we want it to check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: memcg: use rstat for non-hierarchical stats	Yosry Ahmed	2	-29/+39
	Currently, memcg uses rstat to maintain aggregated hierarchical stats. Counters are maintained for hierarchical stats at each memcg. Rstat tracks which cgroups have updates on which cpus to keep those counters fresh on the read-side. Non-hierarchical stats are currently not covered by rstat. Their per-cpu counters are summed up on every read, which is expensive. The original implementation did the same. At some point before rstat, non-hierarchical aggregated counters were introduced by commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"). However, those counters were updated on the performance critical write-side, which caused regressions, so they were later removed by commit 815744d75152 ("mm: memcontrol: don't batch updates of local VM stats and events"). See [1] for more detailed history. Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a half) enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1. When moving to more recent kernels, a performance regression for reading non-hierarchical stats is observed. Now that we have rstat, we know exactly which percpu counters have updates for each stat. We can maintain non-hierarchical counters again, making reads much more efficient, without affecting the performance critical write-side. Hence, add non-hierarchical (i.e local) counters for the stats, and extend rstat flushing to keep those up-to-date. A caveat is that we now need a stats flush before reading local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or memcg_events_local(), where we previously only needed a flush to read hierarchical stats. Most contexts reading non-hierarchical stats are already doing a flush, add a flush to the only missing context in count_shadow_nodes(). With this patch, reading memory.stat from 1000 memcgs is 3x faster on a machine with 256 cpus on cgroup v1: # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done # time cat /sys/fs/cgroup/memory/cg/memory.stat > /dev/null real 0m0.125s user 0m0.005s sys 0m0.120s After: real 0m0.032s user 0m0.005s sys 0m0.027s To make sure there are no regressions on cgroup v2, I ran an artificial reclaim/refault stress test [2] that creates (NR_CPUS 2) cgroups, assigns them limits, runs a worker process in each cgroup that allocates tmpfs memory equal to quadruple the limit (to invoke reclaim continuously), and then reads back the entire file (to invoke refaults). All workers are run in parallel, and zram is used as a swapping backend. Both reclaim and refault have conditional stats flushing. I ran this on a machine with 112 cpus, once on mm-unstable, and once on mm-unstable with this patch reverted. (1) A few runs without this patch: # time ./stress_reclaim_refault.sh real 0m9.949s user 0m0.496s sys 14m44.974s # time ./stress_reclaim_refault.sh real 0m10.049s user 0m0.486s sys 14m55.791s # time ./stress_reclaim_refault.sh real 0m9.984s user 0m0.481s sys 14m53.841s (2) A few runs with this patch: # time ./stress_reclaim_refault.sh real 0m9.885s user 0m0.486s sys 14m48.753s # time ./stress_reclaim_refault.sh real 0m9.903s user 0m0.495s sys 14m48.339s # time ./stress_reclaim_refault.sh real 0m9.861s user 0m0.507s sys 14m49.317s No regressions are observed with this patch. There is actually a very slight improvement. If I have to guess, maybe it's because we avoid the percpu loop in count_shadow_nodes() when calling lruvec_page_state_local(), but I could not prove this using perf, it's probably in the noise. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: handle userfaults under VMA lock	Suren Baghdasaryan	1	-8/+1
	Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [[email protected]: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: handle swap page faults under per-VMA lock	Suren Baghdasaryan	2	-15/+18
	When page fault is handled under per-VMA lock protection, all swap page faults are retried with mmap_lock because folio_lock_or_retry has to drop and reacquire mmap_lock if folio could not be immediately locked. Follow the same pattern as mmap_lock to drop per-VMA lock when waiting for folio and retrying once folio is available. With this obstacle removed, enable do_swap_page to operate under per-VMA lock protection. Drivers implementing ops->migrate_to_ram might still rely on mmap_lock, therefore we have to fall back to mmap_lock in that particular case. Note that the only time do_swap_page calls synchronous swap_readpage is when SWP_SYNCHRONOUS_IO is set, which is only set for QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem). Therefore we don't sleep in this path, and there's no need to drop the mmap or per-VMA lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Tested-by: Alistair Popple <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: change folio_lock_or_retry to use vm_fault directly	Suren Baghdasaryan	2	-18/+18
	Change folio_lock_or_retry to accept vm_fault struct and return the vm_fault_t directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED	Suren Baghdasaryan	1	-0/+12
	handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means mmap_lock has been released. However with per-VMA locks behavior is different and the caller should still release it. To make the rules consistent for the caller, drop the per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED. Currently the only path returning VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns VM_FAULT_COMPLETED for now. [[email protected]: fix riscv] Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Xu <[email protected]> Tested-by: Conor Dooley <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	swap: remove remnants of polling from read_swap_cache_async	Suren Baghdasaryan	3	-10/+7
	Patch series "Per-VMA lock support for swap and userfaults", v7. When per-VMA locks were introduced in [1] several types of page faults would still fall back to mmap_lock to keep the patchset simple. Among them are swap and userfault pages. The main reason for skipping those cases was the fact that mmap_lock could be dropped while handling these faults and that required additional logic to be implemented. Implement the mechanism to allow per-VMA locks to be dropped for these cases. First, change handle_mm_fault to drop per-VMA locks when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault and return vm_fault_t which simplifies later patches. Finally allow swap and uffd page faults to be handled under per-VMA locks by dropping per-VMA and retrying, the same way it's done under mmap_lock. Naturally, once VMA lock is dropped that VMA should be assumed unstable and can't be used. This patch (of 6): Commit [1] introduced IO polling support duding swapin to reduce swap read latency for block devices that can be polled. However later commit [2] removed polling support. Therefore it seems safe to remove do_poll parameter in read_swap_cache_async and always call swap_readpage with synchronous=false waiting for IO completion in folio_lock_or_retry. [1] commit 23955622ff8d ("swap: add block io poll in swapin path") [2] commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio") Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm: memory-failure: fix potential page refcnt leak in memory_failure()	Miaohe Lin	1	-2/+1
	put_ref_page() is not called to drop extra refcnt when comes from madvise in the case pfn is valid but pgmap is NULL leading to page refcnt leak. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1e8aaedb182d ("mm,memory_failure: always pin the page in madvise_inject_error") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm/memory.c: fix mismerge	Matthew Wilcox	1	-1/+1
	Fix a build issue. Link: https://lkml.kernel.org/r/ZNerqcNS4EBJA/[email protected] Fixes: 4aaa60dad4d1 ("mm: allow per-VMA locks on file-backed VMAs") Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd	Hugh Dickins	1	-9/+29
	Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp() thought it had emptied: page lock on the huge page is enough to protect against WP faults (which find the PTE has been cleared), but not enough to protect against userfaultfd. "BUG: Bad rss-counter state" followed. retract_page_tables() protects against this by checking !vma->anon_vma; but we know that MADV_COLLAPSE needs to be able to work on private shmem mappings, even those with an anon_vma prepared for another part of the mapping; and we know that MADV_COLLAPSE needs to work on shared shmem mappings which are userfaultfd_armed(). Whether it needs to work on private shmem mappings which are userfaultfd_armed(), I'm not so sure: but assume that it does. Just for this case, take the pmd_lock() two steps earlier: not because it gives any protection against this case itself, but because ptlock nests inside it, and it's the dropping of ptlock which let the bug in. In other cases, continue to minimize the pmd_lock() hold time. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()") Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Jann Horn <[email protected]> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/ Acked-by: Peter Xu <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	hugetlb: clear flags in tail pages that will be freed individually	Mike Kravetz	1	-10/+1
	hugetlb manually creates and destroys compound pages. As such it makes assumptions about struct page layout. Commit ebc1baf5c9b4 ("mm: free up a word in the first tail page") breaks hugetlb. The following will fix the breakage. Link: https://lkml.kernel.org/r/20230822231741.GC4509@monkey Fixes: ebc1baf5c9b4 ("mm: free up a word in the first tail page") Signed-off-by: Mike Kravetz <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes	Andrew Morton	3	-6/+8

2023-08-24	shmem: fix smaps BUG sleeping while atomic	Hugh Dickins	1	-2/+4
	smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if need_resched(): "BUG: sleeping function called from invalid context". Since shmem_partial_swap_usage() is designed to count across a range, but smaps_pte_hole_lookup() only calls it for a single page slot, just break out of the loop on the last or only page, before checking need_resched(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 230100321518 ("mm/smaps: simplify shmem handling of pte holes") Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: <[email protected]> [5.16+] Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	madvise:madvise_free_pte_range(): don't use mapcount() against large folio ↵	Yin Fengwei	1	-1/+1
	for sharing check Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping. It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here. User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then. NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready. Link: https://lkml.kernel.org/r/[email protected] Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Signed-off-by: Yin Fengwei <[email protected]> Reviewed-by: Yu Zhao <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio ↵	Yin Fengwei	1	-1/+1
	for sharing check Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping. It's not correct for large folios. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here. User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then. NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready. Link: https://lkml.kernel.org/r/[email protected] Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") Signed-off-by: Yin Fengwei <[email protected]> Reviewed-by: Yu Zhao <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against ↵	Yin Fengwei	1	-2/+2
	large folio for sharing check Patch series "don't use mapcount() to check large folio sharing", v2. In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns total mapcount of large folio. Use folio_estimated_sharers() here as the estimated number is enough. This patchset will fix the cases: User space application call madvise() with MADV_FREE, MADV_COLD and MADV_PAGEOUT for specific address range. There are THP mapped to the range. Without the patchset, the THP is skipped. With the patch, the THP will be split and handled accordingly. David reported the cow self test skip some cases because of MADV_PAGEOUT skip THP: https://lore.kernel.org/linux-mm/[email protected]/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81 and I confirmed this patchset make it work again. This patch (of 3): Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by other mapping. It's not correct for large folio. folio_mapcount() returns the total mapcount of large folio which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers() which returns a estimated number of shares. That means it's not 100% correct. It should be OK for madvise case here. User-visible effects is that the THP is skipped when user call madvise. But the correct behavior is THP should be split and processed then. NOTE: this change is a temporary fix to reduce the user-visible effects before the long term fix from David is ready. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") Signed-off-by: Yin Fengwei <[email protected]> Reviewed-by: Yu Zhao <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski	1	-4/+1
	Cross-merge networking fixes after downstream PR. Conflicts: include/net/inet_sock.h f866fbc842de ("ipv4: fix data-races around inet->inet_id") c274af224269 ("inet: introduce inet->inet_flags") https://lore.kernel.org/all/[email protected]/ Adjacent changes: drivers/net/bonding/bond_alb.c e74216b8def3 ("bonding: fix macvlan over alb bond support") f11e5bd159b0 ("bonding: support balance-alb with openvswitch") drivers/net/ethernet/broadcom/bgmac.c d6499f0b7c7c ("net: bgmac: Return PTR_ERR() for fixed_phy_register()") 23a14488ea58 ("net: bgmac: Fix return value check for fixed_phy_register()") drivers/net/ethernet/broadcom/genet/bcmmii.c 32bbe64a1386 ("net: bcmgenet: Fix return value check for fixed_phy_register()") acf50d1adbf4 ("net: bcmgenet: Return PTR_ERR() for fixed_phy_register()") net/sctp/socket.c f866fbc842de ("ipv4: fix data-races around inet->inet_id") b09bde5c3554 ("inet: move inet->mc_loop to inet->inet_frags") Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-22	tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs	Hugh Dickins	1	-1/+1
	It is particularly important for the userns mount case (when a sensible nr_inodes maximum may not be enforced) that tmpfs user xattrs be subject to memory cgroup limiting. Leave temporary buffer allocations as is, but change the persistent simple xattr allocations from GFP_KERNEL to GFP_KERNEL_ACCOUNT. This limits kernfs's cgroupfs too, but that's good. (I had intended to send this change earlier, but had been confused by shmem_alloc_inode() using GFP_KERNEL, and thought a discussion would be needed to change that too: no, I was forgetting the SLAB_ACCOUNT on that kmem_cache, which implicitly adds __GFP_ACCOUNT to all its allocations.) Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-22	parisc: Use generic mmap top-down layout and brk randomization	Helge Deller	1	-1/+4
	parisc uses a top-down layout by default that exactly fits the generic functions, so get rid of arch specific code and use the generic version by selecting ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. Note that on parisc the stack always grows up and a "unlimited stack" simply means that the value as defined in CONFIG_STACK_MAX_DEFAULT_SIZE_MB should be used. So RLIM_INFINITY is not an indicator to use the legacy memory layout. Signed-off-by: Helge Deller <[email protected]>
2023-08-21	mm: convert split_huge_pages_pid() to use a folio	Matthew Wilcox (Oracle)	1	-11/+10
	Replaces five calls to compound_head with one. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: remove folio_test_transhuge()	Matthew Wilcox (Oracle)	1	-1/+1
	This function is misleading; people think it means "Is this a THP", when all it actually does is check whether this is a large folio. Remove it; the one remaining user should have been checking to see whether the folio is PMD sized or not. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: free up a word in the first tail page	Matthew Wilcox (Oracle)	1	-1/+1
	Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: add large_rmappable page flag	Matthew Wilcox (Oracle)	3	-9/+3
	Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: remove HUGETLB_PAGE_DTOR	Matthew Wilcox (Oracle)	2	-44/+7
	We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: remove free_compound_page() and the compound_page_dtors array	Matthew Wilcox (Oracle)	1	-19/+5
	The only remaining destructor is free_compound_page(). Inline it into destroy_large_folio() and remove the array it used to live in. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: convert prep_transhuge_page() to folio_prep_large_rmappable()	Matthew Wilcox (Oracle)	4	-14/+14
	Match folio_undo_large_rmappable(), and move the casting from page to folio into the callers (which they were largely doing anyway). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: convert free_transhuge_folio() to folio_undo_large_rmappable()	Matthew Wilcox (Oracle)	3	-14/+19
	Indirect calls are expensive, thanks to Spectre. Test for TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the free_compound_page() call into destroy_large_folio() to simplify later patches. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: convert free_huge_page() to free_huge_folio()	Matthew Wilcox (Oracle)	2	-26/+24
	Pass a folio instead of the head page to save a few instructions. Update the documentation, at least in English. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: call free_huge_page() directly	Matthew Wilcox (Oracle)	1	-3/+5
	Indirect calls are expensive, thanks to Spectre. Call free_huge_page() directly if the folio belongs to hugetlb. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT	David Hildenbrand	1	-7/+0
	Commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast") from 2012 documented as the primary reason why we would want to handle NUMA hinting faults from GUP: KVM secondary MMU page faults will trigger the NUMA hinting page faults through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault. That is still the case today, and relevant KVM code has been converted to manually set FOLL_HONOR_NUMA_FAULT. So let's stop setting FOLL_HONOR_NUMA_FAULT for all GUP users and cross fingers that not that many other ones that really require such handling for autonuma remain. Possible interaction with MMU notifiers: Assume a driver obtains a page using get_user_pages() to map it into a secondary MMU, and uses the MMU notifier framework to get notified on changes. Assume get_user_pages() succeeded on a PROT_NONE-mapped page (because FOLL_HONOR_NUMA_FAULT is not set) in an accessible VMA and the page is mapped into a secondary MMU. Once user space would turn that mapping inaccessible using mprotect(PROT_NONE), the actual PTE in the page table might not change. If the MMU notifier would be smart and optimize for that case "why notify if the PTE didn't change", that could be problematic. At least change_pmd_range() with MMU_NOTIFY_PROTECTION_VMA for now does an unconditional mmu_notifier_invalidate_range_start() -> mmu_notifier_invalidate_range_end() and should be fine. Note that even if a PTE in an accessible VMA is pte_protnone(), the underlying page might be accessed by a secondary MMU that does not set FOLL_HONOR_NUMA_FAULT, and test_young() MMU notifiers would return "true". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: liubo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes	Andrew Morton	22	-86/+226

2023-08-21	lib/vsprintf: declare no_hash_pointers in sprintf.h	Andy Shevchenko	1	-2/+1
	Sparse is not happy to see non-static variable without declaration: lib/vsprintf.c:61:6: warning: symbol 'no_hash_pointers' was not declared. Should it be static? Declare respective variable in the sprintf.h. With this, add a comment to discourage its use if no real need. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Marco Elver <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	Rename kmemleak_initialized to kmemleak_late_initialized	Xiaolei Wang	1	-4/+4
	The old name is confusing because it implies the completion of earlier kmemleak_init(), the new name update to kmemleak_late_initial represents the completion of kmemleak_late_init(). No functional changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xiaolei Wang <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zhaoyang Huang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/kmemleak: use object_cache instead of kmemleak_initialized to check in ↵	Xiaolei Wang	1	-1/+6
	set_track_prepare() Patch series "mm/kmemleak: use object_cache instead of kmemleak_initialized", v3. Use object_cache instead of kmemleak_initialized to check in set_track_prepare(), so that memory leaks after kmemleak_init() can be recorded and Rename kmemleak_initialized to kmemleak_late_initialized unreferenced object 0xc674ca80 (size 64): comm "swapper/0", pid 1, jiffies 4294938337 (age 204.880s) hex dump (first 32 bytes): 80 55 75 c6 80 54 75 c6 00 55 75 c6 80 52 75 c6 .Uu..Tu..Uu..Ru. 00 53 75 c6 00 00 00 00 00 00 00 00 00 00 00 00 .Su.......... This patch (of 2): kmemleak_initialized is set in kmemleak_late_init(), which also means that there is no call trace which object's memory leak is before kmemleak_late_init(), so use object_cache instead of kmemleak_initialized to check in set_track_prepare() to avoid no call trace records when there is a memory leak in the code between kmemleak_init() and kmemleak_late_init(). unreferenced object 0xc674ca80 (size 64): comm "swapper/0", pid 1, jiffies 4294938337 (age 204.880s) hex dump (first 32 bytes): 80 55 75 c6 80 54 75 c6 00 55 75 c6 80 52 75 c6 .Uu..Tu..Uu..Ru. 00 53 75 c6 00 00 00 00 00 00 00 00 00 00 00 00 .Su.......... Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 56a61617dd22 ("mm: use stack_depot for recording kmemleak's backtrace") Signed-off-by: Xiaolei Wang <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zhaoyang Huang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/ksm: add pages scanned metric	Stefan Roesch	1	-1/+15
	ksm currently maintains several statistics, which let you determine how successful KSM is at sharing pages. However it does not contain a metric to determine how much work it does. This commit adds the pages scanned metric. This allows the administrator to determine how many pages have been scanned over a period of time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: allow fault_dirty_shared_page() to be called under the VMA lock	Matthew Wilcox (Oracle)	1	-1/+1
	By making maybe_unlock_mmap_for_io() handle the VMA lock correctly, we make fault_dirty_shared_page() safe to be called without the mmap lock held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reported-by: David Hildenbrand <[email protected]> Tested-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/secretmem: use a folio in secretmem_fault()	ZhangPeng	1	-6/+8
	Saves four implicit call to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: ZhangPeng <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Nanyong Sun <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm,thp: no space after colon in Mem-Info fields	Hugh Dickins	1	-3/+3
	Patch series "mm,thp: fix sloppy text output". Three independent trivial patches, fixing sloppy text output which has annoyed me; but might risk surprising a parser, so any can be dropped. This patch (of 3): The SysRq-m or OOM Mem-Info dmesg showed (long lines containing) ... shmem:NkB shmem_thp: NkB shmem_pmdmapped: NkB anon_thp: NkB ... Delete the space after the colon after shmem_thp, shmem_pmdmapped, anon_thp: as the shmem example shows, no other fields have a space after the colon in this output. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	Aleksa Sarai	1	-1/+2
	This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it. The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2): % cat mount-memfd.c #include <fcntl.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/mount.h> #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" int main(void) { int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); assert(fsfd >= 0); assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); assert(dfd >= 0); int execfd = openat(dfd, "exe", O_CREAT \| O_RDWR \| O_CLOEXEC, 0782); assert(execfd >= 0); assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); assert(!close(execfd)); char execpath = NULL; char argv[] = { "bad-exe", NULL }, envp[] = { NULL }; execfd = openat(dfd, "exe", O_PATH \| O_CLOEXEC); assert(execfd >= 0); assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); assert(!execve(execpath, argv, envp)); } % ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 % Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache to admins when there doesn't appear to be an actual security benefit to blocking this. There appear to be concerns about confused-deputy-esque attacks[2] but a confused deputy that can write to arbitrary sysctls is a bigger security issue than executable memfds. / New API / The primary requirement from the original author appears to be more based on the need to be able to restrict an entire system in a hierarchical manner[3], such that child namespaces cannot re-enable executable memfds. So, implement that behaviour explicitly -- the vm.memfd_noexec scope is evaluated up the pidns tree to &init_pid_ns and you have the most restrictive value applied to you. The new lower limit you can set vm.memfd_noexec is whatever limit applies to your parent. Note that a pidns will inherit a copy of the parent pidns's effective vm.memfd_noexec setting at unshare() time. This matches the existing behaviour, and it also ensures that a pidns will never have its vm.memfd_noexec setting lowered* behind its back (but it will be raised if the parent raises theirs). /* Backwards Compatibility / As the previous version of the sysctl didn't allow you to lower the setting at all, there are no backwards compatibility issues with this aspect of the change. However it should be noted that now that the setting is completely hierarchical. Previously, a cloned pidns would just copy the current pidns setting, meaning that if the parent's vm.memfd_noexec was changed it wouldn't propoagate to existing pid namespaces. Now, the restriction applies recursively. This is a uAPI change, however: The sysctl is very new, having been merged in 6.3. * Several aspects of the sysctl were broken up until this patchset and the other patchset by Jeff Xu last month. And thus it seems incredibly unlikely that any real users would run into this issue. In the worst case, if this causes userspace isues we could make it so that modifying the setting follows the hierarchical rules but the restriction checking uses the cached copy. [1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/ [2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/ [3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <[email protected]> Cc: Dominique Martinet <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	memfd: improve userspace warnings for missing exec-related flags	Aleksa Sarai	1	-1/+1
	In order to incentivise userspace to switch to passing MFD_EXEC and MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call memfd_create() without the new flags. pr_warn_once() is not useful because on most systems the one warning is burned up during the boot process (on my system, systemd does this within the first second of boot) and thus userspace will in practice never see the warnings to push them to switch to the new flags. The original patchset[1] used pr_warn_ratelimited(), however there were concerns about the degree of spam in the kernel log[2,3]. The resulting inability to detect every case was flagged as an issue at the time[4]. While we could come up with an alternative rate-limiting scheme such as only outputting the message if vm.memfd_noexec has been modified, or only outputting the message once for a given task, these alternatives have downsides that don't make sense given how low-stakes a single kernel warning message is. Switching to pr_info_ratelimited() instead should be fine -- it's possible some monitoring tool will be unhappy with a stream of warning-level messages but there's already plenty of info-level message spam in dmesg. [1]: https://lore.kernel.org/[email protected]/ [2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/ [3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/ [4]: https://lore.kernel.org/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: Dominique Martinet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2	Aleksa Sarai	1	-19/+11
	Given the difficulty of auditing all of userspace to figure out whether every memfd_create() user has switched to passing MFD_EXEC and MFD_NOEXEC_SEAL flags, it seems far less distruptive to make it possible for older programs that don't make use of executable memfds to run under vm.memfd_noexec=2. Otherwise, a small dependency change can result in spurious errors. For programs that don't use executable memfds, passing MFD_NOEXEC_SEAL is functionally a no-op and thus having the same In addition, every failure under vm.memfd_noexec=2 needs to print to the kernel log so that userspace can figure out where the error came from. The concerns about pr_warn_ratelimited() spam that caused the switch to pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case. This is a user-visible API change, but as it allows programs to do something that would be blocked before, and the sysctl itself was broken and recently released, it seems unlikely this will cause any issues. [1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/ [2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <[email protected]> Cc: Dominique Martinet <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: convert ptlock_free() to use ptdescs	Vishal Moola (Oracle)	1	-2/+2
	This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm: convert ptlock_alloc() to use ptdescs	Vishal Moola (Oracle)	1	-2/+2
	This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/z3fold: remove obsolete comment for struct z3fold_pool	Xiu Jianfeng	1	-2/+0
	Since commit e774a7bc7f0a ("mm: zswap: remove page reclaim logic from z3fold"), zpool and zpool_ops have been removed, so also remove the corresponding comments. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Xiu Jianfeng <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21	mm/page_alloc: use get_pfnblock_migratetype to avoid extra page_to_pfn	Kemeng Shi	1	-2/+2
	We have get_pageblock_migratetype and get_pfnblock_migratetype to get migratetype of page. get_pfnblock_migratetype accepts both page and pfn from caller while get_pageblock_migratetype only accept page and get pfn with page_to_pfn from page. In case we already record pfn of page, we can simply call get_pfnblock_migratetype to avoid a page_to_pfn. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Baolin Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>