aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-02-26mm: cma: print region name on failurePatrick Daly1-2/+2
Print the name of the CMA region for convenience. This is useful information to have when cma_alloc() fails. [[email protected]: print the "count" variable] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Patrick Daly <[email protected]> Signed-off-by: Georgi Djakov <[email protected]> Acked-by: Minchan Kim <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfoDavid Hildenbrand3-2/+20
Let's count the number of CMA pages per zone and print them in /proc/zoneinfo. Having access to the total number of CMA pages per zone is helpful for debugging purposes to know where exactly the CMA pages ended up, and to figure out how many pages of a zone might behave differently, even after some of these pages might already have been allocated. As one example, CMA pages part of a kernel zone cannot be used for ordinary kernel allocations but instead behave more like ZONE_MOVABLE. For now, we are only able to get the global nr+free cma pages from /proc/meminfo and the free cma pages per zone from /proc/zoneinfo. Example after this patch when booting a 6 GiB QEMU VM with "hugetlb_cma=2G": # cat /proc/zoneinfo | grep cma cma 0 nr_free_cma 0 cma 0 nr_free_cma 0 cma 524288 nr_free_cma 493016 cma 0 cma 0 # cat /proc/meminfo | grep Cma CmaTotal: 2097152 kB CmaFree: 1972064 kB Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way, one can be sure when spotting "cma 0", that there are definetly no CMA pages located in a zone. [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Yang <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/cma: expose all pages to the buddy if activation of an area failsDavid Hildenbrand1-22/+21
Right now, if activation fails, we might already have exposed some pages to the buddy for CMA use (although they will never get actually used by CMA), and some pages won't be exposed to the buddy at all. Let's check for "single zone" early and on error, don't expose any pages for CMA use - instead, expose them to the buddy available for any use. Simply call free_reserved_page() on every single page - easier than going via free_reserved_area(), converting back and forth between pfns and virt addresses. In addition, make sure to fixup totalcma_pages properly. Example: 6 GiB QEMU VM with "... hugetlb_cma=2G movablecore=20% ...": [ 0.006891] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node [ 0.006893] cma: Reserved 2048 MiB at 0x0000000100000000 [ 0.006893] hugetlb_cma: reserved 2048 MiB on node 0 ... [ 0.175433] cma: CMA area hugetlb0 could not be activated Before this patch: # cat /proc/meminfo MemTotal: 5867348 kB MemFree: 5692808 kB MemAvailable: 5542516 kB ... CmaTotal: 2097152 kB CmaFree: 1884160 kB After this patch: # cat /proc/meminfo MemTotal: 6077308 kB MemFree: 5904208 kB MemAvailable: 5747968 kB ... CmaTotal: 0 kB CmaFree: 0 kB Note: cma_init_reserved_mem() makes sure that we always cover full pageblocks / MAX_ORDER - 1 pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: cma: allocate cma areas bottom-upRoman Gushchin1-0/+17
Currently cma areas without a fixed base are allocated close to the end of the node. This placement is sub-optimal because of compaction: it brings pages into the cma area. In particular, it can bring in hot executable pages, even if there is a plenty of free memory on the machine. This results in cma allocation failures. Instead let's place cma areas close to the beginning of a node. In this case the compaction will help to free cma areas, resulting in better cma allocation success rates. If there is enough memory let's try to allocate bottom-up starting with 4GB to exclude any possible interference with DMA32. On smaller machines or in a case of a failure, stick with the old behavior. 16GB vm, 2GB cma area: With this patch: [ 0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G [ 0.002928] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node [ 0.002930] cma: Reserved 2048 MiB at 0x0000000100000000 [ 0.002931] hugetlb_cma: reserved 2048 MiB on node 0 Without this patch: [ 0.000000] Command line: root=/dev/vda3 rootflags=subvol=/root systemd.unified_cgroup_hierarchy=1 enforcing=0 console=ttyS0,115200 hugetlb_cma=2G [ 0.002930] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node [ 0.002933] cma: Reserved 2048 MiB at 0x00000003c0000000 [ 0.002934] hugetlb_cma: reserved 2048 MiB on node 0 v2: - switched to memblock_set_bottom_up(true), by Mike - start with 4GB, by Mike [[email protected]: whitespace fix, per Mike] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix 32-bit warnings] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix 32-bit systems] [[email protected]: build fix] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Wonhyuk Yang <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm,shmem,thp: limit shmem THP allocations to requested zonesRik van Riel1-1/+5
Hugh pointed out that the gma500 driver uses shmem pages, but needs to limit them to the DMA32 zone. Ensure the allocations resulting from the gfp_mask returned by limit_gfp_mask use the zone flags that were originally passed to shmem_getpage_gfp. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Suggested-by: Hugh Dickins <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xu Yu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm,thp,shmem: make khugepaged obey tmpfs mount flagsRik van Riel2-6/+18
Currently if thp enabled=[madvise], mounting a tmpfs filesystem with huge=always and mmapping files from that tmpfs does not result in khugepaged collapsing those mappings, despite the mount flag indicating that it should. Fix that by breaking up the blocks of tests in hugepage_vma_check a little bit, and testing things in the correct order. Link: https://lkml.kernel.org/r/[email protected] Fixes: c2231020ea7b ("mm: thp: register mm for khugepaged when merging vma for shmem") Signed-off-by: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xu Yu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm,thp,shm: limit gfp mask to no more than specifiedRik van Riel1-0/+21
Matthew Wilcox pointed out that the i915 driver opportunistically allocates tmpfs memory, but will happily reclaim some of its pool if no memory is available. Make sure the gfp mask used to opportunistically allocate a THP is always at least as restrictive as the original gfp mask. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xu Yu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm,thp,shmem: limit shmem THP alloc gfp_maskRik van Riel3-6/+10
Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6. The allocation flags of anonymous transparent huge pages can be controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can help the system from getting bogged down in the page reclaim and compaction code when many THPs are getting allocated simultaneously. However, the gfp_mask for shmem THP allocations were not limited by those configuration settings, and some workloads ended up with all CPUs stuck on the LRU lock in the page reclaim code, trying to allocate dozens of THPs simultaneously. This patch applies the same configurated limitation of THPs to shmem hugepage allocations, to prevent that from happening. This way a THP defrag setting of "never" or "defer+madvise" will result in quick allocation failures without direct reclaim when no 2MB free pages are available. With this patch applied, THP allocations for tmpfs will be a little more aggressive than today for files mmapped with MADV_HUGEPAGE, and a little less aggressive for files that are not mmapped or mapped without that flag. This patch (of 4): The allocation flags of anonymous transparent huge pages can be controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can help the system from getting bogged down in the page reclaim and compaction code when many THPs are getting allocated simultaneously. However, the gfp_mask for shmem THP allocations were not limited by those configuration settings, and some workloads ended up with all CPUs stuck on the LRU lock in the page reclaim code, trying to allocate dozens of THPs simultaneously. This patch applies the same configurated limitation of THPs to shmem hugepage allocations, to prevent that from happening. Controlling the gfp_mask of THP allocations through the knobs in sysfs allows users to determine the balance between how aggressively the system tries to allocate THPs at fault time, and how much the application may end up stalling attempting those allocations. This way a THP defrag setting of "never" or "defer+madvise" will result in quick allocation failures without direct reclaim when no 2MB free pages are available. With this patch applied, THP allocations for tmpfs will be a little more aggressive than today for files mmapped with MADV_HUGEPAGE, and a little less aggressive for files that are not mmapped or mapped without that flag. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rik van Riel <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Xu Yu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: remove pagevec_lookup_entriesMatthew Wilcox (Oracle)3-39/+4
pagevec_lookup_entries() is now just a wrapper around find_get_entries() so remove it and convert all its callers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: pass pvec directly to find_get_entriesMatthew Wilcox (Oracle)4-20/+13
All callers of find_get_entries() use a pvec, so pass it directly instead of manipulating it in the caller. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: remove nr_entries parameter from pagevec_lookup_entriesMatthew Wilcox (Oracle)3-6/+5
All callers want to fetch the full size of the pvec. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: add an 'end' parameter to pagevec_lookup_entriesMatthew Wilcox (Oracle)3-38/+16
Simplifies the callers and uses the existing functionality in find_get_entries(). We can also drop the final argument of truncate_exceptional_pvec_entries() and simplify the logic in that function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: add an 'end' parameter to find_get_entriesMatthew Wilcox (Oracle)4-15/+10
This simplifies the callers and leads to a more efficient implementation since the XArray has this functionality already. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: add and use find_lock_entriesMatthew Wilcox (Oracle)4-97/+78
We have three functions (shmem_undo_range(), truncate_inode_pages_range() and invalidate_mapping_pages()) which want exactly this function, so add it to filemap.c. Before this patch, shmem_undo_range() would split any compound page which overlaps either end of the range being punched in both the first and second loops through the address space. After this patch, that functionality is left for the second loop, which is arguably more appropriate since the first loop is supposed to run through all the pages quickly, and splitting a page can sleep. [[email protected]: add assertion] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26iomap: use mapping_seek_hole_dataMatthew Wilcox (Oracle)2-119/+43
Enhance mapping_seek_hole_data() to handle partially uptodate pages and convert the iomap seek code to call it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/filemap: add mapping_seek_hole_dataMatthew Wilcox (Oracle)3-70/+82
Rewrite shmem_seek_hole_data() and move it to filemap.c. [[email protected]: don't put an xa_is_value() page] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/filemap: add helper for finding pagesMatthew Wilcox (Oracle)1-55/+42
There is a lot of common code in find_get_entries(), find_get_pages_range() and find_get_pages_range_tag(). Factor out find_get_entry() which simplifies all three functions. [[email protected]: remove VM_BUG_ON_PAGE()] Link: https://lkml.kernel.org/r/[email protected]: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/filemap: rename find_get_entry to mapping_get_entryMatthew Wilcox (Oracle)1-3/+4
find_get_entry doesn't "find" anything. It returns the entry at a particular index. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: add FGP_ENTRYMatthew Wilcox (Oracle)5-41/+13
The functionality of find_lock_entry() and find_get_entry() can be provided by pagecache_get_page(), which lets us delete find_lock_entry() and make find_get_entry() static. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/swap: optimise get_shadow_from_swap_cacheMatthew Wilcox (Oracle)1-3/+1
There's no need to get a reference to the page, just load the entry and see if it's a shadow entry. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/shmem: use pagevec_lookup in shmem_unlock_mappingMatthew Wilcox (Oracle)1-10/+1
The comment shows that the reason for using find_get_entries() is now stale; find_get_pages() will not return 0 if it hits a consecutive run of swap entries, and I don't believe it has since 2011. pagevec_lookup() is a simpler function to use than find_get_pages(), so use it instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: make pagecache tagged lookups return only head pagesMatthew Wilcox (Oracle)1-5/+6
Patch series "Overhaul multi-page lookups for THP", v4. This THP prep patchset changes several page cache iteration APIs to only return head pages. - It's only possible to tag head pages in the page cache, so only return head pages, not all their subpages. - Factor a lot of common code out of the various batch lookup routines - Add mapping_seek_hole_data() - Unify find_get_entries() and pagevec_lookup_entries() - Make find_get_entries only return head pages, like find_get_entry(). These are only loosely connected, but they seem to make sense together as a series. This patch (of 14): Pagecache tags are used for dirty page writeback. Since dirtiness is tracked on a per-THP basis, we only want to return the head page rather than each subpage of a tagged page. All the filesystems which use huge pages today are in-memory, so there are no tagged huge pages today. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yang Shi <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-25Merge tag 'kbuild-v5.12' of ↵Linus Torvalds65-498/+698
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Fix false-positive build warnings for ARCH=ia64 builds - Optimize dictionary size for module compression with xz - Check the compiler and linker versions in Kconfig - Fix misuse of extra-y - Support DWARF v5 debug info - Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x exceeded the limit - Add generic syscall{tbl,hdr}.sh for cleanups across arches - Minor cleanups of genksyms - Minor cleanups of Kconfig * tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits) initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD kbuild: remove deprecated 'always' and 'hostprogs-y/m' kbuild: parse C= and M= before changing the working directory kbuild: reuse this-makefile to define abs_srctree kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig kconfig: omit --oldaskconfig option for 'make config' kconfig: fix 'invalid option' for help option kconfig: remove dead code in conf_askvalue() kconfig: clean up nested if-conditionals in check_conf() kconfig: Remove duplicate call to sym_get_string_value() Makefile: Remove # characters from compiler string Makefile: reuse CC_VERSION_TEXT kbuild: check the minimum linker version in Kconfig kbuild: remove ld-version macro scripts: add generic syscallhdr.sh scripts: add generic syscalltbl.sh arch: syscalls: remove $(srctree)/ prefix from syscall tables arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work gen_compile_commands: prune some directories kbuild: simplify access to the kernel's version ...
2021-02-25Merge tag 'ext4_for_linus' of ↵Linus Torvalds6-49/+59
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Miscellaneous ext4 cleanups and bug fixes. Pretty boring this cycle..." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add .kunitconfig fragment to enable ext4-specific tests ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it ext4: reset retry counter when ext4_alloc_file_blocks() makes progress ext4: fix potential htree index checksum corruption ext4: factor out htree rep invariant check ext4: Change list_for_each* to list_for_each_entry* ext4: don't try to processed freed blocks until mballoc is initialized ext4: use DEFINE_MUTEX() for mutex lock
2021-02-25Merge tag 'pci-v5.12-changes' of ↵Linus Torvalds73-781/+5543
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Remove unnecessary locking around _OSC (Bjorn Helgaas) - Clarify message about _OSC failure (Bjorn Helgaas) - Remove notification of PCIe bandwidth changes (Bjorn Helgaas) - Tidy checking of syscall user config accessors (Heiner Kallweit) Resource management: - Decline to resize resources if boot config must be preserved (Ard Biesheuvel) - Fix pci_register_io_range() memory leak (Geert Uytterhoeven) Error handling (Keith Busch): - Clear error status from the correct device - Retain error recovery status so drivers can use it after reset - Log the type of Port (Root or Switch Downstream) that we reset - Always request a reset for Downstream Ports in frozen state Endpoint framework and NTB (Kishon Vijay Abraham I): - Make *_get_first_free_bar() take into account 64 bit BAR - Add helper API to get the 'next' unreserved BAR - Make *_free_bar() return error codes on failure - Remove unused pci_epf_match_device() - Add support to associate secondary EPC with EPF - Add support in configfs to associate two EPCs with EPF - Add pci_epc_ops to map MSI IRQ - Add pci_epf_ops to expose function-specific attrs - Allow user to create sub-directory of 'EPF Device' directory - Implement ->msi_map_irq() ops for cadence - Configure LM_EP_FUNC_CFG based on epc->function_num_map for cadence - Add EP function driver to provide NTB functionality - Add support for EPF PCI Non-Transparent Bridge - Add specification for PCI NTB function device - Add PCI endpoint NTB function user guide - Add configfs binding documentation for pci-ntb endpoint function Broadcom STB PCIe controller driver: - Add support for BCM4908 and external PERST# signal controller (Rafał Miłecki) Cadence PCIe controller driver: - Retrain Link to work around Gen2 training defect (Nadeem Athani) - Fix merge botch in cdns_pcie_host_map_dma_ranges() (Krzysztof Wilczyński) Freescale Layerscape PCIe controller driver: - Add LX2160A rev2 EP mode support (Hou Zhiqiang) - Convert to builtin_platform_driver() (Michael Walle) MediaTek PCIe controller driver: - Fix OF node reference leak (Krzysztof Wilczyński) Microchip PolarFlare PCIe controller driver: - Add Microchip PolarFire PCIe controller driver (Daire McNamara) Qualcomm PCIe controller driver: - Use PHY_REFCLK_USE_PAD only for ipq8064 (Ansuel Smith) - Add support for ddrss_sf_tbu clock for sm8250 (Dmitry Baryshkov) Renesas R-Car PCIe controller driver: - Drop PCIE_RCAR config option (Lad Prabhakar) - Always allocate MSI addresses in 32bit space (Marek Vasut) Rockchip PCIe controller driver: - Add FriendlyARM NanoPi M4B DT binding (Chen-Yu Tsai) - Make 'ep-gpios' DT property optional (Chen-Yu Tsai) Synopsys DesignWare PCIe controller driver: - Work around ECRC configuration hardware defect (Vidya Sagar) - Drop support for config space in DT 'ranges' (Rob Herring) - Change size to u64 for EP outbound iATU (Shradha Todi) - Add upper limit address for outbound iATU (Shradha Todi) - Make dw_pcie ops optional (Jisheng Zhang) - Remove unnecessary dw_pcie_ops from al driver (Jisheng Zhang) Xilinx Versal CPM PCIe controller driver: - Fix OF node reference leak (Pan Bian) Miscellaneous: - Remove tango host controller driver (Arnd Bergmann) - Remove IRQ handler & data together (altera-msi, brcmstb, dwc) (Martin Kaiser) - Fix xgene-msi race in installing chained IRQ handler (Martin Kaiser) - Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy (Junhao He) - Fix pci-bridge-emul array overruns (Russell King) - Remove obsolete uses of WARN_ON(in_interrupt()) (Sebastian Andrzej Siewior)" * tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (69 commits) PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064 PCI: qcom: Add support for ddrss_sf_tbu clock dt-bindings: PCI: qcom: Document ddrss_sf_tbu clock for sm8250 PCI: al: Remove useless dw_pcie_ops PCI: dwc: Don't assume the ops in dw_pcie always exist PCI: dwc: Add upper limit address for outbound iATU PCI: dwc: Change size to u64 for EP outbound iATU PCI: dwc: Drop support for config space in 'ranges' PCI: layerscape: Convert to builtin_platform_driver() PCI: layerscape: Add LX2160A rev2 EP mode support dt-bindings: PCI: layerscape: Add LX2160A rev2 compatible strings PCI: dwc: Work around ECRC configuration issue PCI/portdrv: Report reset for frozen channel PCI/AER: Specify the type of Port that was reset PCI/ERR: Retain status from error notification PCI/AER: Clear AER status from Root Port when resetting Downstream Port PCI/ERR: Clear status of the reporting device dt-bindings: arm: rockchip: Add FriendlyARM NanoPi M4B PCI: rockchip: Make 'ep-gpios' DT property optional Documentation: PCI: Add PCI endpoint NTB function user guide ...
2021-02-25Merge tag 'nds32-for-linux-5.12' of ↵Linus Torvalds4-50/+5
git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux Pull nds32 updates from Greentime Hu: "Code clean-up and refinement" * tag 'nds32-for-linux-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux: nds32: Fix bogus reference to <asm/procinfo.h> nds32: use get_kernel_nofault in dump_mem nds32: remove dump_instr nds32: configs: Cleanup CONFIG_CROSS_COMPILE nds32: Replace <linux/clk-provider.h> by <linux/of_clk.h>
2021-02-25nds32: Fix bogus reference to <asm/procinfo.h>Geert Uytterhoeven1-1/+1
Andestech(nds32) never had <asm/procinfo.h>. Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Greentime Hu <[email protected]> Signed-off-by: Greentime Hu <[email protected]>
2021-02-25nds32: use get_kernel_nofault in dump_memChristoph Hellwig1-12/+3
Use the proper get_kernel_nofault helper to access an unsafe kernel pointer without faulting instead of playing with set_fs and get_user. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Nick Hu <[email protected]> Acked-by: Greentime Hu <[email protected]> Signed-off-by: Greentime Hu <[email protected]>
2021-02-25nds32: remove dump_instrChristoph Hellwig1-35/+0
dump_inst has a return before actually doing anything, so just drop the whole thing. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Nick Hu <[email protected]> Acked-by: Greentime Hu <[email protected]> Signed-off-by: Greentime Hu <[email protected]>
2021-02-25nds32: configs: Cleanup CONFIG_CROSS_COMPILEKrzysztof Kozlowski1-1/+0
CONFIG_CROSS_COMPILE is gone since commit f1089c92da79 ("kbuild: remove CONFIG_CROSS_COMPILE support"). Signed-off-by: Krzysztof Kozlowski <[email protected]> Acked-by: Greentime Hu <[email protected]> Signed-off-by: Greentime Hu <[email protected]>
2021-02-25nds32: Replace <linux/clk-provider.h> by <linux/of_clk.h>Geert Uytterhoeven1-1/+1
The Andes platform code is not a clock provider, and just needs to call of_clk_init(). Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>. Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Greentime Hu <[email protected]> Reviewed-by: Stephen Boyd <[email protected]> Signed-off-by: Greentime Hu <[email protected]>
2021-02-24Merge tag 'x86-entry-2021-02-24' of ↵Linus Torvalds37-196/+318
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq entry updates from Thomas Gleixner: "The irq stack switching was moved out of the ASM entry code in course of the entry code consolidation. It ended up being suboptimal in various ways. This reworks the X86 irq stack handling: - Make the stack switching inline so the stackpointer manipulation is not longer at an easy to find place. - Get rid of the unnecessary indirect call. - Avoid the double stack switching in interrupt return and reuse the interrupt stack for softirq handling. - A objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused about the stack pointer manipulation" * tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix stack-swizzle for FRAME_POINTER=y um: Enforce the usage of asm-generic/softirq_stack.h x86/softirq/64: Inline do_softirq_own_stack() softirq: Move do_softirq_own_stack() to generic asm header softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK x86/softirq: Remove indirection in do_softirq_own_stack() x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall x86/entry: Convert device interrupts to inline stack switching x86/entry: Convert system vectors to irq stack macro x86/irq: Provide macro for inlining irq stack switching x86/apic: Split out spurious handling code x86/irq/64: Adjust the per CPU irq stack pointer by 8 x86/irq: Sanitize irq stack tracking x86/entry: Fix instrumentation annotation
2021-02-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds117-1734/+2027
Merge misc updates from Andrew Morton: "A few small subsystems and some of MM. 172 patches. Subsystems affected by this patch series: hexagon, scripts, ntfs, ocfs2, vfs, and mm (slab-generic, slab, slub, debug, pagecache, swap, memcg, pagemap, mprotect, mremap, page-reporting, vmalloc, kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction, mempolicy, oom-kill, hugetlbfs, and migration)" * emailed patches from Andrew Morton <[email protected]>: (172 commits) mm/migrate: remove unneeded semicolons hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() hugetlbfs: fix some comment typos hugetlbfs: correct some obsolete comments about inode i_mutex hugetlbfs: make hugepage size conversion more readable hugetlbfs: remove meaningless variable avoid_reserve hugetlbfs: correct obsolete function name in hugetlbfs_read_iter() hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fs hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr() hugetlbfs: remove special hugetlbfs_set_page_dirty() mm/hugetlb: change hugetlb_reserve_pages() to type bool mm, oom: fix a comment in dump_task() mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk() numa balancing: migrate on fault among multiple bound nodes mm, compaction: make fast_isolate_freepages() stay within zone mm/compaction: fix misbehaviors of fast_find_migrateblock() mm/compaction: correct deferral logic for proactive compaction mm/compaction: remove duplicated VM_BUG_ON_PAGE !PageLocked mm/compaction: remove rcu_read_lock during page compaction z3fold: simplify the zhdr initialization code in init_z3fold_page() ...
2021-02-24mm/migrate: remove unneeded semicolonsChengyang Fan1-1/+1
Remove superfluous semicolons after function definitions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengyang Fan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: remove unneeded return value of hugetlb_vmtruncate()Miaohe Lin1-5/+2
The function hugetlb_vmtruncate() is guaranteed to always success since commit 7aa91e104028 ("hugetlb: allow extending ftruncate on hugetlbfs"). So we should remove the unneeded return value which is always 0. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: fix some comment typosMiaohe Lin1-4/+4
Fix typos reserv to reserve, minimim to minimum. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: correct some obsolete comments about inode i_mutexMiaohe Lin1-2/+2
Since commit 9902af79c01a ("parallel lookups: actual switch to rwsem"), i_mutex of inode is converted to i_rwsem. So replace i_mutex with i_rwsem to make comments up to date. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: make hugepage size conversion more readableMiaohe Lin1-2/+2
The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to (PAGE_SHIFT << (h->order)) >> 10. So we can make it more readable by replace it with huge_page_size(h) >> 10. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: remove meaningless variable avoid_reserveMiaohe Lin1-3/+9
The variable avoid_reserve is meaningless because we never changed its value and just passed it to alloc_huge_page(). So remove it to make code more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc hugepage yet. Also add a comment offered by Mike Kravetz to explain this. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: correct obsolete function name in hugetlbfs_read_iter()Miaohe Lin1-1/+1
Since commit 36e789144267 ("kill do_generic_mapping_read"), the function do_generic_mapping_read() is renamed to do_generic_file_read(). And then commit 47c27bc46946 ("fs: pass iocb to do_generic_file_read") renamed it to generic_file_buffered_read(). So replace do_generic_mapping_read() with generic_file_buffered_read() to keep comment uptodate. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: use helper macro default_hstate in init_hugetlbfs_fsMiaohe Lin1-1/+1
Since commit e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes"), we can use macro default_hstate to get the struct hstate which we use by default. But init_hugetlbfs_fs() forgot to use it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: remove useless BUG_ON(!inode) in hugetlbfs_setattr()Miaohe Lin1-2/+0
When we reach here with inode = NULL, we should have crashed as inode has already been dereferenced via hstate_inode. So this BUG_ON(!inode) does not take effect and should be removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24hugetlbfs: remove special hugetlbfs_set_page_dirty()Mike Kravetz1-12/+1
Matthew Wilcox noticed that hugetlbfs_set_page_dirty always returns 0. Instead, it should return 1 or 0 depending on the previous state of the dirty bit. In addition, the call to compound_head is redundant as it is also performed in calling routine set_page_dirty. Replace the hugetlbfs specific routine hugetlbfs_set_page_dirty with __set_page_dirty_no_writeback as it addresses both of these issues. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm/hugetlb: change hugetlb_reserve_pages() to type boolMike Kravetz3-26/+17
While reviewing a bug in hugetlb_reserve_pages, it was noticed that all callers ignore the return value. Any failure is considered an ENOMEM error by the callers. Change the function to be of type bool. The function will return true if the reservation was successful, false otherwise. Callers currently assume a zero return code indicates success. Change the callers to look for true to indicate success. No functional change, only code cleanup. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm, oom: fix a comment in dump_task()Tang Yizhou1-3/+2
If p is a kthread, it will be checked in oom_unkillable_task() so we can delete the corresponding comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tang Yizhou <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm/mempolicy: use helper range_in_vma() in queue_pages_test_walk()Miaohe Lin1-1/+1
The helper range_in_vma() is introduced via commit 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages"). But we forgot to use it in queue_pages_test_walk(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24numa balancing: migrate on fault among multiple bound nodesHuang Ying2-1/+19
Now, NUMA balancing can only optimize the page placement among the NUMA nodes if the default memory policy is used. Because the memory policy specified explicitly should take precedence. But this seems too strict in some situations. For example, on a system with 4 NUMA nodes, if the memory of an application is bound to the node 0 and 1, NUMA balancing can potentially migrate the pages between the node 0 and 1 to reduce cross-node accessing without breaking the explicit memory binding policy. So in this patch, we add MPOL_F_NUMA_BALANCING mode flag to set_mempolicy() when mode is MPOL_BIND. With the flag specified, NUMA balancing will be enabled within the thread to optimize the page placement within the constrains of the specified memory binding policy. With the newly added flag, the NUMA balancing control mechanism becomes, - sysctl knob numa_balancing can enable/disable the NUMA balancing globally. - even if sysctl numa_balancing is enabled, the NUMA balancing will be disabled for the memory areas or applications with the explicit memory policy by default. - MPOL_F_NUMA_BALANCING can be used to enable the NUMA balancing for the applications when specifying the explicit memory policy (MPOL_BIND). Various page placement optimization based on the NUMA balancing can be done with these flags. As the first step, in this patch, if the memory of the application is bound to multiple nodes (MPOL_BIND), and in the hint page fault handler the accessing node are in the policy nodemask, the page will be tried to be migrated to the accessing node to reduce the cross-node accessing. If the newly added MPOL_F_NUMA_BALANCING flag is specified by an application on an old kernel version without its support, set_mempolicy() will return -1 and errno will be set to EINVAL. The application can use this behavior to run on both old and new kernel versions. And if the MPOL_F_NUMA_BALANCING flag is specified for the mode other than MPOL_BIND, set_mempolicy() will return -1 and errno will be set to EINVAL as before. Because we don't support optimization based on the NUMA balancing for these modes. In the previous version of the patch, we tried to reuse MPOL_MF_LAZY for mbind(). But that flag is tied to MPOL_MF_MOVE.*, so it seems not a good API/ABI for the purpose of the patch. And because it's not clear whether it's necessary to enable NUMA balancing for a specific memory area inside an application, so we only add the flag at the thread level (set_mempolicy()) instead of the memory area level (mbind()). We can do that when it become necessary. To test the patch, we run a test case as follows on a 4-node machine with 192 GB memory (48 GB per node). 1. Change pmbench memory accessing benchmark to call set_mempolicy() to bind its memory to node 1 and 3 and enable NUMA balancing. Some related code snippets are as follows, #include <numaif.h> #include <numa.h> struct bitmask *bmp; int ret; bmp = numa_parse_nodestring("1,3"); ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING, bmp->maskp, bmp->size + 1); /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */ if (ret < 0 && errno == EINVAL) ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1); if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } 2. Run a memory eater on node 3 to use 40 GB memory before running pmbench. 3. Run pmbench with 64 processes, the working-set size of each process is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB. The CPU and the memory (as in step 1.) of all pmbench processes is bound to node 1 and 3. So, after CPU usage is balanced, some pmbench processes run on the CPUs of the node 3 will access the memory of the node 1. 4. After the pmbench processes run for 100 seconds, kill the memory eater. Now it's possible for some pmbench processes to migrate their pages from node 1 to node 3 to reduce cross-node accessing. Test results show that, with the patch, the pages can be migrated from node 1 to node 3 after killing the memory eater, and the pmbench score can increase about 17.5%. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm, compaction: make fast_isolate_freepages() stay within zoneVlastimil Babka1-5/+11
Compaction always operates on pages from a single given zone when isolating both pages to migrate and freepages. Pageblock boundaries are intersected with zone boundaries to be safe in case zone starts or ends in the middle of pageblock. The use of pageblock_pfn_to_page() protects against non-contiguous pageblocks. The functions fast_isolate_freepages() and fast_isolate_around() don't currently protect the fast freepage isolation thoroughly enough against these corner cases, and can result in freepage isolation operate outside of zone boundaries: - in fast_isolate_freepages() if we get a pfn from the first pageblock of a zone that starts in the middle of that pageblock, 'highest' can be a pfn outside of the zone. If we fail to isolate anything in this function, we may then call fast_isolate_around() on a pfn outside of the zone and there effectively do a set_pageblock_skip(page_to_pfn(highest)) which may currently hit a VM_BUG_ON() in some configurations - fast_isolate_around() checks only the zone end boundary and not beginning, nor that the pageblock is contiguous (with pageblock_pfn_to_page()) so it's possible that we end up calling isolate_freepages_block() on a range of pfn's from two different zones and end up e.g. isolating freepages under the wrong zone's lock. This patch should fix the above issues. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm/compaction: fix misbehaviors of fast_find_migrateblock()Wonhyuk Yang1-15/+12
In the fast_find_migrateblock(), it iterates ocer the freelist to find the proper pageblock. But there are some misbehaviors. First, if the page we found is equal to cc->migrate_pfn, it is considered that we didn't find a suitable pageblock. Secondly, if the loop was terminated because order is less than PAGE_ALLOC_COSTLY_ORDER, it could be considered that we found a suitable one. Thirdly, if the skip bit is set on the page block and we goto continue, it doesn't check nr_scanned. Fourthly, if the page block's skip bit is set, it checks that page block is the last of list, which is unnecessary. Link: https://lkml.kernel.org/r/[email protected] Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Wonhyuk Yang <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-24mm/compaction: correct deferral logic for proactive compactionCharan Teja Reddy1-6/+14
should_proactive_compact_node() returns true when sum of the weighted fragmentation score of all the zones in the node is greater than the wmark_high of compaction, which then triggers the proactive compaction that operates on the individual zones of the node. But proactive compaction runs on the zone only when its weighted fragmentation score is greater than wmark_low(=wmark_high - 10). This means that the sum of the weighted fragmentation scores of all the zones can exceed the wmark_high but individual weighted fragmentation zone scores can still be less than wmark_low which makes the unnecessary trigger of the proactive compaction only to return doing nothing. Issue with the return of proactive compaction with out even trying is its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores across the proactive compaction is same, thinking that compaction didn't make any progress but in reality it didn't even try. With the delay between successive retries for proactive compaction is 500msec, it can result into the deferral for ~30sec with out even trying the proactive compaction. Test scenario is that: compaction_proactiveness=50 thus the wmark_low = 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with sizes 5GB and 6GB respectively. After opening some apps on the android, the weighted fragmentation scores of these zones are 47 and 49 respectively. Since the sum of these fragmentation scores are above the wmark_high which triggers the proactive compaction and there since the individual zones weighted fragmentation scores are below wmark_low, it returns without trying the proactive compaction. As a result the weighted fragmentation scores of the zones are still 47 and 49 which makes the existing logic to defer the compaction thinking that noprogress is made across the compaction. Fix this by checking just zone fragmentation score, not the weighted, in __compact_finished() and use the zones weighted fragmentation score in fragmentation_score_node(). In the test case above, If the weighted average of is above wmark_high, then individual score (not adjusted) of atleast one zone has to be above wmark_high. Thus it avoids the unnecessary trigger and deferrals of the proactive compaction. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Charan Teja Reddy <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Vinayak Menon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>