aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-06-03mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in ↵Daniel Jordan1-3/+4
deferred init Patch series "initialize deferred pages with interrupts enabled", v4. Keep interrupts enabled during deferred page initialization in order to make code more modular and allow jiffies to update. Original approach, and discussion can be found here: http://lkml.kernel.org/r/[email protected] This patch (of 3): deferred_init_memmap() disables interrupts the entire time, so it calls touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it will run with interrupts enabled, at which point cond_resched() should be used instead. deferred_grow_zone() makes the same watchdog calls through code shared with deferred init but will continue to run with interrupts disabled, so it can't call cond_resched(). Pull the watchdog calls up to these two places to allow the first to be changed later, independently of the second. The frequency reduces from twice per pageblock (init and free) to once per max order block. Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages") Signed-off-by: Daniel Jordan <[email protected]> Signed-off-by: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Williams <[email protected]> Cc: Shile Zhang <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: James Morris <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Yiqian Wei <[email protected]> Cc: <[email protected]> [4.17+] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc: restrict and formalize compound_page_dtors[]Anshuman Khandual2-6/+6
Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and explicitly position them according to enum compound_dtor_id. This improves protection against possible misalignment between compound_page_dtors[] and enum compound_dtor_id later on. Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm, page_alloc: reset the zone->watermark_boost earlyCharan Teja Reddy1-1/+1
Updating the zone watermarks by any means, like min_free_kbytes, water_mark_scale_factor etc, when ->watermark_boost is set will result in higher low and high watermarks than the user asked. Below are the steps to reproduce the problem on system setup of Android kernel running on Snapdragon hardware. 1) Default settings of the system are as below: #cat /proc/sys/vm/min_free_kbytes = 5162 #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node Node 0, zone Normal min 797 low 8340 high 8539 2) Monitor the zone->watermark_boost(by adding a debug print in the kernel) and whenever it is greater than zero value, write the same value of min_free_kbytes obtained from step 1. #echo 5162 > /proc/sys/vm/min_free_kbytes 3) Then read the zone watermarks in the system while the ->watermark_boost is zero. This should show the same values of watermarks as step 1 but shown a higher values than asked. #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node Node 0, zone Normal min 797 low 21148 high 21347 These higher values are because of updating the zone watermarks using the macro min_wmark_pages(zone) which also adds the zone->watermark_boost. #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) So the steps that lead to the issue are: 1) On the extfrag event, watermarks are boosted by storing the required value in ->watermark_boost. 2) User tries to update the zone watermarks level in the system through min_free_kbytes or watermark_scale_factor. 3) Later, when kswapd woke up, it resets the zone->watermark_boost to zero. In step 2), we use the min_wmark_pages() macro to store the watermarks in the zone structure thus the values are always offsetted by ->watermark_boost value. This can be avoided by resetting the ->watermark_boost to zero before it is used. Signed-off-by: Charan Teja Reddy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Baoquan He <[email protected]> Cc: Vinayak Menon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: reset numa stats for boot pagesetsSandipan Das1-0/+15
Initially, the per-cpu pagesets of each zone are set to the boot pagesets. The real pagesets are allocated later but before that happens, page allocations do occur and the numa stats for the boot pagesets get incremented since they are common to all zones at that point. The real pagesets, however, are allocated for the populated zones only. Unpopulated zones, like those associated with memory-less nodes, continue using the boot pageset and end up skewing the numa stats of the corresponding node. E.g. $ numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 4 5 6 7 node 1 size: 8131 MB node 1 free: 6980 MB node distances: node 0 1 0: 10 40 1: 40 10 $ numastat node0 node1 numa_hit 108 56495 numa_miss 0 0 numa_foreign 0 0 interleave_hit 0 4537 local_node 108 31547 other_node 0 24948 Hence, the boot pageset stats need to be cleared after the real pagesets are allocated. After this point, the stats of the boot pagesets do not change as page allocations requested for a memory-less node will either fail (if __GFP_THISNODE is used) or get fulfilled by a preferred zone of a different node based on the fallback zonelist. [[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Link: http://lkml.kernel.org/r/9c9c2d1b15e37f6e6bf32f99e3100035e90c4ac9.1588868430.git.sandipan@linux.ibm.com Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: rename gfpflags_to_migratetype to gfp_migratetype for same conventionWei Yang4-8/+7
Pageblock migrate type is encoded in GFP flags, just as zone_type and zonelist. Currently we use gfp_zone() and gfp_zonelist() to extract related information, it would be proper to use the same naming convention for migrate type. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: use NODE_MASK_NONE in build_zonelists()Wei Yang1-2/+1
Slightly simplify the code by initializing user_mask with NODE_MASK_NONE, instead of later calling nodes_clear(). This saves a line of code. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc: integrate classzone_idx and high_zoneidxJoonsoo Kim12-150/+175
classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Baoquan He <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ye Xiaolong <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc: use ac->high_zoneidx for classzone_idxJoonsoo Kim1-1/+1
Patch series "integrate classzone_idx and high_zoneidx", v5. This patchset is followup of the problem reported and discussed two years ago [1, 2]. The problem this patchset solves is related to the classzone_idx on the NUMA system. It causes a problem when the lowmem reserve protection exists for some zones on a node that do not exist on other nodes. This problem was reported two years ago, and, at that time, the solution got general agreements [2]. But it was not upstreamed. [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop [2]: http://lkml.kernel.org/r/[email protected] This patch (of 2): Currently, we use classzone_idx to calculate lowmem reserve proetection for an allocation request. This classzone_idx causes a problem on NUMA systems when the lowmem reserve protection exists for some zones on a node that do not exist on other nodes. Before further explanation, I should first clarify how to compute the classzone_idx and the high_zoneidx. - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and represents the index of the highest zone the allocation can use - classzone_idx was supposed to be the index of the highest zone on the local node that the allocation can use, that is actually available in the system Think about following example. Node 0 has 4 populated zone, DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some zones, such as MOVABLE, doesn't exist on node 1 and this makes following difference. Assume that there is an allocation request whose gfp_zone(gfp_mask) is the zone, MOVABLE. Then, it's high_zoneidx is 3. If this allocation is initiated on node 0, it's classzone_idx is 3 since actually available/usable zone on local (node 0) is MOVABLE. If this allocation is initiated on node 1, it's classzone_idx is 2 since actually available/usable zone on local (node 1) is NORMAL. You can see that classzone_idx of the allocation request are different according to their starting node, even if their high_zoneidx is the same. Think more about these two allocation requests. If they are processed on local, there is no problem. However, if allocation is initiated on node 1 are processed on remote, in this example, at the NORMAL zone on node 0, due to memory shortage, problem occurs. Their different classzone_idx leads to different lowmem reserve and then different min watermark. See the following example. root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo Node 0, zone DMA per-node stats ... pages free 3965 min 5 low 8 high 11 spanned 4095 present 3998 managed 3977 protection: (0, 2961, 4928, 5440) ... Node 0, zone DMA32 pages free 757955 min 1129 low 1887 high 2645 spanned 1044480 present 782303 managed 758116 protection: (0, 0, 1967, 2479) ... Node 0, zone Normal pages free 459806 min 750 low 1253 high 1756 spanned 524288 present 524288 managed 503620 protection: (0, 0, 0, 4096) ... Node 0, zone Movable pages free 130759 min 195 low 326 high 457 spanned 1966079 present 131072 managed 131072 protection: (0, 0, 0, 0) ... Node 1, zone DMA pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1006, 1006) Node 1, zone DMA32 pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1006, 1006) Node 1, zone Normal per-node stats ... pages free 233277 min 383 low 640 high 897 spanned 262144 present 262144 managed 257744 protection: (0, 0, 0, 0) ... Node 1, zone Movable pages free 0 min 0 low 0 high 0 spanned 262144 present 0 managed 0 protection: (0, 0, 0, 0) - static min watermark for the NORMAL zone on node 0 is 750. - lowmem reserve for the request with classzone idx 3 at the NORMAL on node 0 is 4096. - lowmem reserve for the request with classzone idx 2 at the NORMAL on node 0 is 0. So, overall min watermark is: allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846 allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750 Allocation initiated on node 1 will have some precedence than allocation initiated on node 0 because min watermark of the former allocation is lower than the other. So, allocation initiated on node 1 could succeed on node 0 when allocation initiated on node 0 could not, and, this could cause too many numa_miss allocation. Then, performance could be downgraded. Recently, there was a regression report about this problem on CMA patches since CMA memory are placed in ZONE_MOVABLE by those patches. I checked that problem is disappeared with this fix that uses high_zoneidx for classzone_idx. http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop Using high_zoneidx for classzone_idx is more consistent way than previous approach because system's memory layout doesn't affect anything to it. With this patch, both classzone_idx on above example will be 3 so will have the same min watermark. allocation initiated on node 0: 750 + 4096 = 4846 allocation initiated on node 1: 750 + 4096 = 4846 One could wonder if there is a side effect that allocation initiated on node 1 will use higher bar when allocation is handled on local since classzone_idx could be higher than before. It will not happen because the zone without managed page doesn't contributes lowmem_reserve at all. Reported-by: Ye Xiaolong <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Ye Xiaolong <[email protected]> Reviewed-by: Baoquan He <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/vmstat.c: do not show lowmem reserve protection information of empty zoneBaoquan He1-6/+6
Because the lowmem reserve protection of a zone can't tell anything if the zone is empty, except of adding one more line in /proc/zoneinfo. Let's remove it from that zone's showing. Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is emptyBaoquan He1-1/+3
When requesting memory allocation from a specific zone is not satisfied, it will fall to lower zone to try allocating memory. In this case, lower zone's ->lowmem_reserve[] will help protect its own memory resource. The higher the relevant ->lowmem_reserve[] is, the harder the upper zone can get memory from this lower zone. However, this protection mechanism should be applied to populated zone, but not an empty zone. So filling ->lowmem_reserve[] for empty zone is not necessary, and may mislead people that it's valid data in that zone. Node 2, zone DMA pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1024, 1024) Node 2, zone DMA32 pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1024, 1024) Node 2, zone Normal per-node stats nr_inactive_anon 0 nr_active_anon 143 nr_inactive_file 0 nr_active_file 0 nr_unevictable 0 nr_slab_reclaimable 45 nr_slab_unreclaimable 254 Here clear out zone->lowmem_reserve[] if zone is empty. Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: only tune sysctl_lowmem_reserve_ratio value once when ↵Baoquan He1-2/+9
changing it Patch series "improvements about lowmem_reserve and /proc/zoneinfo", v2. This patch (of 3): When people write to /proc/sys/vm/lowmem_reserve_ratio to change sysctl_lowmem_reserve_ratio[], setup_per_zone_lowmem_reserve() is called to recalculate all ->lowmem_reserve[] for each zone of all nodes as below: static void setup_per_zone_lowmem_reserve(void) { ... for_each_online_pgdat(pgdat) { for (j = 0; j < MAX_NR_ZONES; j++) { ... while (idx) { ... if (sysctl_lowmem_reserve_ratio[idx] < 1) { sysctl_lowmem_reserve_ratio[idx] = 0; lower_zone->lowmem_reserve[j] = 0; } else { ... } } } } Meanwhile, here, sysctl_lowmem_reserve_ratio[idx] will be tuned if its value is smaller than '1'. As we know, sysctl_lowmem_reserve_ratio[] is set for zone without regarding to which node it belongs to. That means the tuning will be done on all nodes, even though it has been done in the first node. And the tuning will be done too even when init_per_zone_wmark_min() calls setup_per_zone_lowmem_reserve(), where actually nobody tries to change sysctl_lowmem_reserve_ratio[]. So now move the tuning into lowmem_reserve_ratio_sysctl_handler(), to make code logic more reasonable. Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Baoquan He <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: remove unused free_bootmem_with_active_regionsBaoquan He2-29/+0
Since commit 397dc00e249ec64e10 ("mips: sgi-ip27: switch from DISCONTIGMEM to SPARSEMEM"), the last caller of free_bootmem_with_active_regions() was gone. Now no user calls it any more. Let's remove it. Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocationsRoman Gushchin1-0/+14
Currently a cma area is barely used by the page allocator because it's used only as a fallback from movable, however kswapd tries hard to make sure that the fallback path isn't used. This results in a system evicting memory and pushing data into swap, while lots of CMA memory is still available. This happens despite the fact that alloc_contig_range is perfectly capable of moving any movable allocations out of the way of an allocation. To effectively use the cma area let's alter the rules: if the zone has more free cma pages than the half of total free pages in the zone, use cma pageblocks first and fallback to movable blocks in the case of failure. [[email protected]: ifdef the cma-specific code] Link: http://lkml.kernel.org/r/[email protected] Co-developed-by: Rik van Riel <[email protected]> Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Qian Cai <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Joonsoo Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: extract check_[new|free]_page_bad() common part to ↵Wei Yang1-19/+17
page_bad_reason() We share similar code in check_[new|free]_page_bad() to get the page's bad reason. Let's extract it and reduce code duplication. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: rename free_pages_check() to check_free_page()Wei Yang1-5/+5
free_pages_check() is the counterpart of check_new_page(). Rename it to use the same naming convention. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad()Wei Yang1-2/+2
free_pages_check_bad() is the counterpart of check_new_page_bad(). Rename it to use the same naming convention. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: bad_flags is not necessary for bad_page()Wei Yang1-24/+10
After commit 5b57b8f22709 ("mm/debug.c: always print flags in dump_page()"), page->flags is always printed for a bad page. It is not necessary to have bad_flags any more. Suggested-by: Anshuman Khandual <[email protected]> Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoisonWei Yang1-7/+5
Patch series "mm/page_alloc.c: cleanup on check page", v3. This patchset does some cleanup related to check page. 1. Remove unnecessary bad_reason assignment 2. Remove bad_flags to bad_page() 3. Rename function for naming convention 4. Extract common part to check page Thanks for suggestions from David Rientjes and Anshuman Khandual. This patch (of 5): Since function returns directly, bad_[reason|flags] is not used any where. And move this to the first. This is a following cleanup for commit e570f56cccd21 ("mm: check_new_page_bad() directly returns in __PG_HWPOISON case") Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03docs/vm: update memory-models documentationMike Rapoport1-5/+4
To reflect the updates to free_area_init() family of functions. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: simplify find_min_pfn_with_active_regions()Mike Rapoport1-19/+1
find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with nid parameter set to MAX_NUMNODES. This makes the find_min_pfn_for_node() traverse all memblock memory regions although the first PFN in the system can be easily found with memblock_start_of_DRAM(). Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and drop now unused find_min_pfn_for_node(). Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: clean up free_area_init_node() and its helpersMike Rapoport1-82/+22
free_area_init_node() now always uses memblock info and the zone PFN limits so it does not need the backwards compatibility functions to calculate the zone spanned and absent pages. The removal of the compat_ versions of zone_{abscent,spanned}_pages_in_node() in turn, makes zone_size and zhole_size parameters unused. The node_start_pfn is determined by get_pfn_range_for_nid(), so there is no need to pass it to free_area_init_node(). As a result, the only required parameter to free_area_init_node() is the node ID, all the rest are removed along with no longer used compat_zone_{abscent,spanned}_pages_in_node() helpers. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: rename free_area_init_node() to free_area_init_memoryless_node()Mike Rapoport3-15/+6
free_area_init_node() is only used by x86 to initialize a memory-less nodes. Make its name reflect this and drop all the function parameters except node ID as they are anyway zero. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: free_area_init: allow defining max_zone_pfn in descending orderMike Rapoport3-34/+34
Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it is sorted in descending order allows using free_area_init() on such architectures. Add top -> down traversal of max_zone_pfn array in free_area_init() and use the latter in ARC node/zone initialization. [[email protected]: ARC fix] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: arc: free_area_init(): take into account PAE40 mode] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: declare arch_has_descending_max_zone_pfns()] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Reviewed-by: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Guenter Roeck <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODESMike Rapoport4-47/+0
The memmap_init() function was made to iterate over memblock regions and as the result the early_pfn_in_nid() function became obsolete. Since CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real implementation of early_pfn_in_nid(), it is also not needed anymore. Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES. Co-developed-by: Hoan Tran <[email protected]> Signed-off-by: Hoan Tran <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memmap_init: iterate over memblock regions rather that check each PFNBaoquan He2-28/+19
When called during boot the memmap_init_zone() function checks if each PFN is valid and actually belongs to the node being initialized using early_pfn_valid() and early_pfn_in_nid(). Each such check may cost up to O(log(n)) where n is the number of memory banks, so for large amount of memory overall time spent in early_pfn*() becomes substantial. Since the information is anyway present in memblock, we can iterate over memblock memory regions in memmap_init() and only call memmap_init_zone() for PFN ranges that are know to be valid and in the appropriate node. [[email protected]: fix a compilation warning from Clang] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix the incorrect hole in fast_isolate_freepages()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Qian Cai <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03xtensa: simplify detection of memory zone boundariesMike Rapoport1-4/+4
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03unicore32: simplify detection of memory zone boundariesMike Rapoport4-50/+15
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03sparc32: simplify detection of memory zone boundariesMike Rapoport1-16/+5
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03parisc: simplify detection of memory zone boundariesMike Rapoport1-19/+3
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03m68k: mm: simplify detection of memory zone boundariesMike Rapoport2-13/+8
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03csky: simplify detection of memory zone boundariesMike Rapoport1-15/+11
The free_area_init() function only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03arm64: simplify detection of memory zone boundaries for UMA configsMike Rapoport1-54/+0
The free_area_init() function only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Acked-by: Catalin Marinas <[email protected]> Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03arm: simplify detection of memory zone boundariesMike Rapoport1-59/+7
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03alpha: simplify detection of memory zone boundariesMike Rapoport1-14/+4
free_area_init() only requires the definition of maximal PFN for each of the supported zone rater than calculation of actual zone sizes and the sizes of the holes between the zones. After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP the free_area_init() is available to all architectures. Using this function instead of free_area_init_node() simplifies the zone detection. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: use free_area_init() instead of free_area_init_nodes()Mike Rapoport15-25/+18
free_area_init() has effectively became a wrapper for free_area_init_nodes() and there is no point of keeping it. Still free_area_init() name is shorter and more general as it does not imply necessity to initialize multiple nodes. Rename free_area_init_nodes() to free_area_init(), update the callers and drop old version of free_area_init(). Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Reviewed-by: Baoquan He <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Brian Cain <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: free_area_init: use maximal zone PFNs rather than zone sizesMike Rapoport12-60/+38
Currently, architectures that use free_area_init() to initialize memory map and node and zone structures need to calculate zone and hole sizes. We can use free_area_init_nodes() instead and let it detect the zone boundaries while the architectures will only have to supply the possible limits for the zones. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Reviewed-by: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP optionMike Rapoport20-119/+74
CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of nodes and zones structures between the systems that have region to node mapping in memblock and those that don't. Currently all the NUMA architectures enable this option and for the non-NUMA systems we can presume that all the memory belongs to node 0 and therefore the compile time configuration option is not required. The remaining few architectures that use DISCONTIGMEM without NUMA are easily updated to use memblock_add_node() instead of memblock_add() and thus have proper correspondence of memblock regions to NUMA nodes. Still, free_area_init_node() must have a backward compatible version because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is different. Once all the architectures will use the new semantics, the entire compatibility layer can be dropped. To avoid addition of extra run time memory to store node id for architectures that keep memblock but have only a single node, the node id field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and the corresponding accessors presume that in those cases it is always 0. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Acked-by: Catalin Marinas <[email protected]> [arm64] Cc: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: make early_pfn_to_nid() and related defintions close to each otherMike Rapoport3-37/+27
early_pfn_to_nid() and its helper __early_pfn_to_nid() are spread around include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c. Drop unused stub for __early_pfn_to_nid() and move its actual generic implementation close to its users. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Reviewed-by: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: memblock: replace dereferences of memblock_region.nid with API callsMike Rapoport4-10/+17
Patch series "mm: rework free_area_init*() funcitons". After the discussion [1] about removal of CONFIG_NODES_SPAN_OTHER_NODES and CONFIG_HAVE_MEMBLOCK_NODE_MAP options, I took it a bit further and updated the node/zone initialization. Since all architectures have memblock, it is possible to use only the newer version of free_area_init_node() that calculates the zone and node boundaries based on memblock node mapping and architectural limits on possible zone PFNs. The architectures that still determined zone and hole sizes can be switched to the generic code and the old code that took those zone and hole sizes can be simply removed. And, since it all started from the removal of CONFIG_NODES_SPAN_OTHER_NODES, the memmap_init() is now updated to iterate over memblocks and so it does not need to perform early_pfn_to_nid() query for every PFN. [1] https://lore.kernel.org/lkml/[email protected] This patch (of 21): There are several places in the code that directly dereference memblock_region.nid despite this field being defined only when CONFIG_HAVE_MEMBLOCK_NODE_MAP=y. Replace these with calls to memblock_get_region_nid() to improve code robustness and to avoid possible breakage when CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed. Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Hoan Tran <[email protected]> [arm64] Reviewed-by: Baoquan He <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Guo Ren <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: clarify __GFP_MEMALLOC usageMichal Hocko1-0/+5
It seems that the existing documentation is not explicit about the expected usage and potential risks enough. While it is calls out that users have to free memory when using this flag it is not really apparent that users have to careful to not deplete memory reserves and that they should implement some sort of throttling wrt. freeing process. This is partly based on Neil's explanation [1]. Let's also call out that a pre allocated pool allocator should be considered. [1] http://lkml.kernel.org/r/[email protected] [[email protected]: coding style fixes] [[email protected]: update] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neil Brown <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: John Hubbard <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03string.h: fix incompatibility between FORTIFY_SOURCE and KASANDaniel Axtens1-12/+48
The memcmp KASAN self-test fails on a kernel with both KASAN and FORTIFY_SOURCE. When FORTIFY_SOURCE is on, a number of functions are replaced with fortified versions, which attempt to check the sizes of the operands. However, these functions often directly invoke __builtin_foo() once they have performed the fortify check. Using __builtins may bypass KASAN checks if the compiler decides to inline it's own implementation as sequence of instructions, rather than emit a function call that goes out to a KASAN-instrumented implementation. Why is only memcmp affected? ============================ Of the string and string-like functions that kasan_test tests, only memcmp is replaced by an inline sequence of instructions in my testing on x86 with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2). I believe this is due to compiler heuristics. For example, if I annotate kmalloc calls with the alloc_size annotation (and disable some fortify compile-time checking!), the compiler will replace every memset except the one in kmalloc_uaf_memset with inline instructions. (I have some WIP patches to add this annotation.) Does this affect other functions in string.h? ============================================= Yes. Anything that uses __builtin_* rather than __real_* could be affected. This looks like: - strncpy - strcat - strlen - strlcpy maybe, under some circumstances? - strncat under some circumstances - memset - memcpy - memmove - memcmp (as noted) - memchr - strcpy Whether a function call is emitted always depends on the compiler. Most bugs should get caught by FORTIFY_SOURCE, but the missed memcmp test shows that this is not always the case. Isn't FORTIFY_SOURCE disabled with KASAN? ========================================- The string headers on all arches supporting KASAN disable fortify with kasan, but only when address sanitisation is _also_ disabled. For example from x86: #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__) /* * For files that are not instrumented (e.g. mm/slub.c) we * should use not instrumented version of mem* functions. */ #define memcpy(dst, src, len) __memcpy(dst, src, len) #define memmove(dst, src, len) __memmove(dst, src, len) #define memset(s, c, n) __memset(s, c, n) #ifndef __NO_FORTIFY #define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */ #endif #endif This comes from commit 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions"), and doesn't work when KASAN is enabled and the file is supposed to be sanitised - as with test_kasan.c I'm pretty sure this is not wrong, but not as expansive it should be: * we shouldn't use __builtin_memcpy etc in files where we don't have instrumentation - it could devolve into a function call to memcpy, which will be instrumented. Rather, we should use __memcpy which by convention is not instrumented. * we also shouldn't be using __builtin_memcpy when we have a KASAN instrumented file, because it could be replaced with inline asm that will not be instrumented. What is correct behaviour? ========================== Firstly, there is some overlap between fortification and KASAN: both provide some level of _runtime_ checking. Only fortify provides compile-time checking. KASAN and fortify can pick up different things at runtime: - Some fortify functions, notably the string functions, could easily be modified to consider sub-object sizes (e.g. members within a struct), and I have some WIP patches to do this. KASAN cannot detect these because it cannot insert poision between members of a struct. - KASAN can detect many over-reads/over-writes when the sizes of both operands are unknown, which fortify cannot. So there are a couple of options: 1) Flip the test: disable fortify in santised files and enable it in unsanitised files. This at least stops us missing KASAN checking, but we lose the fortify checking. 2) Make the fortify code always call out to real versions. Do this only for KASAN, for fear of losing the inlining opportunities we get from __builtin_*. (We can't use kasan_check_{read,write}: because the fortify functions are _extern inline_, you can't include _static_ inline functions without a compiler warning. kasan_check_{read,write} are static inline so we can't use them even when they would otherwise be suitable.) Take approach 2 and call out to real versions when KASAN is enabled. Use __underlying_foo to distinguish from __real_foo: __real_foo always refers to the kernel's implementation of foo, __underlying_foo could be either the kernel implementation or the __builtin_foo implementation. This is sometimes enough to make the memcmp test succeed with FORTIFY_SOURCE enabled. It is at least enough to get the function call into the module. One more fix is needed to make it reliable: see the next patch. Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions") Signed-off-by: Daniel Axtens <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: David Gow <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Cc: Daniel Micay <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03kasan: stop tests being eliminated as dead code with FORTIFY_SOURCEDaniel Axtens1-10/+19
Patch series "Fix some incompatibilites between KASAN and FORTIFY_SOURCE", v4. 3 KASAN self-tests fail on a kernel with both KASAN and FORTIFY_SOURCE: memchr, memcmp and strlen. When FORTIFY_SOURCE is on, a number of functions are replaced with fortified versions, which attempt to check the sizes of the operands. However, these functions often directly invoke __builtin_foo() once they have performed the fortify check. The compiler can detect that the results of these functions are not used, and knows that they have no other side effects, and so can eliminate them as dead code. Why are only memchr, memcmp and strlen affected? ================================================ Of string and string-like functions, kasan_test tests: * strchr -> not affected, no fortified version * strrchr -> likewise * strcmp -> likewise * strncmp -> likewise * strnlen -> not affected, the fortify source implementation calls the underlying strnlen implementation which is instrumented, not a builtin * strlen -> affected, the fortify souce implementation calls a __builtin version which the compiler can determine is dead. * memchr -> likewise * memcmp -> likewise * memset -> not affected, the compiler knows that memset writes to its first argument and therefore is not dead. Why does this not affect the functions normally? ================================================ In string.h, these functions are not marked as __pure, so the compiler cannot know that they do not have side effects. If relevant functions are marked as __pure in string.h, we see the following warnings and the functions are elided: lib/test_kasan.c: In function `kasan_memchr': lib/test_kasan.c:606:2: warning: statement with no effect [-Wunused-value] memchr(ptr, '1', size + 1); ^~~~~~~~~~~~~~~~~~~~~~~~~~ lib/test_kasan.c: In function `kasan_memcmp': lib/test_kasan.c:622:2: warning: statement with no effect [-Wunused-value] memcmp(ptr, arr, size+1); ^~~~~~~~~~~~~~~~~~~~~~~~ lib/test_kasan.c: In function `kasan_strings': lib/test_kasan.c:645:2: warning: statement with no effect [-Wunused-value] strchr(ptr, '1'); ^~~~~~~~~~~~~~~~ ... This annotation would make sense to add and could be added at any point, so the behaviour of test_kasan.c should change. The fix ======= Make all the functions that are pure write their results to a global, which makes them live. The strlen and memchr tests now pass. The memcmp test still fails to trigger, which is addressed in the next patch. [[email protected]: drop patch 3] Link: http://lkml.kernel.org/r/[email protected] Fixes: 0c96350a2d2f ("lib/test_kasan.c: add tests for several string/memory API functions") Signed-off-by: Daniel Axtens <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: David Gow <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Cc: Daniel Micay <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/gup: might_lock_read(mmap_sem) in get_user_pages_fast()John Hubbard1-0/+3
Instead of scattering these assertions across the drivers, do this assertion inside the core of get_user_pages_fast*() functions. That also includes pin_user_pages_fast*() routines. Add a might_lock_read(mmap_sem) call to internal_get_user_pages_fast(). Suggested-by: Matthew Wilcox <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Jason Gunthorpe <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03drm/i915: convert get_user_pages() --> pin_user_pages()John Hubbard1-9/+13
This code was using get_user_pages*(), in a "Case 2" scenario (DMA/RDMA), using the categorization from [1]. That means that it's time to convert the get_user_pages*() + put_page() calls to pin_user_pages*() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Chris Wilson <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Jani Nikula <[email protected]> Cc: "Joonas Lahtinen" <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Matthew Auld <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/gup: introduce pin_user_pages_fast_only()John Hubbard2-0/+38
This is the FOLL_PIN equivalent of __get_user_pages_fast(), except with a more descriptive name, and gup_flags instead of a boolean "write" in the argument list. Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Chris Wilson <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Jani Nikula <[email protected]> Cc: "Joonas Lahtinen" <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/gup: refactor and de-duplicate gup_fast() codeJohn Hubbard2-41/+43
There were two nearly identical sets of code for gup_fast() style of walking the page tables with interrupts disabled. This has lead to the usual maintenance problems that arise from having duplicated code. There is already a core internal routine in gup.c for gup_fast(), so just enhance it very slightly: allow skipping the fall-back to "slow" (regular) get_user_pages(), via the new FOLL_FAST_ONLY flag. Then, just call internal_get_user_pages_fast() from __get_user_pages_fast(), and adjust the API to match pre-existing API behavior. There is a change in behavior from this refactoring: the nested form of interrupt disabling is used in all gup_fast() variants now. That's because there is only one place that interrupt disabling for page walking is done, and so the safer form is required. This should, if anything, eliminate possible (rare) bugs, because the non-nested form of enabling interrupts was fragile at best. [[email protected]: fixup] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Chris Wilson <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Jani Nikula <[email protected]> Cc: "Joonas Lahtinen" <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/gup: move __get_user_pages_fast() down a few lines in gup.cJohn Hubbard1-66/+66
Patch series "mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()", v2. In order to convert the drm/i915 driver from get_user_pages() to pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was required. That led to refactoring __get_user_pages_fast(), with the following goals: 1) As above: provide a pin_user_pages*() routine for drm/i915 to call, in place of __get_user_pages_fast(), 2) Get rid of the gup.c duplicate code for walking page tables with interrupts disabled. This duplicate code is a minor maintenance problem anyway. 3) Make it easy for an upcoming patch from Souptick, which aims to convert __get_user_pages_fast() to use a gup_flags argument, instead of a bool writeable arg. Also, if this series looks good, we can ask Souptick to change the name as well, to whatever the consensus is. My initial recommendation is: get_user_pages_fast_only(), to match the new pin_user_pages_only(). This patch (of 4): This is in order to avoid a forward declaration of internal_get_user_pages_fast(), in the next patch. This is code movement only--all generated code should be identical. Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Chris Wilson <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Jani Nikula <[email protected]> Cc: "Joonas Lahtinen" <[email protected]> Cc: Matthew Auld <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/memcg: optimize memory.numa_stat like memory.statShakeel Butt1-25/+26
Currently reading memory.numa_stat traverses the underlying memcg tree multiple times to accumulate the stats to present the hierarchical view of the memcg tree. However the kernel already maintains the hierarchical view of the stats and use it in memory.stat. Just use the same mechanism in memory.numa_stat as well. I ran a simple benchmark which reads root_mem_cgroup's memory.numa_stat file in the presense of 10000 memcgs. The results are: Without the patch: $ time cat /dev/cgroup/memory/memory.numa_stat > /dev/null real 0m0.700s user 0m0.001s sys 0m0.697s With the patch: $ time cat /dev/cgroup/memory/memory.numa_stat > /dev/null real 0m0.001s user 0m0.001s sys 0m0.000s [[email protected]: avoid forcing out-of-line code generation] Signed-off-by: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Michal Hocko <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm/slub: fix a memory leak in sysfs_slab_add()Wang Hai1-1/+3
syzkaller reports for memory leak when kobject_init_and_add() returns an error in the function sysfs_slab_add() [1] When this happened, the function kobject_put() is not called for the corresponding kobject, which potentially leads to memory leak. This patch fixes the issue by calling kobject_put() even if kobject_init_and_add() fails. [1] BUG: memory leak unreferenced object 0xffff8880a6d4be88 (size 8): comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s) hex dump (first 8 bytes): 70 69 64 5f 33 00 ff ff pid_3... backtrace: kstrdup+0x35/0x70 mm/util.c:60 kstrdup_const+0x3d/0x50 mm/util.c:82 kvasprintf_const+0x112/0x170 lib/kasprintf.c:48 kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xd8/0x170 lib/kobject.c:473 sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811 __kmem_cache_create+0x50a/0x570 mm/slub.c:4384 create_cache+0x113/0x1e0 mm/slab_common.c:407 kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505 kmem_cache_create+0xd/0x10 mm/slab_common.c:564 create_pid_cachep kernel/pid_namespace.c:54 [inline] create_pid_namespace kernel/pid_namespace.c:96 [inline] copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148 create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95 unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229 ksys_unshare+0x3d2/0x770 kernel/fork.c:2969 __do_sys_unshare kernel/fork.c:3037 [inline] __se_sys_unshare kernel/fork.c:3035 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035 do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295 Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-03mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()Andrea Arcangeli1-3/+28
Write protect anon page faults require an accurate mapcount to decide if to break the COW or not. This is implemented in the THP path with reuse_swap_page() -> page_trans_huge_map_swapcount()/page_trans_huge_mapcount(). If the COW triggers while the other processes sharing the page are under a huge pmd split, to do an accurate reading, we must ensure the mapcount isn't computed while it's being transferred from the head page to the tail pages. reuse_swap_cache() already runs serialized by the page lock, so it's enough to add the page lock around __split_huge_pmd_locked too, in order to add the missing serialization. Note: the commit in "Fixes" is just to facilitate the backporting, because the code before such commit didn't try to do an accurate THP mapcount calculation and it instead used the page_count() to decide if to COW or not. Both the page_count and the pin_count are THP-wide refcounts, so they're inaccurate if used in reuse_swap_page(). Reverting such commit (besides the unrelated fix to the local anon_vma assignment) would have also opened the window for memory corruption side effects to certain workloads as documented in such commit header. Signed-off-by: Andrea Arcangeli <[email protected]> Suggested-by: Jann Horn <[email protected]> Reported-by: Jann Horn <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults") Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>