Age | Commit message | Author | Files | Lines
2019-12-01 | x86/kasan: support KASAN_VMALLOC | Daniel Axtens | 2 | -0/+62
In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Axtens <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Reviewed-by: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | fork: support VMAP_STACK with KASAN_VMALLOC | Daniel Axtens | 2 | -4/+9
Supporting VMAP_STACK with KASAN_VMALLOC is straightforward: - clear the shadow region of vmapped stacks when swapping them in - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Axtens <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | kasan: add test for vmalloc | Daniel Axtens | 1 | -0/+26
Test kasan vmalloc support by adding a new test to the module. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Axtens <[email protected]> Reviewed-by: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | kasan: support backing vmalloc space with real shadow memory | Daniel Axtens | 9 | -9/+408
Patch series "kasan: support backing vmalloc space with real shadow memory", v11.

Currently, vmalloc space is backed by the early shadow page. This means that kasan is incompatible with VMAP_STACK.

This series provides a mechanism to back vmalloc space with real, dynamically allocated memory. I have only wired up x86, because that's the only currently supported arch I can work with easily, but it's very easy to wire up other architectures, and it appears that there is some work-in-progress code to do this on arm64 and s390.

This has been discussed before in the context of VMAP_STACK:
 - https://bugzilla.kernel.org/show_bug.cgi?id=202009
 - https://lkml.org/lkml/2018/7/22/198
 - https://lkml.org/lkml/2019/7/19/822

In terms of implementation details:

Most mappings in vmalloc space are small, requiring less than a full page of shadow space. Allocating a full shadow page per mapping would therefore be wasteful. Furthermore, to ensure that different mappings use different shadow pages, mappings would have to be aligned to KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.

Instead, share backing space across multiple mappings. Allocate a backing page when a mapping in vmalloc space uses a particular page of the shadow region. This page can be shared by other vmalloc mappings later on. We hook in to the vmap infrastructure to lazily clean up unused shadow memory.

Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:

 - Turning on KASAN, inline instrumentation, without vmalloc, introduces a 4.1x-4.2x slowdown in vmalloc operations.
 - Turning this on introduces the following slowdowns over KASAN:
   * ~1.76x slower single-threaded (test_vmalloc.sh performance)
   * ~2.18x slower when both cpus are performing operations simultaneously (test_vmalloc.sh sequential_test_order=1)

This is unfortunate but given that this is a debug feature only, not the end of the world. The benchmarks are also a stress-test for the vmalloc subsystem: they're not indicative of an overall 2x slowdown!

This patch (of 4):

Hook into vmalloc and vmap, and dynamically allocate real shadow memory to back the mappings.

Most mappings in vmalloc space are small, requiring less than a full page of shadow space. Allocating a full shadow page per mapping would therefore be wasteful. Furthermore, to ensure that different mappings use different shadow pages, mappings would have to be aligned to KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.

Instead, share backing space across multiple mappings. Allocate a backing page when a mapping in vmalloc space uses a particular page of the shadow region. This page can be shared by other vmalloc mappings later on. We hook in to the vmap infrastructure to lazily clean up unused shadow memory.

To avoid the difficulties around swapping mappings around, this code expects that the part of the shadow region that covers the vmalloc space will not be covered by the early shadow page, but will be left unmapped. This will require changes in arch-specific code. This allows KASAN with VMAP_STACK, and may be helpful for architectures that do not have a separate module space (e.g. powerpc64, which I am currently working on). It also allows relaxing the module alignment back to PAGE_SIZE.

Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:

 - Turning on KASAN, inline instrumentation, without vmalloc, introduces a 4.1x-4.2x slowdown in vmalloc operations.
 - Turning this on introduces the following slowdowns over KASAN:
   * ~1.76x slower single-threaded (test_vmalloc.sh performance)
   * ~2.18x slower when both cpus are performing operations simultaneously (test_vmalloc.sh sequential_test_order=1)

This is unfortunate but given that this is a debug feature only, not the end of the world.

The full benchmark results are:

Performance                    No KASAN     KASAN original  x baseline   KASAN vmalloc  x baseline  x KASAN
fix_size_alloc_test              662004           11404956       17.23        19144610       28.92     1.68
full_fit_alloc_test              710950           12029752       16.92        13184651       18.55     1.10
long_busy_list_alloc_test       9431875           43990172        4.66        82970178        8.80     1.89
random_size_alloc_test          5033626           23061762        4.58        47158834        9.37     2.04
fix_align_alloc_test            1252514           15276910       12.20        31266116       24.96     2.05
random_size_align_alloc_te      1648501           14578321        8.84        25560052       15.51     1.75
align_shift_alloc_test              147                830        5.65            5692       38.72     6.86
pcpu_alloc_test                   80732             125520        1.55          140864        1.74     1.12
Total Cycles               119240774314       763211341128        6.40   1390338696894       11.66     1.82

Sequential, 2 cpus             No KASAN     KASAN original  x baseline   KASAN vmalloc  x baseline  x KASAN
fix_size_alloc_test             1423150           14276550       10.03        27733022       19.49     1.94
full_fit_alloc_test             1754219           14722640        8.39        15030786        8.57     1.02
long_busy_list_alloc_test      11451858           52154973        4.55       107016027        9.34     2.05
random_size_alloc_test          5989020           26735276        4.46        68885923       11.50     2.58
fix_align_alloc_test            2050976           20166900        9.83        50491675       24.62     2.50
random_size_align_alloc_te      2858229           17971700        6.29        38730225       13.55     2.16
align_shift_alloc_test              405               6428       15.87           26253       64.82     4.08
pcpu_alloc_test                  127183             151464        1.19          216263        1.70     1.43
Total Cycles                54181269392       308723699764        5.70    650772566394       12.01     2.11

fix_size_alloc_test             1420404           14289308       10.06        27790035       19.56     1.94
full_fit_alloc_test             1736145           14806234        8.53        15274301        8.80     1.03
long_busy_list_alloc_test      11404638           52270785        4.58       107550254        9.43     2.06
random_size_alloc_test          6017006           26650625        4.43        68696127       11.42     2.58
fix_align_alloc_test            2045504           20280985        9.91        50414862       24.65     2.49
random_size_align_alloc_te      2845338           17931018        6.30        38510276       13.53     2.15
align_shift_alloc_test              472               3760        7.97            9656       20.46     2.57
pcpu_alloc_test                  118643             132732        1.12          146504        1.23     1.10
Total Cycles                54040011688       309102805492        5.72    651325675652       12.05     2.11

[[email protected]: fixups] Link: http://lkml.kernel.org/r/[email protected] Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mark Rutland <[email protected]> [shadow rework] Signed-off-by: Daniel Axtens <[email protected]> Co-developed-by: Mark Rutland <[email protected]> Acked-by: Vasily Gorbik <[email protected]> Reviewed-by: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
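For illustration, a minimal sketch of the allocation side of this scheme (not the merged code; shadow_page_mapped() and map_shadow_page() are made-up stand-ins for the arch page-table walk, and the lazy cleanup on vmap purge is omitted):

	/* Illustrative sketch only, not the upstream implementation. */
	static int back_vmalloc_shadow(const void *addr, unsigned long size)
	{
		unsigned long start, end, s;

		start = round_down((unsigned long)kasan_mem_to_shadow(addr), PAGE_SIZE);
		end   = round_up((unsigned long)kasan_mem_to_shadow(addr + size), PAGE_SIZE);

		for (s = start; s < end; s += PAGE_SIZE) {
			unsigned long page;

			if (shadow_page_mapped(s))	/* already backed, shared with an earlier mapping */
				continue;
			page = __get_free_page(GFP_KERNEL | __GFP_ZERO);
			if (!page || map_shadow_page(s, page))
				return -ENOMEM;
		}
		return 0;
	}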
2019-12-01 | mm/vmalloc: rework vmap_area_lock | Uladzislau Rezki (Sony) | 1 | -30/+50
With the new allocation approach introduced in the 5.2 kernel, it becomes possible to get rid of one global spinlock. By doing that we can further improve the KVA from the performance point of view. Basically we can have two independent locks, one for allocation part and another one for deallocation, because of two different entities: "free data structures" and "busy data structures". As a result, allocation/deallocation operations can still interfere between each other in case of running simultaneously on different CPUs, it means there is still dependency, but with two locks it becomes lower. Summarizing: - it reduces the high lock contention - it allows to perform operations on "free" and "busy" trees in parallel on different CPUs. Please note it does not solve scalability issue. Test results: In order to evaluate this patch, we can run "vmalloc test driver" to see how many CPU cycles it takes to complete all test cases running sequentially. All online CPUs run it so it will cause a high lock contention. HiKey 960, ARM64, 8xCPUs, big.LITTLE: <snip> sudo ./test_vmalloc.sh sequential_test_order=1 <snip> <default> [ 390.950557] All test took CPU0=457126382 cycles [ 391.046690] All test took CPU1=454763452 cycles [ 391.128586] All test took CPU2=454539334 cycles [ 391.222669] All test took CPU3=455649517 cycles [ 391.313946] All test took CPU4=388272196 cycles [ 391.410425] All test took CPU5=384036264 cycles [ 391.492219] All test took CPU6=387432964 cycles [ 391.578433] All test took CPU7=387201996 cycles <default> <patched> [ 304.721224] All test took CPU0=391521310 cycles [ 304.821219] All test took CPU1=393533002 cycles [ 304.917120] All test took CPU2=392243032 cycles [ 305.008986] All test took CPU3=392353853 cycles [ 305.108944] All test took CPU4=297630721 cycles [ 305.196406] All test took CPU5=297548736 cycles [ 305.288602] All test took CPU6=297092392 cycles [ 305.381088] All test took CPU7=297293597 cycles <patched> ~14%-23% patched variant is better. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Andrew Morton <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
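As a rough sketch of the locking split (simplified, with the rb-tree and list manipulation elided; the lock names follow the description above, but the exact upstream code differs):

	#include <linux/spinlock.h>

	struct vmap_area;

	static DEFINE_SPINLOCK(free_vmap_area_lock);	/* protects the "free" tree/list */
	static DEFINE_SPINLOCK(vmap_area_lock);		/* protects the "busy" tree/list */

	static void insert_busy_area(struct vmap_area *va)
	{
		spin_lock(&vmap_area_lock);
		/* link va into the busy rb-tree and list */
		spin_unlock(&vmap_area_lock);
	}

	static void return_to_free_tree(struct vmap_area *va)
	{
		spin_lock(&free_vmap_area_lock);
		/* merge va back into the free rb-tree and list */
		spin_unlock(&free_vmap_area_lock);
	}

Because allocation only touches the free tree while removal of a busy area only touches the busy tree, the two paths no longer serialize on a single global lock.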
2019-12-01 | selftests: vm: add fragment CONFIG_TEST_VMALLOC | Anders Roxell | 1 | -0/+1
When running test_vmalloc.sh smoke the following print out states that the fragment is missing. # ./test_vmalloc.sh: You must have the following enabled in your kernel: # CONFIG_TEST_VMALLOC=m Rework to add the fragment 'CONFIG_TEST_VMALLOC=m' to the config file. Link: http://lkml.kernel.org/r/[email protected] Fixes: a05ef00c9790 ("selftests/vm: add script helper for CONFIG_TEST_VMALLOC_MODULE") Signed-off-by: Anders Roxell <[email protected]> Cc: Shuah Khan <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/vmalloc: add more comments to the adjust_va_to_fit_type() | Uladzislau Rezki (Sony) | 1 | -0/+13
When the fit type is NE_FIT_TYPE, one extra object is needed. Usually the "ne_fit_preload_node" per-CPU variable holds it and no GFP_NOWAIT allocation is needed, but there are exceptions. This commit just adds more explanation, answering questions such as when this can occur, how often, under which conditions, and what happens if the GFP_NOWAIT allocation fails. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Daniel Wagner <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/vmalloc: respect passed gfp_mask when doing preloading | Uladzislau Rezki (Sony) | 1 | -4/+4
Allocation functions should comply with the given gfp_mask as much as possible. The preallocation code in alloc_vmap_area doesn't follow that pattern and it is using a hardcoded GFP_KERNEL. Although this doesn't really make much difference because vmalloc is not GFP_NOWAIT compliant in general (e.g. page table allocations are GFP_KERNEL) there is no reason to spread that bad habit and it is good to fix the antipattern. [[email protected]: rewrite changelog] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Daniel Wagner <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/vmalloc: remove preempt_disable/enable when doing preloading | Uladzislau Rezki (Sony) | 1 | -17/+20
Some background: preemption was previously disabled to guarantee that a preloaded object is available for the CPU it was stored for. That was achieved by combining disabling preemption and taking the spin lock while ne_fit_preload_node is checked. The aim was to avoid allocating in atomic context when the spinlock is taken later, for regular vmap allocations. But that approach conflicts with the CONFIG_PREEMPT_RT philosophy: calling spin_lock() with preemption disabled is forbidden in a CONFIG_PREEMPT_RT kernel. Therefore, get rid of preempt_disable() and preempt_enable() when the preload is done for splitting purposes. As a result we no longer guarantee that a CPU is preloaded; instead we minimize the case when it is not, by populating the per-CPU preload pointer under the vmap_area_lock. This implies that at least each caller that has done the preallocation will not fall back to an atomic allocation later. It is possible that the preallocation is pointless, or that no preallocation is done because of a race, but the data show that this is really rare. For example, I ran a special test case that follows the preload pattern and path: 20 "unbind" threads run it and each does 1000000 allocations. On average only 3.5 times per 1000000 allocations was a CPU not preloaded. So it can happen, but the number is negligible. [[email protected]: changelog additions] Link: http://lkml.kernel.org/r/[email protected] Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose") Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Daniel Wagner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
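A condensed sketch of the resulting pattern, assuming the existing vmap_area_cachep slab and vmap_area_lock from mm/vmalloc.c (preload_and_lock() is an illustrative name, and the race handling is written with a per-CPU cmpxchg rather than copying the upstream code):

	/* Illustrative sketch: preemption stays enabled around the allocation. */
	static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);

	static void preload_and_lock(gfp_t gfp_mask, int node)
	{
		struct vmap_area *va = NULL;

		if (!this_cpu_read(ne_fit_preload_node))
			va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);

		spin_lock(&vmap_area_lock);

		/* We may have migrated or raced; if so, drop the spare object. */
		if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va))
			kmem_cache_free(vmap_area_cachep, va);
	}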
2019-12-01 | mm/vmalloc.c: remove unnecessary highmem_mask from parameter of gfpflags_allow_blocking() | Liu Xiang | 1 | -1/+1
gfpflags_allow_blocking() does not care about __GFP_HIGHMEM, so highmem_mask can be removed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Liu Xiang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/sparse.c: do not waste pre allocated memmap space | Michal Hocko | 1 | -6/+8
Vincent has noticed [1] that there is something unusual with the memmap allocations going on on his platform: : I noticed this because on my ARM64 platform, with 1 GiB of memory the : first [and only] section is allocated from the zeroing path while with : 2 GiB of memory the first 1 GiB section is allocated from the : non-zeroing path. The underlying problem is that although sparse_buffer_init allocates enough memory for all sections on the node, sparse_buffer_alloc is not able to consume it due to a mismatch in the expected allocation alignment: while the sparse_buffer_init preallocation uses PAGE_SIZE alignment, the real memmap has to be aligned to section_map_size(). This results in a wasted initial chunk of the preallocated memmap and an unnecessary fallback allocation for a section. While we are at it, also change __populate_section_memmap to align to the requested size, because at least VMEMMAP has constraints on having the memmap properly aligned. [1] http://lkml.kernel.org/r/[email protected] [[email protected]: tweak layout, per David] Link: http://lkml.kernel.org/r/[email protected] Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations") Signed-off-by: Michal Hocko <[email protected]> Reported-by: Vincent Whitchurch <[email protected]> Debugged-by: Vincent Whitchurch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/sparse.c: mark populate_section_memmap as __meminit | Ilya Leoshkevich | 1 | -2/+2
Building the kernel on s390 with -Og produces the following warning: WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap() The function populate_section_memmap() references the function __meminit __populate_section_memmap(). This is often because populate_section_memmap lacks a __meminit annotation or the annotation of __populate_section_memmap is wrong. While -Og is not supported, in theory this might still happen with another compiler or on another architecture. So fix this by using the correct section annotations. [[email protected]: v2] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ilya Leoshkevich <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/sparse: consistently do not zero memmap | Vincent Whitchurch | 1 | -1/+1
sparsemem without VMEMMAP has two allocation paths to allocate the memory needed for its memmap (done in sparse_mem_map_populate()). In one allocation path (sparse_buffer_alloc() succeeds), the memory is not zeroed (since it was previously allocated with memblock_alloc_try_nid_raw()). In the other allocation path (sparse_buffer_alloc() fails and sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the memory is zeroed. AFAICS this difference does not appear to be on purpose. If the code is supposed to work with non-initialized memory (__init_single_page() takes care of zeroing the struct pages which are actually used), we should consistently not zero the memory, to avoid masking bugs. ( I noticed this because on my ARM64 platform, with 1 GiB of memory the first [and only] section is allocated from the zeroing path while with 2 GiB of memory the first 1 GiB section is allocated from the non-zeroing path. ) Michal: "the main user visible problem is a memory wastage. The overal amount of memory should be small. I wouldn't call it stable material." Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vincent Whitchurch <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes | David Hildenbrand | 1 | -2/+26
Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes (a range that is not IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see add_memory_resource()). All memory blocks that belong to boot memory are already online. Note that boot memory can have holes and the memmap of the holes is marked PG_reserved. However, also memory allocated early during boot is PG_reserved - basically every page of boot memory that is not given to the buddy is PG_reserved. Therefore, when we stop allowing to offline memory blocks with holes, we implicitly no longer have to deal with onlining memory blocks with holes. E.g., online_pages() will do a walk_system_ram_range(..., online_pages_range), whereby online_pages_range() will effectively only free the memory holes not falling into a hole to the buddy. The other pages (holes) are kept PG_reserved (via move_pfn_range_to_zone()->memmap_init_zone()). This allows to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved completely in memmap_init_zone(). Offlining memory blocks added during boot is usually not guaranteed to work either way (unmovable data might have easily ended up on that memory during boot). So stopping to do that should not really hurt. Also, people are not even aware of a setup where onlining/offlining of memory blocks with holes used to work reliably (see [1] and [2] especially regarding the hotplug path) - I doubt it worked reliably. For the use case of offlining memory to unplug DIMMs, we should see no change. (holes on DIMMs would be weird). Please note that hardware errors (PG_hwpoison) are not memory holes and are not affected by this change when offlining. [1] https://lkml.org/lkml/2019/10/22/135 [2] https://lkml.org/lkml/2019/8/14/1365 Link: http://lkml.kernel.org/r/[email protected] Reviewed-by: Dan Williams <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
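For illustration, a check of this kind can be expressed with region_intersects(); this is a sketch of the idea, not the hunk that was actually merged:

	/* Sketch: reject online/offline if the block is not all System RAM. */
	static bool mem_block_has_holes(unsigned long start_pfn, unsigned long nr_pages)
	{
		return region_intersects(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages),
					 IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)
			!= REGION_INTERSECTS;
	}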
2019-12-01 | drivers/base/memory.c: drop the mem_sysfs_mutex | David Hildenbrand | 1 | -19/+14
The mem_sysfs_mutex isn't really helpful. Also, it's not really clear what the mutex protects at all. The device lists of the memory subsystem are protected separately. We don't need that mutex when looking up, creating, or removing independent devices. find_memory_block_by_id() will perform locking on its own and grab a reference of the returned device. At the time memory_dev_init() is called, we cannot have concurrent hot(un)plug operations yet - we're still fairly early during boot. We don't need any locking. The creation/removal of memory block devices should be protected on a higher level - especially using the device hotplug lock to avoid documented issues (see Documentation/core-api/memory-hotplug.rst) - or if that is reworked, using similar locking. Protecting in the context of these functions only doesn't really make sense. Especially, if we had a situation where the same memory blocks are created/deleted at the same time, something would be going horribly wrong (imagine adding/removing a DIMM at the same time from two call paths) - after the functions succeeded, something else in the callers would blow up (e.g., create_memory_block_devices() succeeded but there are no memory block devices anymore). All relevant call paths (except when adding memory early during boot via ACPI, which is now documented) hold the device hotplug lock when adding memory, and when removing memory. Let's document that instead. Add a simple safety net to create_memory_block_devices() in case we would actually remove memory blocks while adding them, so we'll never dereference a NULL pointer. Simplify memory_dev_init() now that the lock is gone. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | include/linux/memory_hotplug.h: move definitions of {set,clear}_zone_contiguous | Ben Dooks (Codethink) | 1 | -3/+3
The {set,clear}_zone_contiguous functions are built whatever the configuration, so move the definitions outside the current ifdef to avoid the following compiler warnings: mm/page_alloc.c:1550:6: warning: no previous prototype for 'set_zone_contiguous' [-Wmissing-prototypes] mm/page_alloc.c:1571:6: warning: no previous prototype for 'clear_zone_contiguous' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks (Codethink) <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE | David Hildenbrand | 4 | -13/+15
We have two types of users of page isolation: 1. Memory offlining: Offline memory so it can be unplugged. Memory won't be touched. 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to become the owner of the memory and make use of it. For example, in case we want to offline memory, we can ignore (skip over) PageHWPoison() pages, as the memory won't get used. We can allow to offline memory. In contrast, we don't want to allow to allocate such memory. Let's generalize the approach so we can special case other types of pages we want to skip over in case we offline memory. While at it, also pass the same flags to test_pages_isolated(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Qian Cai <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/page_alloc.c: don't set pages PageReserved() when offlining | David Hildenbrand | 2 | -7/+2
Patch series "mm: Memory offlining + page isolation cleanups", v2. This patch (of 2): We call __offline_isolated_pages() from __offline_pages() after all pages were isolated and are either free (PageBuddy()) or PageHWPoison. Nothing can stop us from offlining memory at this point. In __offline_isolated_pages() we first set all affected memory sections offline (offline_mem_sections(pfn, end_pfn)), to mark the memmap as invalid (pfn_to_online_page() will no longer succeed), and then walk over all pages to pull the free pages from the free lists (to the isolated free lists, to be precise). Note that re-onlining a memory block will result in the whole memmap getting reinitialized, overwriting any old state. We already poison the memmap when offlining is complete to find any access to stale/uninitialized memmaps. So, setting the pages PageReserved() is not helpful. The memmap is marked offline and all pageblocks are isolated. As soon as it is offline, the memmap is stale either way. This looks like a leftover from ancient times when we initialized the memmap when adding memory and not when onlining it (the pages were set PageReserved so re-onlining would work as expected). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Dan Williams <[email protected]> Cc: Wei Yang <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory_hotplug: remove __online_page_free() and __online_page_increment_counters() | David Hildenbrand | 2 | -14/+0
Let's drop the now unused functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Wei Yang <[email protected]> Cc: Dan Williams <[email protected]> Cc: Qian Cai <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Stephen Hemminger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | hv_balloon: use generic_online_page() | David Hildenbrand | 1 | -2/+1
Let's use the generic onlining function - which will now also take care of calling kernel_map_pages(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Qian Cai <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory_hotplug: export generic_online_page() | David Hildenbrand | 2 | -3/+3
Patch series "mm/memory_hotplug: Export generic_online_page()". Let's replace the __online_page...() functions by generic_online_page(). Hyper-V only wants to delay the actual onlining of un-backed pages, so we can simply re-use the generic function. This patch (of 3): Let's expose generic_online_page() so online_page_callback users can simply fall back to the generic implementation when actually deciding to online the pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Wei Yang <[email protected]> Cc: Qian Cai <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Stephen Hemminger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory_hotplug.c: add a bounds check to __add_pages() | Alastair D'Silva | 1 | -0/+20
On PowerPC, the address ranges allocated to OpenCAPI LPC memory are allocated from firmware. These address ranges may be higher than what older kernels permit, as we increased the maximum permissible address in commit 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to 2PB"). It is possible that the addressable range may change again in the future. In this scenario, we end up with a bogus section returned from __section_nr (see the discussion on the thread "mm: Trigger bug on if a section is not found in __section_nr"). Adding a check here means that we fail early and have an opportunity to handle the error gracefully, rather than rumbling on and potentially accessing an incorrect section. Further discussion is also on the thread ("powerpc: Perform a bounds check in arch_add_memory") http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alastair D'Silva <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
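The shape of such a check, as a sketch (the function name and exact limit derivation are illustrative; the real patch bases the limit on what sparsemem can address):

	static int bounds_check_pfn_range(unsigned long pfn, unsigned long nr_pages)
	{
		const unsigned long max_pfn_supported =
			PHYS_PFN(1ULL << MAX_PHYSMEM_BITS);

		if (pfn + nr_pages < pfn)			/* overflow */
			return -E2BIG;
		if (pfn + nr_pages > max_pfn_supported)		/* beyond addressable range */
			return -E2BIG;
		return 0;
	}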
2019-12-01 | mm/hotplug: reorder memblock_[free|remove]() calls in try_remove_memory() | Anshuman Khandual | 1 | -2/+2
Currently during the memory hot add procedure, memory gets into memblock before calling arch_add_memory(), which creates its linear mapping.

	add_memory_resource() {
		..................
		memblock_add_node()
		..................
		arch_add_memory()
		..................
	}

But during the memory hot remove procedure, removal from memblock happens first, before its linear mapping gets torn down with arch_remove_memory(), which is not consistent. Resource removal should happen in the reverse order in which the resources were added. However this does not pose any problem for now, unless there is an assumption regarding the linear mapping. One example was a subtle failure on the arm64 platform [1], though this has now found a different solution.

	try_remove_memory() {
		..................
		memblock_free()
		memblock_remove()
		..................
		arch_remove_memory()
		..................
	}

This changes the sequence of resource removal, including memblock and linear mapping tear down, during memory hot remove, which will now be the reverse of the order in which they were added during memory hot add. The changed removal order looks like the following.

	try_remove_memory() {
		..................
		arch_remove_memory()
		..................
		memblock_free()
		memblock_remove()
		..................
	}

[1] https://patchwork.kernel.org/patch/11127623/ Memory hot remove now works on arm64 without this because of a recent commit 60bb462fc7ad ("drivers/base/node.c: simplify unregister_memory_block_under_nodes()"). This does not fix a serious problem. It just removes an inconsistency while freeing resources during memory hot remove which for now does not pose a real problem. David mentioned that the re-ordering should still make sense for consistency purposes (removing stuff in the reverse order it was added). This patch is now detached from the arm64 hot-remove series. Michal: : I would just add a note that the inconsistency doesn't pose any problem now : but if somebody makes any assumptions about linear mappings then it could : get subtly broken like your example for arm64 which has found a different : solution in the meantime. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory-failure.c: use page_shift() in add_to_kill() | Yunfeng Ye | 1 | -1/+1
page_shift() is supported after the commit 94ad9338109f ("mm: introduce page_shift()"). So replace with page_shift() in add_to_kill() for readability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yunfeng Ye <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm, soft-offline: convert parameter to pfn | Naoya Horiguchi | 4 | -18/+12
Currently soft_offline_page() receives struct page, and its sibling memory_failure() receives pfn. This discrepancy looks weird and makes precheck on pfn validity tricky. So let's align them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andrew Morton <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/memory-failure.c clean up around tk pre-allocation | Jane Chu | 1 | -27/+13
add_to_kill() expects the first 'tk' to be pre-allocated and makes subsequent allocations on an as-needed basis, which makes the code a bit difficult to read. Move all the allocation internal to add_to_kill() and drop the **tk argument. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jane Chu <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | memfd: add test for COW on MAP_PRIVATE and F_SEAL_FUTURE_WRITE mappings | Joel Fernandes (Google) | 1 | -0/+36
In this test, the parent and child both have writable private mappings. The test shows that without the patch in this series, the parent and child shared the same memory which is incorrect. In other words, COW needs to be triggered so any writes to child's copy stays local to the child. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes (Google) <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm, memfd: fix COW issue on MAP_PRIVATE and F_SEAL_FUTURE_WRITE mappings | Nicolas Geoffray | 1 | -4/+7
F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE: A private mapping created after the memfd file that gets sealed with F_SEAL_FUTURE_WRITE loses the copy-on-write at fork behavior, meaning children and parent share the same memory, even though the mapping is private. The reason for this is due to the code below: static int shmem_mmap(struct file *file, struct vm_area_struct *vma) { struct shmem_inode_info *info = SHMEM_I(file_inode(file)); if (info->seals & F_SEAL_FUTURE_WRITE) { /* * New PROT_WRITE and MAP_SHARED mmaps are not allowed when * "future write" seal active. */ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE)) return -EPERM; /* * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED * read-only mapping, take care to not allow mprotect to revert * protections. */ vma->vm_flags &= ~(VM_MAYWRITE); } ... } And for the mm to know if a mapping is copy-on-write: static inline bool is_cow_mapping(vm_flags_t flags) { return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; } The patch fixes the issue by making the mprotect revert protection happen only for shared mappings. For private mappings, using mprotect will have no effect on the seal behavior. The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable kernels would need a backport. [[email protected]: reflow comment, per Christoph] Link: http://lkml.kernel.org/r/[email protected] Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd") Signed-off-by: Nicolas Geoffray <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
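A sketch of the resulting check in shmem_mmap(), condensed from the description above (see the actual commit for the exact hunk):

	if (info->seals & F_SEAL_FUTURE_WRITE) {
		/* New writable shared mappings are still refused outright. */
		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
			return -EPERM;

		/*
		 * Only lock down mprotect() for MAP_SHARED read-only mappings.
		 * MAP_PRIVATE keeps VM_MAYWRITE so is_cow_mapping() stays true
		 * and fork() sets up copy-on-write as usual.
		 */
		if (vma->vm_flags & VM_SHARED)
			vma->vm_flags &= ~(VM_MAYWRITE);
	}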
2019-12-01 | x86/mm/pat: Fix off-by-one bugs in interval tree search | Ingo Molnar | 1 | -6/+6
There's a bug in the new PAT code: the conversion of memtype_check_conflict() in 8d04a5f97a5f ("x86/mm/pat: Convert the PAT tree to a generic interval tree") is buggy:

	   dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
	   found_type = match->type;

	-  node = rb_next(&match->rb);
	-  while (node) {
	-          match = rb_entry(node, struct memtype, rb);
	-
	-          if (match->start >= end) /* Checked all possible matches */
	-                  goto success;
	-
	-          if (is_node_overlap(match, start, end) &&
	-              match->type != found_type) {
	+  match = memtype_interval_iter_next(match, start, end);
	+  while (match) {
	+          if (match->type != found_type)
	                   goto failure;
	-          }
	-          node = rb_next(&match->rb);
	+          match = memtype_interval_iter_next(match, start, end);
	   }

Note how the '>= end' condition to end the interval check got converted into:

	+  match = memtype_interval_iter_next(match, start, end);

This is subtly off by one, because the interval tree interfaces require closed interval parameters:

	include/linux/interval_tree_generic.h:

	/*                                                                    \
	 * Iterate over intervals intersecting [start;last]                   \
	 *                                                                    \
	 * Note that a node's interval intersects [start;last] iff:           \
	 *   Cond1: ITSTART(node) <= last                                     \
	 * and                                                                \
	 *   Cond2: start <= ITLAST(node)                                     \
	 */                                                                   \
	...
	if (ITSTART(node) <= last) {            /* Cond1 */                   \
	        if (start <= ITLAST(node))      /* Cond2 */                   \
	                return node;    /* node is leftmost match */          \

[start;last] is a closed interval (note that '<= last' check) - while the PAT 'end' parameter is 1 byte beyond the end of the range, because ioremap() and the other mapping APIs usually use the [start,end) half-open interval, derived from 'size'. This is what ioremap() does for example:

	/*
	 * Mappings have to be page-aligned
	 */
	offset = phys_addr & ~PAGE_MASK;
	phys_addr &= PHYSICAL_PAGE_MASK;
	size = PAGE_ALIGN(last_addr+1) - phys_addr;

	retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
				 pcm, &new_pcm);

phys_addr+size will be on a page boundary, after the last byte of the mapped interval. So the correct parameter to use in the interval tree searches is not 'end' but 'end-1'. This could have relevance if conflicting PAT ranges are exactly adjacent, for example a future WC region is followed immediately by an already mapped UC- region - in this case memtype_check_conflict() would incorrectly deny the WC memtype region and downgrade the memtype to UC-. BTW., rather annoyingly this downgrading is done silently in memtype_check_insert():

	int memtype_check_insert(struct memtype *new,
				 enum page_cache_mode *ret_type)
	{
		int err = 0;

		err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
		if (err)
			return err;

		if (ret_type)
			new->type = *ret_type;

		memtype_interval_insert(new, &memtype_rbroot);
		return 0;
	}

So on such a conflict we'd just silently get UC- in *ret_type, and write it into the new region, never the wiser ... So assuming that the patch below fixes the primary bug, the diagnostics side of ioremap() cache attribute downgrades would be another thing to fix. Anyway, I checked all the interval-tree iterations, and most of them are off by one - but I think the one related to memtype_check_conflict() is the one causing this particular performance regression. The only correct interval-tree searches were these two:

	arch/x86/mm/pat_interval.c:  match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
	arch/x86/mm/pat_interval.c:  match = memtype_interval_iter_next(match, 0, ULONG_MAX);

The ULONG_MAX was hiding the off-by-one in plain sight. :-) Note that the bug was probably benign in the sense of implementing a too strict cache attribute conflict policy and downgrading cache attributes, so AFAICS the worst outcome of this bug would be a performance regression, not any instabilities. Reported-by: kernel test robot <[email protected]> Reported-by: Kenneth R. Crudup <[email protected]> Reported-by: Mariusz Ceier <[email protected]> Tested-by: Mariusz Ceier <[email protected]> Tested-by: Kenneth R. Crudup <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-12-01 | docs: remove a bunch of stray CRs | Jonathan Corbet | 13 | -59/+59
A pair of documentation patches introduced a bunch of unwanted carriage-return characters into the docs. Remove them, and chase away the ghost of DOS for another day. Fixes: a016e092940f ("docs: admin-guide: dell_rbu: Improve formatting and spelling") Fixes: bdd68860a044 ("Documentation: networking: device drivers: Remove stray asterisks") Signed-off-by: Jonathan Corbet <[email protected]>
2019-12-01 | mm/memory.c: fix a huge pud insertion race during faulting | Thomas Hellstrom | 2 | -0/+31
A huge pud page can theoretically be faulted in racing with pmd_alloc() in __handle_mm_fault(). That will lead to pmd_alloc() returning an invalid pmd pointer. Fix this by adding a pud_trans_unstable() function similar to pmd_trans_unstable() and check whether the pud is really stable before using the pmd pointer.

Race:

	Thread 1:		Thread 2:		Comment
	create_huge_pud()				Fallback - not taken.
				create_huge_pud()	Taken.
	pmd_alloc()					Returns an invalid pointer.

This will result in user-visible huge page data corruption. Note that this was caught during a code audit rather than a real experienced problem. It looks to me like the only implementation that currently creates huge pud pagetable entries is dev_dax_huge_fault() which doesn't appear to care much about private (COW) mappings or write-tracking which is, I believe, a prerequisite for create_huge_pud() falling back on thread 1, but not in thread 2. Link: http://lkml.kernel.org/r/[email protected] Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Signed-off-by: Thomas Hellstrom <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
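A simplified sketch of the added check in __handle_mm_fault() (the real code re-runs the pud level rather than just returning, so treat this as an outline only):

	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
	if (!vmf.pmd)
		return VM_FAULT_OOM;

	/* A huge pud was faulted in under us: the pmd pointer is not valid. */
	if (pud_trans_unstable(vmf.pud))
		return 0;	/* let the fault be retried */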
2019-12-01 | mm: move the backup x_devmap() functions to asm-generic/pgtable.h | Thomas Hellstrom | 2 | -15/+15
The asm-generic/pgtable.h include file appears to be the correct place for the backup x_devmap() inline functions. Moving them here is also necessary if we want to include x_devmap() in the [pmd|pud]_unstable functions. So move the x_devmap() functions to asm-generic/pgtable.h Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Hellstrom <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/rmap.c: use VM_BUG_ON_PAGE() in __page_check_anon_rmap() | Yang Shi | 1 | -4/+3
__page_check_anon_rmap() just calls two BUG_ON()s protected by CONFIG_DEBUG_VM; the #ifdef can be eliminated by using VM_BUG_ON_PAGE(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/rmap.c: fix outdated comment in page_get_anon_vma() | Miles Chen | 1 | -3/+4
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit 5f0d5a3ae7cf ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Cc: Paul E. McKenney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | asm-generic/mm: stub out p{4,u}d_clear_bad() if __PAGETABLE_P{4,U}D_FOLDED | Vineet Gupta | 2 | -0/+20
This came up when removing __ARCH_HAS_5LEVEL_HACK for ARC as code bloat. With this patch we see the following code reduction. | bloat-o-meter2 vmlinux-D-elide-p4d_free_tlb vmlinux-E-elide-p?d_clear_bad | add/remove: 0/2 grow/shrink: 0/0 up/down: 0/-40 (-40) | function old new delta | pud_clear_bad 20 - -20 | p4d_clear_bad 20 - -20 | Total: Before=4136930, After=4136890, chg -1.000000% Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Will Deacon <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
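The shape of the change, roughly (see the patch itself for the exact hunk in asm-generic/pgtable.h):

	#ifndef __PAGETABLE_P4D_FOLDED
	void p4d_clear_bad(p4d_t *);
	#else
	static inline void p4d_clear_bad(p4d_t *p4d) { }	/* folded: can never be "bad" */
	#endif

	#ifndef __PAGETABLE_PUD_FOLDED
	void pud_clear_bad(pud_t *);
	#else
	static inline void pud_clear_bad(pud_t *pud) { }	/* folded: can never be "bad" */
	#endif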
2019-12-01 | asm-generic/tlb: stub out pmd_free_tlb() if nopmd | Vineet Gupta | 1 | -1/+1
This came up when removing __ARCH_HAS_5LEVEL_HACK for ARC as code bloat. With this patch we see the following code reduction. | bloat-o-meter2 vmlinux-E-elide-p?d_clear_bad vmlinux-F-elide-pmd_free_tlb | add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-112 (-112) | function old new delta | free_pgd_range 422 310 -112 | Total: Before=4137042, After=4136930, chg -1.000000% Note that pmd folding can be tricky: In 2-level setup (where pmd is conceptually folded) most pmd routines are valid and refer to upper levels. In this patch we can, but see next patch for example where we can't Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | asm-generic/tlb: stub out p4d_free_tlb() if nop4d ... | Vineet Gupta | 3 | -4/+1
... independent of __ARCH_HAS_5LEVEL_HACK This came up when removing __ARCH_HAS_5LEVEL_HACK for ARC as code bloat. With this patch we see the following code reduction | bloat-o-meter2 vmlinux-C-elide-pud_free_tlb vmlinux-D-elide-p4d_free_tlb | add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-104 (-104) | function old new delta | free_pgd_range 552 422 -130 | Total: Before=4137172, After=4137042, chg -1.000000% Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | asm-generic/tlb: stub out pud_free_tlb() if nopud ... | Vineet Gupta | 3 | -4/+1
... independent of __ARCH_HAS_4LEVEL_HACK

This came up when removing __ARCH_HAS_5LEVEL_HACK for ARC as code bloat. With this patch we see the following code reduction:

| bloat-o-meter2 vmlinux-B-elide-ARCH_USE_5LEVEL_HACK vmlinux-C-elide-pud_free_tlb
| add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-104 (-104)
| function             old     new   delta
| free_pgd_range       656     552    -104
| Total: Before=4137276, After=4137172, chg -1.000000%

Note: the primary change is the alternate definition for pud_free_tlb(), but while there the empty stubs for __pud_free_tlb() were also removed, since it is anyhow called only from pud_free_tlb(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | ARC: mm: remove __ARCH_USE_5LEVEL_HACK | Vineet Gupta | 3 | -4/+11
Patch series "elide extraneous generated code for folded p4d/pud/pmd", v3. This series came out of seemingly benign excursion into understanding/removing __ARCH_USE_5LEVEL_HACK from ARC port showing some extraneous code being generated despite folded p4d/pud/pmd | bloat-o-meter2 vmlinux-[AB]* | add/remove: 0/0 grow/shrink: 3/0 up/down: 130/0 (130) | function old new delta | free_pgd_range 548 660 +112 | p4d_clear_bad 2 20 +18 The patches here address that | bloat-o-meter2 vmlinux-[BF]* | add/remove: 0/2 grow/shrink: 0/1 up/down: 0/-386 (-386) | function old new delta | pud_clear_bad 20 - -20 | p4d_clear_bad 20 - -20 | free_pgd_range 660 314 -346 The code savings are not a whole lot, but still worthwhile IMHO. This patch (of 5): With paging code made 5-level compliant, this is no longer needed. ARC has software page walker with 2 lookup levels (pgd -> pte) This was expected to be non functional change but ended with slight code bloat due to needless inclusions of p*d_free_tlb() macros which will be addressed in further patches. | bloat-o-meter2 vmlinux-[AB]* | add/remove: 0/0 grow/shrink: 2/0 up/down: 128/0 (128) | function old new delta | free_pgd_range 546 656 +110 | p4d_clear_bad 2 20 +18 | Total: Before=4137148, After=4137276, chg 0.000000% Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vineet Gupta <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01 | mm/mmap.c: use IS_ERR_VALUE to check return value of get_unmapped_area | Gaowei Pu | 3 | -7/+8
get_unmapped_area() returns an address or -errno on failure. Historically we have checked for the failure by offset_in_page() which is correct but quite hard to read. Newer code started using IS_ERR_VALUE which is much easier to read. Convert remaining users of offset_in_page as well. [[email protected]: rewrite changelog] [[email protected]: fix mremap.c and uprobes.c sites also] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gaowei Pu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Qian Cai <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
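For comparison, a sketch of the before/after pattern (an illustrative call site, not a specific hunk from the patch):

	unsigned long addr = get_unmapped_area(file, hint, len, pgoff, flags);

	/* Before: the error check hides behind an alignment helper. */
	if (offset_in_page(addr))
		return addr;		/* addr is really -errno here */

	/* After: the intent is explicit. */
	if (IS_ERR_VALUE(addr))
		return addr;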
2019-12-01 | mm/rmap.c: reuse mergeable anon_vma as parent when fork | Wei Yang | 1 | -0/+13
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to reuse it. While on fork, the logic is different. Since commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue"), function anon_vma_clone() tries to allocate new anon_vma for child process. But the logic here will allocate a new anon_vma for each vma, even in parent this vma is mergeable and share the same anon_vma with its sibling. This may do better for scalability issue, while it is not necessary to do so especially after interval tree is used. Commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") tries to reuse some anon_vma by counting child anon_vma and attached vmas. While for those mergeable anon_vmas, we can just reuse it and not necessary to go through the logic. After this change, kernel build test reduces 20% anon_vma allocation. Do the same kernel build test, it shows run time in sys reduced 11.6%. Origin: real 2m50.467s user 17m52.002s sys 1m51.953s real 2m48.662s user 17m55.464s sys 1m50.553s real 2m51.143s user 17m59.687s sys 1m53.600s Patched: real 2m39.933s user 17m1.835s sys 1m38.802s real 2m39.321s user 17m1.634s sys 1m39.206s real 2m39.575s user 17m1.420s sys 1m38.845s Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Qian Cai <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/rmap.c: don't reuse anon_vma if we just want a copyWei Yang1-9/+15
Before commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"), anon_vma_clone() did not change dst->anon_vma. Since that commit, anon_vma_clone() tries to reuse an existing anon_vma on forking. But the commit goes a little further and also affects the non-fork cases. anon_vma_clone() is called from __vma_split(), __split_vma(), copy_vma() and anon_vma_fork(). For the first three callers, the purpose is to get a copy of src, and we do not expect dst->anon_vma to be touched even if it is NULL. After that commit, however, it is possible for an anon_vma to be reused when dst->anon_vma is NULL. This is not what we intend to have. This patch stops the reuse of anon_vma for the non-fork cases. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Signed-off-by: Wei Yang <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Qian Cai <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
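Conceptually the reuse branch must be restricted to the fork path; one way to picture that (illustrative only - the upstream fix expresses the guard differently) is that only in anon_vma_fork() do dst and src belong to different mm_structs:

    /*
     * Illustration only: anon_vma_clone() may pick an anon_vma for dst
     * only when it is cloning across processes (fork); the plain copy
     * callers just duplicate the chain and leave dst->anon_vma alone.
     */
    bool forking = dst->vm_mm != src->vm_mm;    /* hypothetical guard */

    if (forking && !dst->anon_vma && anon_vma != src->anon_vma &&
        anon_vma->degree < 2)
        dst->anon_vma = anon_vma;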
2019-12-01mm/mmap.c: rb_parent is not necessary in __vma_link_list()Wei Yang4-9/+5
Currently we use rb_parent to get next, but this is not necessary. When prev is NULL, vma should be the first element in the list, so next should be the current first one (mm->mmap), no matter whether we have an rb parent or not. After removing it, the code shows the beauty of symmetry. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: Andrew Morton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
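Roughly what the helper looks like once rb_parent is gone (a sketch; the in-tree function may differ in detail):

    void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
                         struct vm_area_struct *prev)
    {
        struct vm_area_struct *next;

        vma->vm_prev = prev;
        if (prev) {
            next = prev->vm_next;
            prev->vm_next = vma;
        } else {
            next = mm->mmap;        /* vma becomes the first element */
            mm->mmap = vma;
        }
        vma->vm_next = next;
        if (next)
            next->vm_prev = vma;
    }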
2019-12-01mm/mmap.c: extract __vma_unlink_list() as counterpart for __vma_link_list()Wei Yang4-18/+17
Just make the code a little easier to read. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
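A sketch of what such a counterpart would look like (details may differ from the actual helper):

    void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
    {
        struct vm_area_struct *prev, *next;

        next = vma->vm_next;
        prev = vma->vm_prev;
        if (prev)
            prev->vm_next = next;
        else
            mm->mmap = next;        /* vma was the first element */
        if (next)
            next->vm_prev = prev;
    }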
2019-12-01mm/mmap.c: __vma_unlink_prev() is not necessary nowWei Yang1-8/+1
The third parameter of __vma_unlink_common() is enough to differentiate these two cases, so __vma_unlink_prev() is not necessary any more. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01mm/mmap.c: prev could be retrieved from vma->vm_prevWei Yang1-13/+7
Currently __vma_unlink_common() handles two cases:
* has_prev
* or not
When has_prev is false, it is obvious that prev is calculated from vma->vm_prev in __vma_unlink_common(). When has_prev is true, prev is passed through from __vma_unlink_prev() in __vma_adjust() for the non-case-8 path, and at the beginning next is calculated from vma->vm_next, which implies vma is next->vm_prev. That sounds a little complicated, but looked at from another point of view: no matter whether vma and next are swapped, the mmap linked list still preserves its property, so it is always correct to access vma->vm_prev. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
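In code terms the simplification boils down to something like this (a sketch, not the exact hunk):

    /*
     * Whether or not vma and next were swapped earlier in __vma_adjust(),
     * the mmap list links remain consistent, so prev can always be taken
     * from the list itself instead of being passed in via has_prev:
     */
    struct vm_area_struct *prev = vma->vm_prev;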
2019-12-01mm/swap.c: piggyback lru_add_drain_all() callsKonstantin Khlebnikov1-1/+15
This is a very slow operation. Right now POSIX_FADV_DONTNEED is the top user because it has to freeze page references when removing pages from the cache. invalidate_bdev() calls it for the same reason. Both are triggered from userspace, so it's easy to generate a storm. mlock/mlockall no longer call lru_add_drain_all() - I have seen serious slowdowns from it on older kernels. There are some less obvious paths in memory migration/CMA/offlining which shouldn't call it frequently. The worst case requires a non-trivial workload, because lru_add_drain_all() skips cpus where the vectors are empty. Something must constantly generate a flow of pages on each cpu. The cpus must also be busy, so that scheduling the per-cpu work is slower. And the machine must be big enough (64+ cpus in our case). In our case it was a massive series of mlock calls in map-reduce while other tasks wrote logs (generating flows of new pages in the per-cpu vectors). The mlock calls were serialized by a mutex and accumulated latency of up to 10 seconds or more. The kernel has not called lru_add_drain_all() on mlock paths since 4.15, but the same scenario could be triggered by fadvise(POSIX_FADV_DONTNEED) or any other remaining user. There is no reason to do the drain again if somebody else already drained all the per-cpu vectors while we waited for the lock. Piggyback on a drain starting and finishing while we wait for the lock: all pages pending at the time of our entry were drained from the vectors. Callers like POSIX_FADV_DONTNEED retry their operations once after draining the per-cpu vectors when pages have unexpected references. Link: http://lkml.kernel.org/r/157019456205.3142.3369423180908482020.stgit@buzz Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
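The piggybacking can be sketched with a generation counter bumped at the start and end of each drain pass (a simplification with assumed names; the actual patch uses a seqcount):

    void lru_add_drain_all(void)
    {
        static DEFINE_MUTEX(lock);
        static unsigned long generation;
        unsigned long gen_at_entry;

        gen_at_entry = READ_ONCE(generation);

        mutex_lock(&lock);

        /*
         * A full drain pass that started after we sampled the generation
         * bumps it by 2 (start + finish).  If that happened while we were
         * waiting for the mutex, every page pending at our entry has
         * already been flushed from the per-cpu vectors - nothing to do.
         */
        if (READ_ONCE(generation) - gen_at_entry >= 2)
            goto done;

        generation++;           /* drain pass begins */
        /* ... schedule and flush the per-cpu pagevec work as before ... */
        generation++;           /* drain pass complete */

    done:
        mutex_unlock(&lock);
    }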
2019-12-01mm/mmap.c: remove a never-triggered warning in __vma_adjust()Wei Yang1-2/+0
The upper level of "if" makes sure (end >= next->vm_end), which means there are only two possibilities: 1) end == next->vm_end 2) end > next->vm_end remove_next is assigned to be (1 + end > next->vm_end). This means if remove_next is 1, end must equal to next->vm_end. The VM_WARN_ON will never trigger. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-12-01rss_stat: add support to detect RSS updates of external mmJoel Fernandes (Google)5-20/+66
When a process updates the RSS of a different process, the rss_stat tracepoint appears in the context of the process doing the update. This can mislead userspace into thinking that the RSS of the process doing the update was changed, while in reality a different process's RSS was updated. This issue happens in reclaim paths such as direct reclaim or background reclaim. This patch adds more information to the tracepoint: whether the mm being updated belongs to the current process's context (curr field). We also include a hash of the mm pointer so that the process the mm belongs to can be uniquely identified (mm_id field). Also, vsprintf.c is refactored a bit to allow reuse of the hashing code. [[email protected]: remove unused local `str'] [[email protected]: inline call to ptr_to_hashval] Link: http://lore.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes (Google) <[email protected]> Reported-by: Ioannis Ilkos <[email protected]> Acked-by: Petr Mladek <[email protected]> [lib/vsprintf.c] Cc: Tim Murray <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Carmen Jackson <[email protected]> Cc: Mayank Gupta <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Steven Rostedt (VMware) <[email protected]> Cc: Minchan Kim <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
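Roughly, the two new fields amount to this inside the tracepoint's assignment (the field names come from the changelog; the helper name and exact expressions are assumptions):

    /* inside the rss_stat tracepoint's TP_fast_assign(), approximately: */
    __entry->curr  = !!(current->mm == mm);   /* did current update its own mm?   */
    __entry->mm_id = mm_ptr_to_hash(mm);      /* stable, anonymised id for the mm */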
2019-12-01mm: emit tracepoint when RSS changesJoel Fernandes (Google)3-3/+38
Useful for tracking how RSS changes per TGID, to detect RSS spikes and memory hogs. Several Android teams have been using this patch in various kernel trees for half a year now. Many reported to me that it is really useful, so I'm posting it upstream. Initial patch developed by Tim Murray. Changes I made from the original patch:
o Prevent any additional space being consumed by mm_struct.
Regarding the concern that the RSS may change too often and thus flood the traces - note that there is already some "hysteresis" here: we update the counter only once we receive 64 page faults, due to SPLIT_RSS_ACCOUNTING. However, during zapping or copying of a pte range, the RSS is updated immediately, which can become noisy/flooding. In a previous discussion, we agreed that BPF or ftrace can be used to rate-limit the signal if this becomes an issue. Also note that I added wrappers around trace_rss_stat to prevent compiler errors where linux/mm.h is included from tracing code, causing errors such as:
CC kernel/trace/power-traces.o
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/kmem.h:342,
from ./include/linux/mm.h:31,
from ./include/linux/ring_buffer.h:5,
from ./include/linux/trace_events.h:6,
from ./include/trace/events/power.h:12,
from kernel/trace/power-traces.c:15:
./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
struct trace_entry ent; \
Link: http://lore.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Co-developed-by: Tim Murray <[email protected]> Signed-off-by: Tim Murray <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Carmen Jackson <[email protected]> Cc: Mayank Gupta <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Steven Rostedt (VMware) <[email protected]> Cc: Minchan Kim <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
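The wrapper idea can be sketched like this (a simplification; the exact helper name and signature in mm.h are assumptions):

    /* out-of-line helper so linux/mm.h need not pull in the tracepoint headers */
    void mm_trace_rss_stat(struct mm_struct *mm, int member, long count);

    static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
    {
        long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);

        mm_trace_rss_stat(mm, member, count);   /* ends up calling trace_rss_stat() */
    }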