aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2021-03-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds10-164/+234
Merge misc fixes from Andrew Morton: "28 patches. Subsystems affected by this series: mm (memblock, pagealloc, hugetlb, highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and ia64" * emailed patches from Andrew Morton <[email protected]>: (28 commits) zram: fix broken page writeback zram: fix return value on writeback_store mm/memcg: set memcg when splitting page mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls mm/userfaultfd: fix memory corruption due to writeprotect kasan: fix KASAN_STACK dependency for HW_TAGS kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC mm/madvise: replace ptrace attach requirement for process_madvise include/linux/sched/mm.h: use rcu_dereference in in_vfork() kfence: fix reports if constant function prefixes exist kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations kfence: fix printk format for ptrdiff_t linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* MAINTAINERS: exclude uapi directories in API/ABI section binfmt_misc: fix possible deadlock in bm_register_write mm/highmem.c: fix zero_user_segments() with start > end hugetlb: do early cow when page pinned on src mm mm: use is_cow_mapping() across tree where proper ...
2021-03-13mm/memcg: set memcg when splitting pageZhou Guanghui1-0/+1
As described in the split_page() comment, for the non-compound high order page, the sub-pages must be freed individually. If the memcg of the first page is valid, the tail pages cannot be uncharged when be freed. For example, when alloc_pages_exact is used to allocate 1MB continuous physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is set). When make_alloc_exact free the unused 1MB and free_pages_exact free the applied 1MB, actually, only 4KB(one page) is uncharged. Therefore, the memcg of the tail page needs to be set when splitting a page. Michel: There are at least two explicit users of __GFP_ACCOUNT with alloc_exact_pages added recently. See 7efe8ef274024 ("KVM: arm64: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713 ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not just a theoretical issue. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhou Guanghui <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Rui Xiang <[email protected]> Cc: Tianhong Ding <[email protected]> Cc: Weilong Chen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add ↵Zhou Guanghui2-10/+7
nr_pages argument Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass in page number argument. In this way, the interface name is more common and can be used by potential users. In addition, the complete info(memcg and flag) of the memcg needs to be set to the tail pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhou Guanghui <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Tianhong Ding <[email protected]> Cc: Weilong Chen <[email protected]> Cc: Rui Xiang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm/userfaultfd: fix memory corruption due to writeprotectNadav Amit1-0/+8
Userfaultfd self-test fails occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range() and defers flushes, and since there is insufficient consideration of concurrent deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from the TLBs in wp_page_copy(), this flush takes place after the copy has already been performed, and therefore changes of the page are possible between the time of the copy and the time in which the PTE is flushed. To make matters worse, memory-unprotection using userfaultfd also poses a problem. Although memory unprotection is logically a promotion of PTE permissions, and therefore should not require a TLB flush, the current userrfaultfd code might actually cause a demotion of the architectural PTE permission: when userfaultfd_writeprotect() unprotects memory region, it unintentionally *clears* the RW-bit if it was already set. Note that this unprotecting a PTE that is not write-protected is a valid use-case: the userfaultfd monitor might ask to unprotect a region that holds both write-protected and write-unprotected PTEs. The scenario that happens in selftests/vm/userfaultfd is as follows: cpu0 cpu1 cpu2 ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-*unprotect* ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ *clear* “write”-bit ] [ defer TLB flushes ] [ page-fault ] ... wp_page_copy() cow_user_page() [ copy page ] [ write to old page ] ... set_pte_at_notify() A similar scenario can happen: cpu0 cpu1 cpu2 cpu3 ---- ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-protect ] [ deferred TLB flush ] userfaultfd_writeprotect() [ write-unprotect ] [ deferred TLB flush] [ page-fault ] wp_page_copy() cow_user_page() [ copy page ] ... [ write to page ] set_pte_at_notify() This race exists since commit 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made wp_page_copy() more likely to take place, specifically if page_count(page) > 1. To resolve the aforementioned races, check whether there are pending flushes on uffd-write-protected VMAs, and if there are, perform a flush before doing the COW. Further optimizations will follow to avoid during uffd-write-unprotect unnecassary PTE write-protection and TLB flushes. Link: https://lkml.kernel.org/r/[email protected] Fixes: 09854ba94c6a ("mm: do_wp_page() simplification") Signed-off-by: Nadav Amit <[email protected]> Suggested-by: Yu Zhao <[email protected]> Reviewed-by: Peter Xu <[email protected]> Tested-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Will Deacon <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: <[email protected]> [5.9+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOCAndrey Konovalov1-2/+6
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called after debug_pagealloc_unmap_pages(). This causes a crash when debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an unmapped page. This patch puts kasan_free_nondeferred_pages() before debug_pagealloc_unmap_pages() and arch_free_page(), which can also make the page unavailable. Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS") Signed-off-by: Andrey Konovalov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Marco Elver <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm/madvise: replace ptrace attach requirement for process_madviseSuren Baghdasaryan1-1/+12
process_madvise currently requires ptrace attach capability. PTRACE_MODE_ATTACH gives one process complete control over another process. It effectively removes the security boundary between the two processes (in one direction). Granting ptrace attach capability even to a system process is considered dangerous since it creates an attack surface. This severely limits the usage of this API. The operations process_madvise can perform do not affect the correctness of the operation of the target process; they only affect where the data is physically located (and therefore, how fast it can be accessed). What we want is the ability for one process to influence another process in order to optimize performance across the entire system while leaving the security boundary intact. Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata and CAP_SYS_NICE for influencing process performance. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jeff Vander Stoep <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tim Murray <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: James Morris <[email protected]> Cc: <[email protected]> [5.10+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13kfence: fix reports if constant function prefixes existMarco Elver1-6/+12
Some architectures prefix all functions with a constant string ('.' on ppc64). Add ARCH_FUNC_PREFIX, which may optionally be defined in <asm/kfence.h>, so that get_stack_skipnr() can work properly. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Reported-by: Christophe Leroy <[email protected]> Tested-by: Christophe Leroy <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocationsMarco Elver1-1/+1
cache_alloc_debugcheck_after() performs checks on an object, including adjusting the returned pointer. None of this should apply to KFENCE objects. While for non-bulk allocations, the checks are skipped when we allocate via KFENCE, for bulk allocations cache_alloc_debugcheck_after() is called via cache_alloc_debugcheck_after_bulk(). Fix it by skipping cache_alloc_debugcheck_after() for KFENCE objects. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13kfence: fix printk format for ptrdiff_tMarco Elver1-6/+6
Use %td for ptrdiff_t. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Reported-by: Christophe Leroy <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Cc: Dmitriy Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Jann Horn <[email protected]> Cc: Christophe Leroy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm/highmem.c: fix zero_user_segments() with start > endOGAWA Hirofumi1-5/+12
zero_user_segments() is used from __block_write_begin_int(), for example like the following zero_user_segments(page, 4096, 1024, 512, 918) But new the zero_user_segments() implementation for for HIGHMEM + TRANSPARENT_HUGEPAGE doesn't handle "start > end" case correctly, and hits BUG_ON(). (we can fix __block_write_begin_int() instead though, it is the old and multiple usage) Also it calls kmap_atomic() unnecessarily while start == end == 0. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0060ef3b4e6d ("mm: support THPs in zero_user_segments") Signed-off-by: OGAWA Hirofumi <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13hugetlb: do early cow when page pinned on src mmPeter Xu1-4/+62
This is the last missing piece of the COW-during-fork effort when there're pinned pages found. One can reference 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes", 2020-09-27) for more information, since we do similar things here rather than pte this time, but just for hugetlb. Note that after Jason's recent work on 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with COW during fork", 2020-12-15) which is safer and easier to understand, we're safe now within the whole copy_page_range() against gup-fast, we don't need the wr-protect trick that proposed in 70e806e4e645 anymore. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: David Gibson <[email protected]> Cc: Gal Pressman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Roland Scheidegger <[email protected]> Cc: VMware Graphics <[email protected]> Cc: Wei Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm: use is_cow_mapping() across tree where properPeter Xu1-3/+1
After is_cow_mapping() is exported in mm.h, replace some manual checks elsewhere throughout the tree but start to use the new helper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: VMware Graphics <[email protected]> Cc: Roland Scheidegger <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Gibson <[email protected]> Cc: Gal Pressman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Wei Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm: introduce page_needs_cow_for_dma() for deciding whether cowPeter Xu3-18/+3
We've got quite a few places (pte, pmd, pud) that explicitly checked against whether we should break the cow right now during fork(). It's easier to provide a helper, especially before we work the same thing on hugetlbfs. Since we'll reference is_cow_mapping() in mm.h, move it there too. Actually it suites mm.h more since internal.h is mm/ only, but mm.h is exported to the whole kernel. With that we should expect another patch to use is_cow_mapping() whenever we can across the kernel since we do use it quite a lot but it's always done with raw code against VM_* flags. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: David Gibson <[email protected]> Cc: Gal Pressman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Roland Scheidegger <[email protected]> Cc: VMware Graphics <[email protected]> Cc: Wei Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13hugetlb: break earlier in add_reservation_in_range() when we canPeter Xu1-1/+1
All the regions maintained in hugetlb reserved map is inclusive on "from" but exclusive on "to". We can break earlier even if rg->from==t because it already means no possible intersection. This does not need a Fixes in all cases because when it happens (rg->from==t) we'll not break out of the loop while we should, however the next thing we'd do is still add the last file_region we'd need and quit the loop in the next round. So this change is not a bugfix (since the old code should still run okay iiuc), but we'd better still touch it up to make it logically sane. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: David Gibson <[email protected]> Cc: Gal Pressman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Roland Scheidegger <[email protected]> Cc: VMware Graphics <[email protected]> Cc: Wei Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13hugetlb: dedup the code to add a new file_regionPeter Xu1-24/+27
Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5. As reported by Gal [1], we still miss the code clip to handle early cow for hugetlb case, which is true. Again, it still feels odd to fork() after using a few huge pages, especially if they're privately mapped to me.. However I do agree with Gal and Jason in that we should still have that since that'll complete the early cow on fork effort at least, and it'll still fix issues where buffers are not well under control and not easy to apply MADV_DONTFORK. The first two patches (1-2) are some cleanups I noticed when reading into the hugetlb reserve map code. I think it's good to have but they're not necessary for fixing the fork issue. The last two patches (3-4) are the real fix. I tested this with a fork() after some vfio-pci assignment, so I'm pretty sure the page copy path could trigger well (page will be accounted right after the fork()), but I didn't do data check since the card I assigned is some random nic. https://github.com/xzpeter/linux/tree/fork-cow-pin-huge [1] https://lore.kernel.org/lkml/[email protected]/ Introduce hugetlb_resv_map_add() helper to add a new file_region rather than duplication the similar code twice in add_reservation_in_range(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Gal Pressman <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Wei Zhang <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Gibson <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jan Kara <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Roland Scheidegger <[email protected]> Cc: VMware Graphics <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-13mm/page_alloc.c: refactor initialization of struct page for holes in memory ↵Mike Rapoport1-83/+75
layout There could be struct pages that are not backed by actual physical memory. This can happen when the actual memory bank is not a multiple of SECTION_SIZE or when an architecture does not register memory holes reserved by the firmware as memblock.memory. Such pages are currently initialized using init_unavailable_mem() function that iterates through PFNs in holes in memblock.memory and if there is a struct page corresponding to a PFN, the fields of this page are set to default values and it is marked as Reserved. init_unavailable_mem() does not take into account zone and node the page belongs to and sets both zone and node links in struct page to zero. Before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") the holes inside a zone were re-initialized during memmap_init() and got their zone/node links right. However, after that commit nothing updates the struct pages representing such holes. On a system that has firmware reserved holes in a zone above ZONE_DMA, for instance in a configuration below: # grep -A1 E820 /proc/iomem 7a17b000-7a216fff : Unknown E820 type 7a217000-7bffffff : System RAM unset zone link in struct page will trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); in set_pfnblock_flags_mask() when called with a struct page from a range other than E820_TYPE_RAM because there are pages in the range of ZONE_DMA32 but the unset zone link in struct page makes them appear as a part of ZONE_DMA. Interleave initialization of the unavailable pages with the normal initialization of memory map, so that zone and node information will be properly set on struct pages that are not backed by the actual memory. With this change the pages for holes inside a zone will get proper zone/node links and the pages that are not spanned by any node will get links to the adjacent zone/node. The holes between nodes will be prepended to the zone/node above the hole and the trailing pages in the last section that will be appended to the zone/node below. [[email protected]: don't initialize static to zero, use %llu for u64] Link: https://lkml.kernel.org/r/[email protected] Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Mike Rapoport <[email protected]> Reported-by: Qian Cai <[email protected]> Reported-by: Andrea Arcangeli <[email protected]> Reviewed-by: Baoquan He <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Wilson <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Łukasz Majczak <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Sarvela, Tomi P" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-12Merge tag 'arm64-fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "We've got a smattering of changes all over the place which we've acrued since -rc1. To my knowledge, there aren't any pending issues at the moment, but there's still plenty of time for something else to crop up... Summary: - Fix booting a 52-bit-VA-aware kernel on Qualcomm Amberwing - Fix pfn_valid() not to reject all ZONE_DEVICE memory - Fix memory tagging setup for hotplugged memory regions - Fix KASAN tagging in page_alloc() when DEBUG_VIRTUAL is enabled - Fix accidental truncation of CPU PMU event counters - Fix error code initialisation when failing probe of DMC620 PMU - Fix return value initialisation for sve-ptrace selftest - Drop broken support for CMDLINE_EXTEND" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: perf/arm_dmc620_pmu: Fix error return code in dmc620_pmu_device_probe() arm64: mm: remove unused __cpu_uses_extended_idmap[_level()] arm64: mm: use a 48-bit ID map when possible on 52-bit VA builds arm64: perf: Fix 64-bit event counter read truncation arm64/mm: Fix __enable_mmu() for new TGRAN range values kselftest: arm64: Fix exit code of sve-ptrace arm64: mte: Map hotplugged memory as Normal Tagged arm64: kasan: fix page_alloc tagging with DEBUG_VIRTUAL arm64/mm: Reorganize pfn_valid() arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory arm64/mm: Drop THP conditionality from FORCE_MAX_ZONEORDER arm64/mm: Drop redundant ARCH_WANT_HUGE_PMD_SHARE arm64: Drop support for CMDLINE_EXTEND arm64: cpufeatures: Fix handling of CONFIG_CMDLINE for idreg overrides
2021-03-10Revert "mm, slub: consider rest of partial list if acquire_slab() fails"Linus Torvalds1-1/+1
This reverts commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf. The kernel test robot reports a huge performance regression due to the commit, and the reason seems fairly straightforward: when there is contention on the page list (which is what causes acquire_slab() to fail), we do _not_ want to just loop and try again, because that will transfer the contention to the 'n->list_lock' spinlock we hold, and just make things even worse. This is admittedly likely a problem only on big machines - the kernel test robot report comes from a 96-thread dual socket Intel Xeon Gold 6252 setup, but the regression there really is quite noticeable: -47.9% regression of stress-ng.rawpkt.ops_per_sec and the commit that was marked as being fixed (7ced37197196: "slub: Acquire_slab() avoid loop") actually did the loop exit early very intentionally (the hint being that "avoid loop" part of that commit message), exactly to avoid this issue. The correct thing to do may be to pick some kind of reasonable middle ground: instead of breaking out of the loop on the very first sign of contention, or trying over and over and over again, the right thing may be to re-try _once_, and then give up on the second failure (or pick your favorite value for "once"..). Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-03-10arm64: mte: Map hotplugged memory as Normal TaggedCatalin Marinas1-1/+1
In a system supporting MTE, the linear map must allow reading/writing allocation tags by setting the memory type as Normal Tagged. Currently, this is only handled for memory present at boot. Hotplugged memory uses Normal non-Tagged memory. Introduce pgprot_mhp() for hotplugged memory and use it in add_memory_resource(). The arm64 code maps pgprot_mhp() to pgprot_tagged(). Note that ZONE_DEVICE memory should not be mapped as Tagged and therefore setting the memory type in arch_add_memory() is not feasible. Signed-off-by: Catalin Marinas <[email protected]> Fixes: 0178dc761368 ("arm64: mte: Use Normal Tagged attributes for the linear map") Reported-by: Patrick Daly <[email protected]> Tested-by: Patrick Daly <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: <[email protected]> # 5.10.x Cc: Will Deacon <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2021-03-02swap: fix swapfile read/write offsetJens Axboe2-5/+13
We're not factoring in the start of the file for where to write and read the swapfile, which leads to very unfortunate side effects of writing where we should not be... Fixes: 48d15436fde6 ("mm: remove get_swap_bio") Signed-off-by: Jens Axboe <[email protected]>
2021-02-26MIPS: make userspace mapping young by defaultHuang Pei1-4/+0
MIPS page fault path(except huge page) takes 3 exceptions (1 TLB Miss + 2 TLB Invalid), butthe second TLB Invalid exception is just triggered by __update_tlb from do_page_fault writing tlb without _PAGE_VALID set. With this patch, user space mapping prot is made young by default (with both _PAGE_VALID and _PAGE_YOUNG set), and it only take 1 TLB Miss + 1 TLB Invalid exception Remove pte_sw_mkyoung without polluting MM code and make page fault delay of MIPS on par with other architecture Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Huang Pei <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Acked-by: <[email protected]> Acked-by: Thomas Bogendoerfer <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: <[email protected]> Cc: Bibo Mao <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Paul Burton <[email protected]> Cc: Li Xuefeng <[email protected]> Cc: Yang Tiezhu <[email protected]> Cc: Gao Juxin <[email protected]> Cc: Fuxin Zhang <[email protected]> Cc: Huacai Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: clarify that only first bug is reported in HW_TAGSAndrey Konovalov1-1/+1
Hwardware tag-based KASAN only reports the first found bug. After that MTE tag checking gets disabled. Clarify this in comments and documentation. Link: https://lkml.kernel.org/r/00383ba88a47c3f8342d12263c24bdf95527b07d.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: inline HW_TAGS helper functionsAndrey Konovalov1-6/+7
Mark all static functions in common.c and kasan.h that are used for hardware tag-based KASAN as inline to avoid unnecessary function calls. Link: https://lkml.kernel.org/r/2c94a2af0657f2b95b9337232339ff5ffa643ab5.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: ensure poisoning size alignmentAndrey Konovalov3-31/+48
A previous changes d99f6a10c161 ("kasan: don't round_up too much") attempted to simplify the code by adding a round_up(size) call into kasan_poison(). While this allows to have less round_up() calls around the code, this results in round_up() being called multiple times. This patch removes round_up() of size from kasan_poison() and ensures that all callers round_up() the size explicitly. This patch also adds WARN_ON() alignment checks for address and size to kasan_poison() and kasan_unpoison(). Link: https://lkml.kernel.org/r/3ffe8d4a246ae67a8b5e91f65bf98cd7cba9d7b9.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan, mm: optimize krealloc poisoningAndrey Konovalov2-8/+24
Currently, krealloc() always calls ksize(), which unpoisons the whole object including the redzone. This is inefficient, as kasan_krealloc() repoisons the redzone for objects that fit into the same buffer. This patch changes krealloc() instrumentation to use uninstrumented __ksize() that doesn't unpoison the memory. Instead, kasan_kreallos() is changed to unpoison the memory excluding the redzone. For objects that don't fit into the old allocation, this patch disables KASAN accessibility checks when copying memory into a new object instead of unpoisoning it. Link: https://lkml.kernel.org/r/9bef90327c9cb109d736c40115684fd32f49e6b0.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan, mm: fail krealloc on freed objectsAndrey Konovalov1-0/+3
Currently, if krealloc() is called on a freed object with KASAN enabled, it allocates and returns a new object, but doesn't copy any memory from the old one as ksize() returns 0. This makes the caller believe that krealloc() succeeded (KASAN report is printed though). This patch adds an accessibility check into __do_krealloc(). If the check fails, krealloc() returns NULL. This check duplicates the one in ksize(); this is fixed in the following patch. This patch also adds a KASAN-KUnit test to check krealloc() behaviour when it's called on a freed object. Link: https://lkml.kernel.org/r/cbcf7b02be0a1ca11de4f833f2ff0b3f2c9b00c8.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: unify large kfree checksAndrey Konovalov1-10/+26
Unify checks in kasan_kfree_large() and in kasan_slab_free_mempool() for large allocations as it's done for small kfree() allocations. With this change, kasan_slab_free_mempool() starts checking that the first byte of the memory that's being freed is accessible. Link: https://lkml.kernel.org/r/14ffc4cd867e0b1ed58f7527e3b748a1b4ad08aa.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: clean up setting free info in kasan_slab_freeAndrey Konovalov1-4/+2
Put kasan_stack_collection_enabled() check and kasan_set_free_info() calls next to each other. The way this was previously implemented was a minor optimization that relied of the the fact that kasan_stack_collection_enabled() is always true for generic KASAN. The confusion that this brings outweights saving a few instructions. Link: https://lkml.kernel.org/r/f838e249be5ab5810bf54a36ef5072cfd80e2da7.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: optimize large kmalloc poisoningAndrey Konovalov1-5/+15
Similarly to kasan_kmalloc(), kasan_kmalloc_large() doesn't need to unpoison the object as it as already unpoisoned by alloc_pages() (or by ksize() for krealloc()). This patch changes kasan_kmalloc_large() to only poison the redzone. Link: https://lkml.kernel.org/r/33dee5aac0e550ad7f8e26f590c9b02c6129b4a3.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan, mm: optimize kmalloc poisoningAndrey Konovalov4-48/+119
For allocations from kmalloc caches, kasan_kmalloc() always follows kasan_slab_alloc(). Currenly, both of them unpoison the whole object, which is unnecessary. This patch provides separate implementations for both annotations: kasan_slab_alloc() unpoisons the whole object, and kasan_kmalloc() only poisons the redzone. For generic KASAN, the redzone start might not be aligned to KASAN_GRANULE_SIZE. Therefore, the poisoning is split in two parts: kasan_poison_last_granule() poisons the unaligned part, and then kasan_poison() poisons the rest. This patch also clarifies alignment guarantees of each of the poisoning functions and drops the unnecessary round_up() call for redzone_end. With this change, the early SLUB cache annotation needs to be changed to kasan_slab_alloc(), as kasan_kmalloc() doesn't unpoison objects now. The number of poisoned bytes for objects in this cache stays the same, as kmem_cache_node->object_size is equal to sizeof(struct kmem_cache_node). Link: https://lkml.kernel.org/r/7e3961cb52be380bc412860332063f5f7ce10d13.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan, mm: don't save alloc stacks twiceAndrey Konovalov2-4/+15
Patch series "kasan: optimizations and fixes for HW_TAGS", v4. This patchset makes the HW_TAGS mode more efficient, mostly by reworking poisoning approaches and simplifying/inlining some internal helpers. With this change, the overhead of HW_TAGS annotations excluding setting and checking memory tags is ~3%. The performance impact caused by tags will be unknown until we have hardware that supports MTE. As a side-effect, this patchset speeds up generic KASAN by ~15%. This patch (of 13): Currently KASAN saves allocation stacks in both kasan_slab_alloc() and kasan_kmalloc() annotations. This patch changes KASAN to save allocation stacks for slab objects from kmalloc caches in kasan_kmalloc() only, and stacks for other slab objects in kasan_slab_alloc() only. This change requires ____kasan_kmalloc() knowing whether the object belongs to a kmalloc cache. This is implemented by adding a flag field to the kasan_info structure. That flag is only set for kmalloc caches via a new kasan_cache_create_kmalloc() annotation. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/7c673ebca8d00f40a7ad6f04ab9a2bddeeae2097.1612546384.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Will Deacon <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Peter Collingbourne <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Branislav Rankov <[email protected]> Cc: Kevin Brodsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kasan: use error_report_end tracepointAlexander Potapenko1-3/+5
Make it possible to trace KASAN error reporting. A good usecase is watching for trace events from the userspace to detect and process memory corruption reports from the kernel. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Suggested-by: Marco Elver <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kfence: use error_report_end tracepointAlexander Potapenko1-0/+2
Make it possible to trace KFENCE error reporting. A good usecase is watching for trace events from the userspace to detect and process memory corruption reports from the kernel. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Suggested-by: Marco Elver <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kfence: report sensitive information based on no_hash_pointersMarco Elver4-23/+14
We cannot rely on CONFIG_DEBUG_KERNEL to decide if we're running a "debug kernel" where we can safely show potentially sensitive information in the kernel log. Instead, simply rely on the newly introduced "no_hash_pointers" to print unhashed kernel pointers, as well as decide if our reports can include other potentially sensitive information such as registers and corrupted bytes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Cc: Timur Tabi <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kfence: add test suiteMarco Elver5-15/+886
Add KFENCE test suite, testing various error detection scenarios. Makes use of KUnit for test organization. Since KFENCE's interface to obtain error reports is via the console, the test verifies that KFENCE outputs expected reports to the console. [[email protected]: fix typo in test] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: show access type in report] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Signed-off-by: Marco Elver <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Co-developed-by: Alexander Potapenko <[email protected]> Reviewed-by: Jann Horn <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joern Engel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kfence, kasan: make KFENCE compatible with KASANAlexander Potapenko4-4/+39
Make KFENCE compatible with KASAN. Currently this helps test KFENCE itself, where KASAN can catch potential corruptions to KFENCE state, or other corruptions that may be a result of freepointer corruptions in the main allocators. [[email protected]: merge fixup] [[email protected]: untag addresses for KFENCE] Link: https://lkml.kernel.org/r/9dc196006921b191d25d10f6e611316db7da2efc.1611946152.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: Jann Horn <[email protected]> Co-developed-by: Marco Elver <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joern Engel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm, kfence: insert KFENCE hooks for SLUBAlexander Potapenko2-14/+48
Inserts KFENCE hooks into the SLUB allocator. To pass the originally requested size to KFENCE, add an argument 'orig_size' to slab_alloc*(). The additional argument is required to preserve the requested original size for kmalloc() allocations, which uses size classes (e.g. an allocation of 272 bytes will return an object of size 512). Therefore, kmem_cache::size does not represent the kmalloc-caller's requested size, and we must introduce the argument 'orig_size' to propagate the originally requested size to KFENCE. Without the originally requested size, we would not be able to detect out-of-bounds accesses for objects placed at the end of a KFENCE object page if that object is not equal to the kmalloc-size class it was bucketed into. When KFENCE is disabled, there is no additional overhead, since slab_alloc*() functions are __always_inline. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: Jann Horn <[email protected]> Co-developed-by: Marco Elver <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Joern Engel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm, kfence: insert KFENCE hooks for SLABAlexander Potapenko3-10/+35
Inserts KFENCE hooks into the SLAB allocator. To pass the originally requested size to KFENCE, add an argument 'orig_size' to slab_alloc*(). The additional argument is required to preserve the requested original size for kmalloc() allocations, which uses size classes (e.g. an allocation of 272 bytes will return an object of size 512). Therefore, kmem_cache::size does not represent the kmalloc-caller's requested size, and we must introduce the argument 'orig_size' to propagate the originally requested size to KFENCE. Without the originally requested size, we would not be able to detect out-of-bounds accesses for objects placed at the end of a KFENCE object page if that object is not equal to the kmalloc-size class it was bucketed into. When KFENCE is disabled, there is no additional overhead, since slab_alloc*() functions are __always_inline. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Co-developed-by: Marco Elver <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jann Horn <[email protected]> Cc: Joern Engel <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26kfence: use pt_regs to generate stack trace on faultsMarco Elver3-34/+43
Instead of removing the fault handling portion of the stack trace based on the fault handler's name, just use struct pt_regs directly. Change kfence_handle_page_fault() to take a struct pt_regs, and plumb it through to kfence_report_error() for out-of-bounds, use-after-free, or invalid access errors, where pt_regs is used to generate the stack trace. If the kernel is a DEBUG_KERNEL, also show registers for more information. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Suggested-by: Mark Rutland <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: add Kernel Electric-Fence infrastructureAlexander Potapenko5-0/+1197
Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7. This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. This series enables KFENCE for the x86 and arm64 architectures, and adds KFENCE hooks to the SLAB and SLUB allocators. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool. At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. The KFENCE memory pool is of fixed size, and if the pool is exhausted no further KFENCE allocations occur. The default config is conservative with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB pages). We have verified by running synthetic benchmarks (sysbench I/O, hackbench) and production server-workload benchmarks that a kernel with KFENCE (using sample intervals 100-500ms) is performance-neutral compared to a non-KFENCE baseline kernel. KFENCE is inspired by GWP-ASan [1], a userspace tool with similar properties. The name "KFENCE" is a homage to the Electric Fence Malloc Debugger [2]. For more details, see Documentation/dev-tools/kfence.rst added in the series -- also viewable here: https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst [1] http://llvm.org/docs/GwpAsan.html [2] https://linux.die.net/man/3/efence This patch (of 9): This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. To detect out-of-bounds writes to memory within the object's page itself, KFENCE also uses pattern-based redzones. The following figure illustrates the page layout: ---+-----------+-----------+-----------+-----------+-----------+--- | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx | | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx | | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x | | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx | | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx | | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx | ---+-----------+-----------+-----------+-----------+-----------+--- Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, a guarded allocation from the KFENCE object pool is returned to the main allocator (SLAB or SLUB). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. To date, we have verified by running synthetic benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE is performance-neutral compared to the non-KFENCE baseline. For more details, see Documentation/dev-tools/kfence.rst (added later in the series). [[email protected]: fix parameter description for kfence_object_start()] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: avoid stalling work queue task without allocations] Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix potential deadlock due to wake_up()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: add option to use KFENCE without static keys] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: add missing copyright and description headers] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Co-developed-by: Marco Elver <[email protected]> Reviewed-by: Jann Horn <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Joern Engel <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/early_ioremap.c: use __func__ instead of function nameStephen Zhang1-6/+6
It is better to use __func__ instead of function name. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stephen Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/backing-dev.c: use might_alloc()Daniel Vetter1-1/+2
Now that my little helper has landed, use it more. On top of the existing check this also uses lockdep through the fs_reclaim annotations. [[email protected]: include linux/sched/mm.h] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Vetter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/dmapool: use might_alloc()Daniel Vetter1-1/+2
Now that my little helper has landed, use it more. On top of the existing check this also uses lockdep through the fs_reclaim annotations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Vetter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/zsmalloc.c: use page_private() to access page->privateMiaohe Lin1-1/+1
It's recommended to use helper macro page_private() to access the private field of page. Use such helper to eliminate direct access. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26zsmalloc: account the number of compacted pages correctlyRokudo Yan1-6/+11
There exists multiple path may do zram compaction concurrently. 1. auto-compaction triggered during memory reclaim 2. userspace utils write zram<id>/compaction node So, multiple threads may call zs_shrinker_scan/zs_compact concurrently. But pages_compacted is a per zsmalloc pool variable and modification of the variable is not serialized(through under class->lock). There are two issues here: 1. the pages_compacted may not equal to total number of pages freed(due to concurrently add). 2. zs_shrinker_scan may not return the correct number of pages freed(issued by current shrinker). The fix is simple: 1. account the number of pages freed in zs_compact locally. 2. use actomic variable pages_compacted to accumulate total number. Link: https://lkml.kernel.org/r/[email protected] Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()Miaohe Lin1-2/+1
We always memset the zspage allocated via cache_alloc_zspage. So it's more convenient to use kmem_cache_zalloc in cache_alloc_zspage than caller do it manually. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: set the sleep_mapped to true for zbud and z3foldTian Tao2-0/+2
zpool driver adds a flag to indicate whether the zpool driver can enter an atomic context after mapping. This patch sets it true for z3fold and zbud. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tian Tao <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reported-by: Mike Galbraith <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/zswap: add the flag can_sleep_mappedTian Tao2-5/+59
Patch series "Fix the compatibility of zsmalloc and zswap". Patch #1 adds a flag to zpool, then zswap used to determine if zpool drivers such as zbud/z3fold/zsmalloc will enter an atomic context after mapping. The difference between zbud/z3fold and zsmalloc is that zsmalloc requires an atomic context that since its map function holds a preempt-disabled, but zbud/z3fold don't require an atomic context. So patch #2 sets flag sleep_mapped to true indicating that zbud/z3fold can sleep after mapping. zsmalloc didn't support sleep after mapping, so don't set that flag to true. This patch (of 2): Add a flag to zpool, named is "can_sleep_mapped", and have it set true for zbud/z3fold, not set this flag for zsmalloc, so its default value is false. Then zswap could go the current path if the flag is true; and if it's false, copy data from src to a temporary buffer, then unmap the handle, take the mutex, process the buffer instead of src to avoid sleeping function called from atomic context. [[email protected]: add return value in zswap_frontswap_load] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix potential memory leak] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix potential uninitialized pointer read on tmp] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix variable 'entry' is uninitialized when used] Link: https://lkml.kernel.org/r/[email protected]: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tian Tao <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reported-by: Mike Galbraith <[email protected]> Cc: Barry Song <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm: zswap: clean up confusing commentRandy Dunlap1-3/+3
Correct wording and change one duplicated word (it) to "it is". Link: https://lkml.kernel.org/r/[email protected] Fixes: 0ab0abcf5115 ("mm/zswap: refactor the get/put routines") Signed-off-by: Randy Dunlap <[email protected]> Cc: Weijie Yang <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-02-26mm/rmap: correct obsolete comment of page_get_anon_vma()Miaohe Lin1-2/+2
Since commit 746b18d421da ("mm: use refcounts for page_lock_anon_vma()"), page_lock_anon_vma() is renamed to page_get_anon_vma() and converted to return a refcount increased anon_vma. But it forgot to change the relevant comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>