Age  Commit message  [Author]  Files changed, -deleted/+added lines
2020-10-16  fs: do not update nr_thps for mappings which support THPs  [Matthew Wilcox (Oracle)]  2 files, -27/+29
The nr_thps counter is to support THPs in the page cache when the filesystem doesn't understand THPs. Eventually it will be removed, but we should still support filesystems which do not understand THPs yet. Move the nr_thp manipulation functions to filemap.h since they're page-cache specific. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Song Liu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
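The move described above keeps the THP accounting private to the page cache; a rough sketch of what such a helper looks like after the move is below. The field name, config guard and the mapping_thp_support() check are assumptions based on this changelog, not the exact hunk.

```c
/* Sketch only: count THPs cached for filesystems that do not understand
 * THPs themselves (field and config names assumed). */
static inline void filemap_nr_thps_inc(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	if (!mapping_thp_support(mapping))
		atomic_inc(&mapping->nr_thps);
#endif
}
```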
2020-10-16  fs: add a filesystem flag for THPs  [Matthew Wilcox (Oracle)]  4 files, -1/+10
The page cache needs to know whether the filesystem supports THPs so that it doesn't send THPs to filesystems which can't handle them. Dave Chinner points out that getting from the page mapping to the filesystem type is too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that information in the address space flags. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Alexander Viro <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Song Liu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
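A minimal sketch of the cached flag described above; the AS_THP_SUPPORT / FS_THP_SUPPORT names are assumptions inferred from the changelog, not confirmed identifiers.

```c
/* The filesystem advertises THP support via its fs_flags; the bit is copied
 * into mapping->flags when the inode's address_space is set up, so the page
 * cache can test it without walking mapping->host->i_sb->s_type->fs_flags. */
static inline bool mapping_thp_support(struct address_space *mapping)
{
	return test_bit(AS_THP_SUPPORT, &mapping->flags);
}
```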
2020-10-16  mm/vmscan: allow arbitrary sized pages to be paged out  [Matthew Wilcox (Oracle)]  1 file, -2/+1
Remove the assumption that a compound page has HPAGE_PMD_NR pins from the page cache. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: "Huang, Ying" <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
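Roughly what the freeable-page check in vmscan becomes once the HPAGE_PMD_NR assumption is dropped; a sketch, not the literal diff.

```c
/* A freeable page cache page is pinned only by the isolating caller, the
 * page cache itself (one pin per subpage) and optional buffer heads. */
static inline int is_page_cache_freeable(struct page *page)
{
	int page_cache_pins = thp_nr_pages(page);	/* was assumed to be HPAGE_PMD_NR */

	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}
```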
2020-10-16  mm/page-writeback: support tail pages in wait_for_stable_page  [Matthew Wilcox (Oracle)]  1 file, -0/+1
page->mapping is undefined for tail pages, so operate exclusively on the head page. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/truncate: fix truncation for pages of arbitrary size  [Matthew Wilcox (Oracle)]  1 file, -3/+3
Remove the assumption that a compound page is HPAGE_PMD_SIZE, and the assumption that any page is PAGE_SIZE. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/rmap: fix assumptions of THP size  [Matthew Wilcox (Oracle)]  1 file, -5/+5
Ask the page what size it is instead of assuming it's PMD size. Do this for anon pages as well as file pages for when someone decides to support that. Leave the assumption alone for pages which are PMD mapped; we don't currently grow THPs beyond PMD size, so we don't need to change this code yet. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/huge_memory: fix can_split_huge_page assumption of THP size  [Matthew Wilcox (Oracle)]  1 file, -2/+2
Ask the page how many subpages it has instead of assuming it's PMD size. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: "Huang, Ying" <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/huge_memory: fix page_trans_huge_mapcount assumption of THP size  [Matthew Wilcox (Oracle)]  1 file, -2/+2
Ask the page what size it is instead of assuming it's PMD size. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/huge_memory: fix split assumption of page size  [Kirill A. Shutemov]  1 file, -7/+8
File THPs may now be of arbitrary size, and we can't rely on that size after doing the split so remember the number of pages before we start the split. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/huge_memory: fix total_mapcount assumption of page size  [Kirill A. Shutemov]  1 file, -4/+5
File THPs may now be of arbitrary order. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/page_owner: change split_page_owner to take a count  [Matthew Wilcox (Oracle)]  4 files, -7/+7
The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
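A sketch of the interface change; the exact prototype is an assumption based on the changelog, but the idea is that callers pass a page count they can still derive from the order.

```c
/* New shape of the hook: a number of base pages, not a compound order
 * (prototype assumed). */
void split_page_owner(struct page *page, unsigned int nr);

/*
 * Callers that used to pass the order now pass the page count, e.g.
 * split_page_owner(page, 1 << order) from split_page()-style callers.
 */
```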
2020-10-16  mm/memory: remove page fault assumption of compound page size  [Matthew Wilcox (Oracle)]  1 file, -3/+4
A compound page in the page cache will not necessarily be of PMD size, so check explicitly. [[email protected]: fix remove page fault assumption of compound page size] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Huang Ying <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/filemap: fix page cache removal for arbitrary sized THPs  [Matthew Wilcox (Oracle)]  1 file, -1/+1
Patch series "Remove assumptions of THP size". There are a number of places in the VM which assume that a THP is a PMD in size. That's true today, and remains true after this patch series, but this is a prerequisite for switching to arbitrary-sized THPs. thp_nr_pages() still returns either HPAGE_PMD_NR or 1, but will be changed later. This patch (of 11): page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Huang Ying <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/filemap: fix storing to a THP shadow entry  [Matthew Wilcox (Oracle)]  1 file, -11/+31
When a THP is removed from the page cache by reclaim, we replace it with a shadow entry that occupies all slots of the XArray previously occupied by the THP. If the user then accesses that page again, we only allocate a single page, but storing it into the shadow entry replaces all entries with that one page. That leads to bugs like page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset) ------------[ cut here ]------------ kernel BUG at mm/filemap.c:2529! https://bugzilla.kernel.org/show_bug.cgi?id=206569 This is hard to reproduce with mainline, but happens regularly with the THP patchset (as so many more THPs are created). This solution is taken from the THP patchset. It splits the shadow entry into order-0 pieces at the time that we bring a new page into cache. Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Song Liu <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Qian Cai <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
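The fix can be sketched roughly as follows (not the literal hunk; locking and error handling omitted): before storing an order-0 page over a THP-sized shadow entry, split the shadow so only one slot is overwritten.

```c
/* Sketch: in the add-to-page-cache path, split an oversized shadow entry
 * into order-0 pieces before the new page is stored. */
void *shadow = xas_load(&xas);
unsigned int order = xa_get_order(xas.xa, xas.xa_index);

if (xa_is_value(shadow) && order > thp_order(page)) {
	xas_split_alloc(&xas, shadow, order, gfp);	/* may allocate, done unlocked */
	xas_split(&xas, shadow, order);			/* done under the xa_lock */
}
xas_store(&xas, page);
```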
2020-10-16  XArray: add xas_split  [Matthew Wilcox (Oracle)]  4 files, -16/+225
In order to use multi-index entries for huge pages in the page cache, we need to be able to split a multi-index entry (eg if a file is truncated in the middle of a huge page entry). This version does not support splitting more than one level of the tree at a time. This is an acceptable limitation for the page cache as we do not expect to support order-12 pages in the near future. [[email protected]: export xas_split_alloc() to modules] [[email protected]: fix xarray split] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix xarray] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Qian Cai <[email protected]> Cc: Song Liu <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  XArray: add xa_get_order  [Matthew Wilcox (Oracle)]  3 files, -0/+70
Patch series "Fix read-only THP for non-tmpfs filesystems". As described more verbosely in the [3/3] changelog, we can inadvertently put an order-0 page in the page cache which occupies 512 consecutive entries. Users are running into this if they enable the READ_ONLY_THP_FOR_FS config option; see https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also reported it here: https://lore.kernel.org/lkml/[email protected]/ This is a rather intrusive way of fixing the problem, but has the advantage that I've actually been testing it with the THP patches, which means that it sees far more use than it does upstream -- indeed, Song has been entirely unable to reproduce it. It also has the advantage that it removes a few patches from my gargantuan backlog of THP patches. This patch (of 3): This function returns the order of the entry at the index. We need this because there isn't space in the shadow entry to encode its order. [[email protected]: export xa_get_order to modules] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Qian Cai <[email protected]> Cc: Song Liu <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable: avoid doing memory allocation with pgtable_t mapped.  [Aneesh Kumar K.V]  1 file, -3/+8
With highmem, pte_alloc_map() keeps the level4 page table mapped using kmap_atomic(). Avoid doing new memory allocations while the page table is mapped like that:
[ 9.409233] BUG: sleeping function called from invalid context at mm/page_alloc.c:4822
[ 9.410557] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper
[ 9.411932] no locks held by swapper/1.
[ 9.412595] CPU: 0 PID: 1 Comm: swapper Not tainted 5.9.0-rc3-00323-gc50eb1ed654b5 #2
[ 9.413824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 9.415207] Call Trace:
[ 9.415651] ? ___might_sleep.cold+0xa7/0xcc
[ 9.416367] ? __alloc_pages_nodemask+0x14c/0x5b0
[ 9.417055] ? swap_migration_tests+0x50/0x293
[ 9.417704] ? debug_vm_pgtable+0x4bc/0x708
[ 9.418287] ? swap_migration_tests+0x293/0x293
[ 9.418911] ? do_one_initcall+0x82/0x3cb
[ 9.419465] ? parse_args+0x1bd/0x280
[ 9.419983] ? rcu_read_lock_sched_held+0x36/0x60
[ 9.420673] ? trace_initcall_level+0x1f/0xf3
[ 9.421279] ? trace_initcall_level+0xbd/0xf3
[ 9.421881] ? do_basic_setup+0x9d/0xdd
[ 9.422410] ? do_basic_setup+0xc3/0xdd
[ 9.422938] ? kernel_init_freeable+0x72/0xa3
[ 9.423539] ? rest_init+0x134/0x134
[ 9.424055] ? kernel_init+0x5/0x12c
[ 9.424574] ? ret_from_fork+0x19/0x30
Reported-by: kernel test robot <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable: avoid none pte in pte_clear_test  [Aneesh Kumar K.V]  1 file, -3/+6
pte_clear_tests() operates on an existing pte entry. Make sure that it is not a none pte entry. [[email protected]: avoid kernel crash with riscv] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Albert Ou <[email protected]> Cc: Palmer Dabbelt <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/hugetlb: disable hugetlb test on ppc64  [Aneesh Kumar K.V]  1 file, -51/+0
The test seems to be missing quite a lot of details w.r.t. allocating the correct pgtable_t page (huge_pte_alloc()), holding the right lock (huge_pte_lock()), etc. The vma used is also not a hugetlb VMA. ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these. Hence disable the test on ppc64. [[email protected]: drop hugetlb_advanced_tests()] Link: https://lore.kernel.org/lkml/[email protected]/#t Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/pmd_clear: don't use pmd/pud_clear on pte entries  [Aneesh Kumar K.V]  1 file, -3/+4
pmd_clear() should not be used to clear pmd level pte entries. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/thp: use page table deposit/withdraw with THP  [Aneesh Kumar K.V]  1 file, -3/+7
Architectures like ppc64 use a deposited page table while updating the huge pte entries. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/locks: take correct page table lock  [Aneesh Kumar K.V]  1 file, -13/+22
Make sure we call the pte accessors with the correct lock held. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/locks: move non page table modifying test together  [Aneesh Kumar K.V]  1 file, -23/+28
This will help in adding proper locks in a later patch. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/set_pte/pmd/pud: don't use set_*_at to update an existing pte entry  [Aneesh Kumar K.V]  1 file, -20/+15
set_pte_at() should not be used to set a pte entry at locations that already hold a valid pte entry. Architectures like ppc64 don't do a TLB invalidate in set_pte_at() and hence expect it to be used only on locations that do not hold a valid PTE. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
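The rule the series enforces can be sketched as below (illustration only; mm, vaddr, ptep and pte are assumed to come from the surrounding test context):

```c
/* Never overwrite a live pte with set_pte_at(); clear the slot first so the
 * architecture gets a chance to do its TLB/hash bookkeeping. */
ptep_get_and_clear(mm, vaddr, ptep);
set_pte_at(mm, vaddr, ptep, pte);
```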
2020-10-16  mm/debug_vm_pgtable/savedwrite: enable savedwrite test with CONFIG_NUMA_BALANCING  [Aneesh Kumar K.V]  1 file, -2/+9
Saved write support was added to track the write bit of a pte after marking the pte protnone. This was done so that AUTONUMA can convert a write pte to protnone and still track the old write bit. When converting it back we set the pte write bit correctly thereby avoiding a write fault again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled and use protnone protflags. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtables/hugevmap: use the arch helper to identify huge vmap support.  [Aneesh Kumar K.V]  1 file, -2/+12
ppc64 supports huge vmap only with radix translation. Hence use the arch helper to determine huge vmap support. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  mm/debug_vm_pgtable/ppc64: avoid setting top bits in random value  [Aneesh Kumar K.V]  1 file, -3/+10
ppc64 uses bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in the random value. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  powerpc/mm: move setting pte specific flags to pfn_pte  [Aneesh Kumar K.V]  3 files, -16/+9
powerpc used to set the pte specific flags in set_pte_at(). This is different from other architectures. To be consistent with the other architectures, update pfn_pte to set _PAGE_PTE on ppc64. Also, drop the now-unused pte_mkpte. We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without setting the _PAGE_PTE bit. We will remove that after a few releases. With respect to huge pmd entries, pmd_mkhuge() takes care of adding the _PAGE_PTE bit. [[email protected]: whitespace fix, per Christophe] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Ellerman <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
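The idea can be sketched as follows (close to, but not guaranteed to match, the exact ppc64 hunk): the PTE marker is applied where the pte value is constructed, so every construction path gets it and set_pte_at() no longer has to.

```c
/* Sketch: book3s64 pfn_pte() now carries _PAGE_PTE itself. */
static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot)
{
	return __pte(((pte_basic_t)pfn << PAGE_SHIFT) |
		     pgprot_val(pgprot) | _PAGE_PTE);
}
```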
2020-10-16  powerpc/mm: add DEBUG_VM WARN for pmd_clear  [Aneesh Kumar K.V]  1 file, -0/+14
Patch series "mm/debug_vm_pgtable fixes", v4. This patch series includes fixes for the debug_vm_pgtable test code so that it follows the page table update rules correctly. The first two patches introduce changes w.r.t. ppc64. The hugetlb test is disabled on ppc64 because it needs a larger change to satisfy the page table update rules. These tests are broken w.r.t. the page table update rules and result in a kernel crash as below:
[ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
  pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
  lr: c0000000005eeeec: pte_update+0x11c/0x190
  sp: c000000c6d1e7950
  msr: 8000000002029033
  current = 0xc000000c6d172c80
  paca = 0xc000000003ba0000
  irqmask: 0x03 irq_happened: 0x01
  pid = 1, comm = swapper/0
kernel BUG at arch/powerpc/mm/pgtable.c:304!
[link register ] c0000000005eeeec pte_update+0x11c/0x190
[c000000c6d1e7950] 0000000000000001 (unreliable)
[c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
[c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
[c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
[c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
[c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
[c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
[c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
With DEBUG_VM disabled:
[ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 20.530183] Faulting instruction address: 0xc0000000000df330
cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
  pc: c0000000000df330: memset+0x68/0x104
  lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
  sp: c000000c6d19f990
  msr: 8000000002009033
  dar: 0
  current = 0xc000000c6d177480
  paca = 0xc00000001ec4f400
  irqmask: 0x03 irq_happened: 0x01
  pid = 1, comm = swapper/0
[link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
[c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
[c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
[c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
[c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
[c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
[c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
[c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
This patch (of 13): With the hash page table, the kernel should not use pmd_clear for clearing huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage. Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  device-dax/kmem: fix resource release  [Dan Williams]  1 file, -14/+34
The conversion to request_mem_region() is broken because it assumes that the range is marked busy prior to release. However, due to the way that the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to let {add,remove}_memory() handle busy) it requires a manual release_resource() to perform cleanup. Given that the actual 'struct resource *' needs to be recalled, not just the range, add that tracking to the kmem driver-data. Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with release_mem_region()") Reported-by: David Hildenbrand <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Brice Goglin <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jia He <[email protected]> Cc: Joao Martins <[email protected]> Cc: Jonathan Cameron <[email protected]> Link: https://lkml.kernel.org/r/160272252925.3136502.17220638073995895400.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <[email protected]>
2020-10-16  RDMA/ucma: Fix use after free in destroy id flow  [Maor Gottlieb]  1 file, -5/+6
ucma_free_ctx() should call __destroy_id() on all the connection requests that have not been delivered to user space. Currently it calls it on the context itself, which causes a use after free. Fixes the trace:
BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108
Faulting instruction address: 0xc0080000002428f4
Oops: Kernel access of bad area, sig: 11 [#1]
Call Trace:
[c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable)
[c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm]
[c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm]
[c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330
[c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110
[c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50
[c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0
[c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30
[c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470
[c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240
[c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0
Rename listen_ctx to conn_req_ctx as the poor name was the cause of this bug. Fixes: a1d33b70dbbc ("RDMA/ucma: Rework how new connections are passed through event delivery") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Maor Gottlieb <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2020-10-16  RDMA/rxe: Handle skb_clone() failure in rxe_recv.c  [Bob Pearson]  1 file, -0/+3
If skb_clone() is unable to allocate memory for a new sk_buff this is not detected by the current code. Check for a NULL return and continue. This is similar to other errors in this loop over QPs attached to the multicast address and consistent with the unreliable UD transport. Fixes: e7ec96fc7932f ("RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()") Addresses-Coverity-ID: 1497804: Null pointer dereferences (NULL_RETURNS) Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bob Pearson <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
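The guard is essentially the following (a sketch of the loop body, with the surrounding variable names assumed):

```c
/* Clone the packet for this QP; on allocation failure just skip it, which
 * is acceptable for the unreliable UD/multicast path. */
struct sk_buff *cskb = skb_clone(skb, GFP_ATOMIC);

if (!cskb)
	continue;
```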
2020-10-16  RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI  [Jason Gunthorpe]  2 files, -4/+10
RXE was wrongly using an internal kernel enum as part of its uAPI, split this out into a dedicated uAPI enum just for RXE. It only uses the IPv4 and IPv6 values. This was exposed by changing the internal kernel enum definition which broke RXE. Fixes: 1c15b4f2a42f ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type") Signed-off-by: Jason Gunthorpe <[email protected]>
2020-10-16  RDMA: Explicitly pass in the dma_device to ib_register_device  [Jason Gunthorpe]  17 files, -80/+59
The code in setup_dma_device has become rather convoluted, move all of this to the drivers. Drivers now pass in a DMA capable struct device which will be used to setup DMA, or drivers must fully configure the ibdev for DMA and pass in NULL. Other than setting the masks in rvt all drivers were doing this already anyhow. mthca, mlx4 and mlx5 were already setting up the maximum DMA segment size based on their hardware limits in:
  __mthca_init_one()
    dma_set_max_seg_size (1G)
  __mlx4_init_one()
    dma_set_max_seg_size (1G)
  mlx5_pci_init()
    set_dma_caps()
      dma_set_max_seg_size (2G)
Other non-software drivers (except usnic) were extended to UINT_MAX [1, 2] instead of 2G as before. [1] https://lore.kernel.org/linux-rdma/[email protected]/ [2] https://lore.kernel.org/linux-rdma/[email protected]/ Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/6b2ed339933d066622d5715903870676d8cc523a.1602590106.git.mchehab+huawei@kernel.org Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Parav Pandit <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2020-10-16  PCI/ASPM: Remove struct pcie_link_state.l1ss  [Saheed O. Bolarinwa]  2 files, -35/+50
Previously we computed L1.2 parameters in the enumeration path, saved them in struct pcie_link_state.l1ss, and programmed them into the devices whenever we enabled or disabled L1.2 on the link. But these parameters are constant and don't need to be updated when enabling/disabling L1.2. Compute and program the L1.2 parameters once during enumeration and remove the struct pcie_link_state.l1ss member. No functional change intended. [bhelgaas: rework to program L1.2 parameters during enumeration] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.l1ss_cap  [Saheed O. Bolarinwa]  1 file, -32/+21
Previously we stored the L1SS Capabilities value in the struct aspm_register_info. We only need this information in one place, so read it there and remove struct aspm_register_info completely, since it's now empty. No functional change intended. [bhelgaas: split up, don't cache l1ss_cap in pci_dev] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Pass L1SS Capabilities value, not struct aspm_register_info  [Bjorn Helgaas]  1 file, -9/+8
aspm_calc_l1ss_info() needs only the L1SS Capabilities. It doesn't need anything else from struct aspm_register_info, so pass only the Capabilities value. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.l1ss_ctl1  [Saheed O. Bolarinwa]  1 file, -12/+14
Previously we stored the L1SS Control 1 register in the struct aspm_register_info. We only need this information in one place, so read it there and remove it from struct aspm_register_info. No functional change intended. [bhelgaas: split ctl1/ctl2] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.l1ss_ctl2 (unused)  [Bjorn Helgaas]  1 file, -4/+1
We never use the aspm_register_info.l1ss_ctl2 value, so remove it. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.l1ss_cap_ptr  [Saheed O. Bolarinwa]  3 files, -21/+19
Save the L1 Substates Capability pointer in struct pci_dev. Then we don't have to keep track of it in the struct aspm_register_info and struct pcie_link_state, which makes the code easier to read. No functional change intended. [bhelgaas: split to a separate patch] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.latency_encoding  [Saheed O. Bolarinwa]  1 file, -14/+10
Previously we stored L0s and L1 Exit Latency information from the Link Capabilities register in the struct aspm_register_info. We only need these latencies when we already have the Link Capabilities values, so use those directly and remove the latencies from struct aspm_register_info. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.enabled  [Saheed O. Bolarinwa]  1 file, -8/+6
Previously we stored the "ASPM Control" bits from the Link Control register in the struct aspm_register_info. Read PCI_EXP_LNKCTL directly when needed. This means we can use the PCI_EXP_LNKCTL_ASPM_* bits directly instead of the similar but different PCIE_LINK_STATE_* bits. No functional change intended. [bhelgaas: drop get_aspm_enable() and read LNKCTL once directly] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Remove struct aspm_register_info.support  [Saheed O. Bolarinwa]  2 files, -11/+16
Previously we stored the "ASPM Support" field from the Link Capabilities register in the struct aspm_register_info. Read the Link Capabilities directly when needed and remove it from the struct aspm_register_info. No functional change intended. [bhelgaas: remove pci_dev cached copy since LNKCAP isn't truly read-only, add PCI_EXP_LNKCAP_ASPM_L0S & PCI_EXP_LNKCAP_ASPM_L1, check them directly instead of adding aspm_support()] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Saheed O. Bolarinwa <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Use 'parent' and 'child' for readability  [Bjorn Helgaas]  1 file, -4/+5
Other users of link->pdev and link->downstream, e.g., pcie_aspm_cap_init(), pcie_config_aspm_l1ss(), and pcie_config_aspm_link(), use "parent" and "child" as local names. Do the same in aspm_calc_l1ss_info() for readability. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Move LTR path check to where it's used  [Bjorn Helgaas]  1 file, -9/+8
pcie_get_aspm_reg() mostly reads ASPM-related registers, but in some cases it also updates the value read from PCI_L1SS_CAP based on LTR properties. Move this update to the point where the value is used to make the code more readable. No functional change intended, although previously we could clear PCI_L1SS_CAP_ASPM_L1_2 for both ends of the link, and now we'll only do it for the downstream end of a link. This shouldn't matter because we always test that bit by ANDing l1ss_cap for the upstream and downstream ends. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  PCI/ASPM: Move pci_clear_and_set_dword() earlier  [Bjorn Helgaas]  1 file, -11/+11
Move pci_clear_and_set_dword() earlier in file to prepare for future patch. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]>
2020-10-16  lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values  [Jason Gunthorpe]  1 file, -4/+8
The main intention of the max_segment argument to __sg_alloc_table_from_pages() is to match the DMA layer segment size set by dma_set_max_seg_size(). Restricting the input to be page aligned makes it impossible to just connect the DMA layer to this API. The only reason for a page alignment here is because the algorithm will overshoot the max_segment if it is not a multiple of PAGE_SIZE. Simply fix the alignment before starting and don't expose this implementation detail to the callers. A future patch will completely remove SCATTERLIST_MAX_SEGMENT. Signed-off-by: Jason Gunthorpe <[email protected]>
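The fix described above amounts to something like this near the top of __sg_alloc_table_from_pages(); a sketch under the assumption that the existing error convention is kept, not the literal diff.

```c
/* Accept any max_segment the DMA layer hands us: round it down to a page
 * multiple instead of demanding that callers pre-align it. */
if (WARN_ON(max_segment < PAGE_SIZE))
	return ERR_PTR(-EINVAL);	/* assumed error convention */
max_segment = ALIGN_DOWN(max_segment, PAGE_SIZE);
```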
2020-10-16  Merge branch 'dynamic_sg' into rdma.git for-next  [Jason Gunthorpe]  577 files, -3494/+5626
From Maor Gottlieb: ==================== This series extends __sg_alloc_table_from_pages to allow chaining of new pages to an already initialized SG table. This allows drivers to utilize the optimization of merging contiguous pages without needing to pre-allocate all the pages and hold them in a very large temporary buffer prior to the call to SG table initialization. The last patch changes the Infiniband core to use the new API. It removes duplicate functionality from the code and benefits from the optimization of allocating a dynamic SG table from pages. On a system with 2MB huge pages, without this change the SG table would contain 512x as many SG entries. ====================
* branch 'dynamic_sg':
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
2020-10-16  tracing: Remove __init from __trace_early_add_new_event()  [Masami Hiramatsu]  1 file, -1/+1
The commit 720dee53ad8d ("tracing/boot: Initialize per-instance event list in early boot") removes __init from __trace_early_add_events() but __trace_early_add_new_event() still has __init and will cause a section mismatch. Remove __init from __trace_early_add_new_event(), the same as was done for __trace_early_add_events(). Link: https://lore.kernel.org/lkml/CAHk-=wjU86UhovK4XuwvCqTOfc+nvtpAuaN2PJBz15z=w=u0Xg@mail.gmail.com/ Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-10-16  afs: Don't assert on unpurgeable server records  [David Howells]  2 files, -1/+8
Don't give an assertion failure on unpurgeable afs_server records - which kills the thread - but rather emit a trace line when we are purging a record (which only happens during network namespace removal or rmmod) and print a notice of the problem. Signed-off-by: David Howells <[email protected]>