The nr_thps counter exists to support THPs in the page cache when the
filesystem doesn't understand THPs. Eventually it will be removed, but we
should still support filesystems which do not understand THPs yet. Move
the nr_thps manipulation functions to filemap.h since they're page-cache
specific.
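For illustration, a minimal sketch of the helpers in their page-cache home,
assuming the counter stays an atomic_t in struct address_space guarded by
CONFIG_READ_ONLY_THP_FOR_FS:

  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
  static inline void filemap_nr_thps_inc(struct address_space *mapping)
  {
          atomic_inc(&mapping->nr_thps);
  }

  static inline void filemap_nr_thps_dec(struct address_space *mapping)
  {
          atomic_dec(&mapping->nr_thps);
  }
  #endif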
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The page cache needs to know whether the filesystem supports THPs so that
it doesn't send THPs to filesystems which can't handle them. Dave Chinner
points out that getting from the page mapping to the filesystem type is
too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
information in the address space flags.
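A sketch of the caching (the AS_THP_SUPPORT flag and mapping_thp_support()
names follow this series; treat the exact spelling as illustrative):

  /* at inode instantiation, copy the fs_flags bit into the mapping */
  if (sb->s_type->fs_flags & FS_THP_SUPPORT)
          __set_bit(AS_THP_SUPPORT, &mapping->flags);

  /* the page cache then needs a single dereference */
  static inline bool mapping_thp_support(struct address_space *mapping)
  {
          return test_bit(AS_THP_SUPPORT, &mapping->flags);
  }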
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the assumption that a compound page has HPAGE_PMD_NR pins from the
page cache.
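The replacement pattern counts one pin per subpage rather than a flat
HPAGE_PMD_NR; a sketch in the style of mm/migrate.c's expected_page_refs():

  static int expected_page_refs(struct address_space *mapping,
                                struct page *page)
  {
          int expected_count = 1;

          /* one reference per subpage, plus any private data */
          if (mapping)
                  expected_count += thp_nr_pages(page) +
                                    page_has_private(page);
          return expected_count;
  }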
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: "Huang, Ying" <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
page->mapping is undefined for tail pages, so operate exclusively on the
head page.
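The general pattern, sketched:

  struct page *head = compound_head(page);

  /* ->mapping and ->index are only meaningful on the head page */
  struct address_space *mapping = head->mapping;
  pgoff_t index = head->index + (page - head);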
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the assumption that a compound page is HPAGE_PMD_SIZE, and the
assumption that any page is PAGE_SIZE.
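The helpers that replace those assumptions are roughly:

  /* derive the size from the page instead of using HPAGE_PMD_SIZE */
  static inline unsigned long thp_size(struct page *page)
  {
          return PAGE_SIZE << thp_order(page);
  }

so a bounds check becomes, e.g., "if (pos >= thp_size(page))" rather than
comparing against HPAGE_PMD_SIZE.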
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Ask the page what size it is instead of assuming it's PMD size. Do this
for anon pages as well as file pages for when someone decides to support
that. Leave the assumption alone for pages which are PMD mapped; we don't
currently grow THPs beyond PMD size, so we don't need to change this code
yet.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Ask the page how many subpages it has instead of assuming it's PMD size.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: "Huang, Ying" <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Ask the page what size it is instead of assuming it's PMD size.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
File THPs may now be of arbitrary size, and we can't rely on that size
after doing the split, so remember the number of pages before we start the
split.
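A simplified sketch of the affected loop in __split_huge_page() (helper
names as in mm/huge_memory.c; details elided):

  unsigned int nr = thp_nr_pages(head);  /* sample before splitting */
  int i;

  for (i = nr - 1; i >= 1; i--)
          __split_huge_page_tail(head, i, lruvec, list);

  /* from here on the compound metadata is gone: keep using 'nr' */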
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
File THPs may now be of arbitrary order.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The implementation of split_page_owner() prefers a count rather than the
old order of the page. When we support a variable size THP, we won't
have the order at this point, but we will have the number of pages.
So change the interface to what the caller and callee would prefer.
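The interface change, sketched:

  /* before: took the old order of the page */
  void split_page_owner(struct page *page, unsigned int order);

  /* after: takes the number of new order-0 pages */
  void split_page_owner(struct page *page, unsigned int nr);

  /* callers: */
  split_page_owner(page, 1 << order);    /* split_page() */
  split_page_owner(head, nr);            /* __split_huge_page() */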
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A compound page in the page cache will not necessarily be of PMD size,
so check explicitly.
[[email protected]: fix remove page fault assumption of compound page size]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Remove assumptions of THP size".
There are a number of places in the VM which assume that a THP is a PMD in
size. That's true today, and remains true after this patch series, but
this is a prerequisite for switching to arbitrary-sized THPs.
thp_nr_pages() still returns either HPAGE_PMD_NR or 1, but will be changed
later.
This patch (of 11):
page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Huang Ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a THP is removed from the page cache by reclaim, we replace it with a
shadow entry that occupies all slots of the XArray previously occupied by
the THP. If the user then accesses that page again, we only allocate a
single page, but storing it into the shadow entry replaces all entries
with that one page. That leads to bugs like
page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
------------[ cut here ]------------
kernel BUG at mm/filemap.c:2529!
https://bugzilla.kernel.org/show_bug.cgi?id=206569
This is hard to reproduce with mainline, but happens regularly with the
THP patchset (as so many more THPs are created). This solution is taken
from the THP patchset. It splits the shadow entry into order-0 pieces at
the time that we bring a new page into the cache.
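How that looks when bringing the page in, as a simplified sketch modelled
on __add_to_page_cache_locked() (locking and error handling elided; gfp is
the caller's mask):

  void *entry = xas_load(&xas);
  int order = xa_get_order(xas.xa, xas.xa_index);

  if (xa_is_value(entry) && order > thp_order(page)) {
          /* break the multi-index shadow into order-0 pieces */
          xas_split_alloc(&xas, entry, order, gfp);
          xas_split(&xas, entry, order);
  }
  xas_store(&xas, page);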
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Song Liu <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Qian Cai <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In order to use multi-index entries for huge pages in the page cache, we
need to be able to split a multi-index entry (eg if a file is truncated in
the middle of a huge page entry). This version does not support splitting
more than one level of the tree at a time. This is an acceptable
limitation for the page cache as we do not expect to support order-12
pages in the near future.
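The calling convention pairs a sleeping allocation step with the split
proper, which runs under the lock and cannot fail once the allocation has
succeeded; a usage sketch:

  /* pre-allocate the nodes the split will need */
  xas_split_alloc(&xas, entry, order, GFP_KERNEL);

  xas_lock_irq(&xas);
  xas_split(&xas, entry, order);
  xas_unlock_irq(&xas);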
[[email protected]: export xas_split_alloc() to modules]
[[email protected]: fix xarray split]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix xarray]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Song Liu <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Fix read-only THP for non-tmpfs filesystems".
As described more verbosely in the [3/3] changelog, we can inadvertently
put an order-0 page in the page cache which occupies 512 consecutive
entries. Users are running into this if they enable the
READ_ONLY_THP_FOR_FS config option; see
https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also
reported it here:
https://lore.kernel.org/lkml/[email protected]/
This is a rather intrusive way of fixing the problem, but has the
advantage that I've actually been testing it with the THP patches, which
means that it sees far more use than it does upstream -- indeed, Song has
been entirely unable to reproduce it. It also has the advantage that it
removes a few patches from my gargantuan backlog of THP patches.
This patch (of 3):
This function returns the order of the entry at the index. We need this
because there isn't space in the shadow entry to encode its order.
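Usage is a single call against the XArray; a sketch:

  /* how big is the (possibly shadow) entry stored at @index? */
  int order = xa_get_order(&mapping->i_pages, index);
  unsigned long nr = 1UL << order;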
[[email protected]: export xa_get_order to modules]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Song Liu <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With highmem, pte_alloc_map() keeps the level-4 (PTE-level) page table
mapped using kmap_atomic(). Avoid doing new memory allocations while the
page table is mapped like that.
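A sketch of the corrected ordering; the splat below shows the
sleeping-function warning this avoids:

  /* do any sleeping allocations first ... */
  page = alloc_page(GFP_KERNEL);

  /* ... then take the kmap_atomic() mapping of the pte level */
  ptep = pte_alloc_map(mm, pmdp, vaddr);
  /* only non-sleeping work while the table is mapped */
  pte_unmap(ptep);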
[ 9.409233] BUG: sleeping function called from invalid context at mm/page_alloc.c:4822
[ 9.410557] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper
[ 9.411932] no locks held by swapper/1.
[ 9.412595] CPU: 0 PID: 1 Comm: swapper Not tainted 5.9.0-rc3-00323-gc50eb1ed654b5 #2
[ 9.413824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 9.415207] Call Trace:
[ 9.415651] ? ___might_sleep.cold+0xa7/0xcc
[ 9.416367] ? __alloc_pages_nodemask+0x14c/0x5b0
[ 9.417055] ? swap_migration_tests+0x50/0x293
[ 9.417704] ? debug_vm_pgtable+0x4bc/0x708
[ 9.418287] ? swap_migration_tests+0x293/0x293
[ 9.418911] ? do_one_initcall+0x82/0x3cb
[ 9.419465] ? parse_args+0x1bd/0x280
[ 9.419983] ? rcu_read_lock_sched_held+0x36/0x60
[ 9.420673] ? trace_initcall_level+0x1f/0xf3
[ 9.421279] ? trace_initcall_level+0xbd/0xf3
[ 9.421881] ? do_basic_setup+0x9d/0xdd
[ 9.422410] ? do_basic_setup+0xc3/0xdd
[ 9.422938] ? kernel_init_freeable+0x72/0xa3
[ 9.423539] ? rest_init+0x134/0x134
[ 9.424055] ? kernel_init+0x5/0x12c
[ 9.424574] ? ret_from_fork+0x19/0x30
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pte_clear_tests operates on an existing pte entry. Make sure that it is
not a none pte entry.
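A sketch of the adjusted test: install a known pte first, then verify the
clear (body illustrative):

  static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
                                     unsigned long pfn, unsigned long vaddr,
                                     pgprot_t prot)
  {
          pte_t pte = pfn_pte(pfn, prot);

          set_pte_at(mm, vaddr, ptep, pte);  /* not a none entry now */
          barrier();
          pte_clear(mm, vaddr, ptep);
          WARN_ON(!pte_none(ptep_get(ptep)));
  }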
[[email protected]: avoid kernel crash with riscv]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The test seems to be missing quite a lot of details w.r.t allocating the
correct pgtable_t page (huge_pte_alloc()), holding the right lock
(huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
Hence disable the test on ppc64.
[[email protected]: drop hugetlb_advanced_tests()]
Link: https://lore.kernel.org/lkml/[email protected]/#t
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pmd_clear() should not be used to clear pmd level pte entries.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Architectures like ppc64 use a deposited page table while updating huge
pte entries.
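The expectation, sketched (locking elided): deposit a page table before
exercising the huge-pte helpers and withdraw it afterwards:

  pgtable = pte_alloc_one(mm);
  if (!pgtable)
          return;

  pgtable_trans_huge_deposit(mm, pmdp, pgtable);
  /* ... huge pte update tests may consume the deposit ... */
  pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
  pte_free(mm, pgtable);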
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make sure we call the pte accessors with the correct lock held.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This will help in adding proper locks in a later patch.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
set_pte_at() should not be used to set a pte entry at locations that
already hold a valid pte entry. Architectures like ppc64 don't do a TLB
invalidate in set_pte_at() and hence expect it to be used only on locations
that do not hold a valid pte entry.
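So the rule for the tests becomes: invalidate the location before the next
set_pte_at(); a sketch:

  set_pte_at(mm, vaddr, ptep, pte);      /* location was empty */
  /* ... */
  ptep_get_and_clear(mm, vaddr, ptep);   /* invalidate first */
  set_pte_at(mm, vaddr, ptep, new_pte);  /* legal again */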
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Saved write support was added to track the write bit of a pte after
marking the pte protnone. This was done so that AUTONUMA can convert a
write pte to protnone and still track the old write bit. When converting
it back, we set the pte write bit correctly, thereby avoiding a write fault
again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled,
and use protnone protflags.
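The reworked test, roughly (prot is expected to be a protnone protection
here; exact body illustrative):

  static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot)
  {
          pte_t pte = pfn_pte(pfn, prot);

          if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
                  return;

          WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte))));
          WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte))));
  }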
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ppc64 supports huge vmap only with radix translation. Hence use the arch
helper to determine huge vmap support.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ppc64 uses bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that
bit in the random value used by the tests.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
powerpc used to set the pte-specific flags in set_pte_at(). This is
different from other architectures. To be consistent with the other
architectures, update pfn_pte() to set _PAGE_PTE on ppc64. Also, drop the
now-unused pte_mkpte().
We add a VM_WARN_ON() to catch usage of set_pte_at() without the
_PAGE_PTE bit set. We will remove that after a few releases.
With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
_PAGE_PTE bit.
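A much-simplified sketch of the ppc64 side (the real pfn_pte() does more
masking):

  static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot)
  {
          return __pte(((pte_basic_t)pfn << PAGE_SHIFT) |
                       pgprot_val(pgprot) | _PAGE_PTE);
  }

  /* and set_pte_at() grows the transitional check: */
  VM_WARN_ON(!(pte_val(pte) & _PAGE_PTE));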
[[email protected]: whitespace fix, per Christophe]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Christophe Leroy <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Michael Ellerman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/debug_vm_pgtable fixes", v4.
This patch series includes fixes for the debug_vm_pgtable test code so that
it follows the page table update rules correctly. The first two patches
introduce changes w.r.t ppc64.
The hugetlb test is disabled on ppc64 because it needs a larger change to
satisfy the page table update rules.
These tests are broken w.r.t the page table update rules and result in
kernel crashes like the ones below.
[ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0]
pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380
lr: c0000000005eeeec: pte_update+0x11c/0x190
sp: c000000c6d1e7950
msr: 8000000002029033
current = 0xc000000c6d172c80
paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01
pid = 1, comm = swapper/0
kernel BUG at arch/powerpc/mm/pgtable.c:304!
[link register ] c0000000005eeeec pte_update+0x11c/0x190
[c000000c6d1e7950] 0000000000000001 (unreliable)
[c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190
[c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8
[c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338
[c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0
[c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4
[c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160
[c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
With DEBUG_VM disabled
[ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 20.530183] Faulting instruction address: 0xc0000000000df330
cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700]
pc: c0000000000df330: memset+0x68/0x104
lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
sp: c000000c6d19f990
msr: 8000000002009033
dar: 0
current = 0xc000000c6d177480
paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01
pid = 1, comm = swapper/0
[link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
[c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
[c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378
[c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244
[c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0
[c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4
[c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160
[c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c
This patch (of 13):
With the hash page table, the kernel should not use pmd_clear for clearing
huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage.
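A simplified, generic sketch of the idea (the real ppc64 check looks at
the hash PTE bits and only applies when not running radix):

  static inline void pmd_clear(pmd_t *pmdp)
  {
          if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled())
                  /* huge (leaf) entries must not be cleared this way */
                  WARN_ON((pmd_val(*pmdp) & _PAGE_PTE) == _PAGE_PTE);

          *pmdp = __pmd(0);
  }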
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christophe Leroy <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The conversion to request_mem_region() is broken because it assumes that
the range is marked busy prior to release. However, due to the way that
the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to let
{add,remove}_memory() handle busy) it requires a manual release_resource()
to perform cleanup.
Given that the actual 'struct resource *' needs to be recalled, not just
the range, add that tracking to the kmem driver-data.
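A sketch of the driver-data change (field names illustrative):

  struct dax_kmem_data {
          const char *res_name;
          struct resource *res;  /* returned by request_mem_region() */
  };

  /* probe: */
  data->res = request_mem_region(range.start, range_len(&range),
                                 data->res_name);

  /* remove: the IORESOURCE_BUSY games mean we release by pointer */
  release_resource(data->res);
  kfree(data->res);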
Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with release_mem_region()")
Reported-by: David Hildenbrand <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Brice Goglin <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jia He <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Link: https://lkml.kernel.org/r/160272252925.3136502.17220638073995895400.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ucma_free_ctx() should call __destroy_id() on all the connection requests
that have not been delivered to user space. Currently it calls it on the
context itself, which causes a use-after-free.
Fixes the trace:
BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108
Faulting instruction address: 0xc0080000002428f4
Oops: Kernel access of bad area, sig: 11 [#1]
Call Trace:
[c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable)
[c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm]
[c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm]
[c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330
[c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110
[c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50
[c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0
[c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30
[c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470
[c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240
[c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0
Rename listen_ctx to conn_req_ctx as the poor name was the cause of this
bug.
Fixes: a1d33b70dbbc ("RDMA/ucma: Rework how new connections are passed through event delivery")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
If skb_clone() is unable to allocate memory for a new sk_buff, this is not
detected by the current code.
Check for a NULL return and continue. This is similar to other errors in
this loop over QPs attached to the multicast address and consistent with
the unreliable UD transport.
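The shape of the fix (variable name illustrative):

  per_qp_skb = skb_clone(skb, GFP_ATOMIC);
  if (!per_qp_skb)
          continue;  /* best effort, as for other errors in this loop */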
Fixes: e7ec96fc7932f ("RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()")
Addresses-Coverity-ID: 1497804: Null pointer dereferences (NULL_RETURNS)
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bob Pearson <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
RXE was wrongly using an internal kernel enum as part of its uAPI, split
this out into a dedicated uAPI enum just for RXE. It only uses the IPv4
and IPv6 values.
This was exposed by changing the internal kernel enum definition which
broke RXE.
Fixes: 1c15b4f2a42f ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type")
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
The code in setup_dma_device() has become rather convoluted; move all of
this to the drivers. Drivers now pass in a DMA-capable struct device which
will be used to set up DMA, or they must fully configure the ibdev for DMA
and pass in NULL.
Other than setting the masks in rvt all drivers were doing this already
anyhow.
mthca, mlx4 and mlx5 were already setting up the maximum DMA segment size
for DMA based on their hardware limits in:
__mthca_init_one()
dma_set_max_seg_size (1G)
__mlx4_init_one()
dma_set_max_seg_size (1G)
mlx5_pci_init()
set_dma_caps()
dma_set_max_seg_size (2G)
The other non-software drivers (except usnic) were extended to a UINT_MAX
segment size [1, 2] instead of the previous 2G.
[1] https://lore.kernel.org/linux-rdma/[email protected]/
[2] https://lore.kernel.org/linux-rdma/[email protected]/
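After this change a typical PCI-based driver registers like the sketch
below (the "mydrv%d" name string is illustrative):

  /* pass the DMA-capable device; the core sets up DMA from it */
  ret = ib_register_device(&dev->ibdev, "mydrv%d", &pdev->dev);

  /* software devices (rxe, siw, rvt) configure the ibdev for DMA
   * themselves and pass NULL instead */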
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/6b2ed339933d066622d5715903870676d8cc523a.1602590106.git.mchehab+huawei@kernel.org
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Parav Pandit <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Previously we computed L1.2 parameters in the enumeration path, saved them
in struct pcie_link_state.l1ss, and programmed them into the devices
whenever we enabled or disabled L1.2 on the link. But these parameters are
constant and don't need to be updated when enabling/disabling L1.2.
Compute and program the L1.2 parameters once during enumeration and remove
the struct pcie_link_state.l1ss member. No functional change intended.
[bhelgaas: rework to program L1.2 parameters during enumeration]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Previously we stored the L1SS Capabilities value in the struct
aspm_register_info.
We only need this information in one place, so read it there and remove
struct aspm_register_info completely, since it's now empty. No functional
change intended.
[bhelgaas: split up, don't cache l1ss_cap in pci_dev]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
aspm_calc_l1ss_info() needs only the L1SS Capabilities. It doesn't need
anything else from struct aspm_register_info, so pass only the Capabilities
value. No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Previously we stored the L1SS Control 1 register in the struct
aspm_register_info.
We only need this information in one place, so read it there and remove it
from struct aspm_register_info. No functional change intended.
[bhelgaas: split ctl1/ctl2]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
We never use the aspm_register_info.l1ss_ctl2 value, so remove it. No
functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Save the L1 Substates Capability pointer in struct pci_dev. Then we don't
have to keep track of it in the struct aspm_register_info and struct
pcie_link_state, which makes the code easier to read. No functional change
intended.
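With the pointer cached in struct pci_dev, users reduce to a sketch like:

  u32 cap;

  pci_read_config_dword(pdev, pdev->l1ss + PCI_L1SS_CAP, &cap);
  if (!(cap & PCI_L1SS_CAP_L1_PM_SS))
          return;  /* no L1 substates on this device */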
[bhelgaas: split to a separate patch]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Previously we stored L0s and L1 Exit Latency information from the Link
Capabilities register in the struct aspm_register_info.
We only need these latencies when we already have the Link Capabilities
values, so use those directly and remove the latencies from struct
aspm_register_info. No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Previously we stored the "ASPM Control" bits from the Link Control register
in the struct aspm_register_info.
Read PCI_EXP_LNKCTL directly when needed. This means we can use the
PCI_EXP_LNKCTL_ASPM_* bits directly instead of the similar but different
PCIE_LINK_STATE_* bits. No functional change intended.
[bhelgaas: drop get_aspm_enable() and read LNKCTL once directly]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Previously we stored the "ASPM Support" field from the Link Capabilities
register in the struct aspm_register_info.
Read the Link Capabilities directly when needed and remove it from the
struct aspm_register_info. No functional change intended.
[bhelgaas: remove pci_dev cached copy since LNKCAP isn't truly read-only,
add PCI_EXP_LNKCAP_ASPM_L0S & PCI_EXP_LNKCAP_ASPM_L1, check them directly
instead of adding aspm_support()]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Saheed O. Bolarinwa <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Other users of link->pdev and link->downstream, e.g., pcie_aspm_cap_init(),
pcie_config_aspm_l1ss(), and pcie_config_aspm_link(), use "parent" and
"child" as local names.
Do the same in aspm_calc_l1ss_info() for readability. No functional change
intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
pcie_get_aspm_reg() mostly reads ASPM-related registers, but in some cases
it also updates the value read from PCI_L1SS_CAP based on LTR properties.
Move this update to the point where the value is used to make the code more
readable.
No functional change intended, although previously we could clear
PCI_L1SS_CAP_ASPM_L1_2 for both ends of the link, and now we'll only do it
for the downstream end of a link. This shouldn't matter because we always
test that bit by ANDing l1ss_cap for the upstream and downstream ends.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
Move pci_clear_and_set_dword() earlier in file to prepare for future patch.
No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
|
|
The main intention of the max_segment argument to
__sg_alloc_table_from_pages() is to match the DMA layer segment size set
by dma_set_max_seg_size().
Restricting the input to be page aligned makes it impossible to just
connect the DMA layer to this API.
The only reason for a page alignment here is because the algorithm will
overshoot the max_segment if it is not a multiple of PAGE_SIZE. Simply fix
the alignment before starting and don't expose this implementation detail
to the callers.
A future patch will completely remove SCATTERLIST_MAX_SEGMENT.
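The fix amounts to normalizing the argument up front; a sketch:

  if (max_segment < PAGE_SIZE)
          return ERR_PTR(-EINVAL);

  /* keep the no-overshoot invariant without burdening callers */
  max_segment = ALIGN_DOWN(max_segment, PAGE_SIZE);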
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
From Maor Gottlieb:
====================
This series extends __sg_alloc_table_from_pages to allow chaining of new
pages to an already initialized SG table.
This allows for drivers to utilize the optimization of merging contiguous
pages without a need to pre allocate all the pages and hold them in a very
large temporary buffer prior to the call to SG table initialization.
The last patch changes the Infiniband core to use the new API. It removes
duplicate functionality from the code and benefits from the optimization
of allocating dynamic SG table from pages.
On a system using 2MB huge pages, without this change the SG table would
contain 512x as many SG entries.
====================
* branch 'dynamic_sg':
RDMA/umem: Move to allocate SG table from pages
lib/scatterlist: Add support in dynamic allocation of SG table from pages
tools/testing/scatterlist: Show errors in human readable form
tools/testing/scatterlist: Rejuvenate bit-rotten test
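A usage sketch of the extended API as I read the series (parameter order
per the lib/scatterlist patch; variable names illustrative):

  struct scatterlist *sg;

  /* first chunk: no previous entry, more pages still to come */
  sg = __sg_alloc_table_from_pages(&sgt, pages, n_pages, 0,
                                   (unsigned long)n_pages << PAGE_SHIFT,
                                   max_segment, NULL, n_left, GFP_KERNEL);

  /* follow-up chunk: chain onto the returned tail, nothing left */
  sg = __sg_alloc_table_from_pages(&sgt, more, n_more, 0,
                                   (unsigned long)n_more << PAGE_SHIFT,
                                   max_segment, sg, 0, GFP_KERNEL);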
|
|
Commit 720dee53ad8d ("tracing/boot: Initialize per-instance event
list in early boot") removed __init from __trace_early_add_events(),
but __trace_early_add_new_event() still has __init and will cause a
section mismatch.
Remove __init from __trace_early_add_new_event(), just as was done for
__trace_early_add_events().
Link: https://lore.kernel.org/lkml/CAHk-=wjU86UhovK4XuwvCqTOfc+nvtpAuaN2PJBz15z=w=u0Xg@mail.gmail.com/
Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Don't give an assertion failure on unpurgeable afs_server records - which
kills the thread - but rather emit a trace line when we are purging a
record (which only happens during network namespace removal or rmmod) and
print a notice of the problem.
Signed-off-by: David Howells <[email protected]>
|