path: root/include
Age  Commit message  Author  Files  Lines
2017-02-22  mm/swap: add cluster lock  Huang, Ying  1  -0/+6

This patch reduces lock contention on swap_info_struct->lock by using a finer-grained lock in swap_cluster_info for some swap operations. swap_info_struct->lock is heavily contended when multiple processes reclaim pages simultaneously, because there is only one lock per swap device and a common configuration has only one or a few swap devices in the system. The lock protects almost all swap-related operations.

In fact, many swap operations only access one element of the swap_info_struct->swap_map array, and there is no dependency between different elements of swap_info_struct->swap_map. So a fine-grained lock can be used to allow parallel access to different elements of swap_info_struct->swap_map.

In this patch, a spinlock is added to swap_cluster_info to protect the elements of swap_info_struct->swap_map in the swap cluster and the fields of swap_cluster_info. This greatly reduces locking contention for swap_info_struct->swap_map access.

Because of the added spinlock, the size of swap_cluster_info increases from 4 bytes to 8 bytes on both 64-bit and 32-bit systems. This uses an additional 4k of RAM for every 1G of swap space.

Because the size of swap_cluster_info is much smaller than the size of a cache line (8 vs 64 bytes on the x86_64 architecture), there may be false cache line sharing between the spinlocks in swap_cluster_info. To avoid false sharing in the first round of swap cluster allocation, the order of the swap clusters in the free clusters list is changed so that swap_cluster_info entries sharing the same cache line are placed as far apart as possible. After the first round of allocation, the order of the clusters in the free clusters list is expected to be random, so false sharing should not be serious.

Compared with a previous implementation using bit_spin_lock, the sequential swap out throughput improved by about 3.2%. The test was done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test sequential swapping out, the test case created 32 processes, which sequentially allocate and write to anonymous pages until the RAM and part of the swap device are used.

[[email protected]: v5] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: initialize spinlock for swap_cluster_info] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: annotate nested locking for cluster lock] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/dbb860bbd825b1aaba18988015e8963f263c3f0d.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <[email protected]> Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
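The memory overhead and cache line figures quoted in the cluster lock commit can be checked with a little arithmetic. The sketch below is only an illustration: the 4 KiB page size and the 256 pages per swap cluster are assumptions not stated in the commit message, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, not taken from the commit message: 4 KiB pages and
 * 256 pages per swap cluster. */
#define PAGE_SIZE_BYTES   4096ULL
#define PAGES_PER_CLUSTER 256ULL

/* Number of whole swap clusters covering swap_bytes of swap space. */
uint64_t swap_cluster_count(uint64_t swap_bytes)
{
    return swap_bytes / (PAGE_SIZE_BYTES * PAGES_PER_CLUSTER);
}

/* Extra metadata bytes when each per-cluster entry grows from
 * old_size to new_size bytes. */
uint64_t cluster_info_overhead(uint64_t swap_bytes,
                               uint64_t old_size, uint64_t new_size)
{
    assert(new_size >= old_size);
    return swap_cluster_count(swap_bytes) * (new_size - old_size);
}

/* How many per-cluster entries share one cache line, i.e. how many
 * spinlocks can end up falsely sharing it. */
uint64_t entries_per_cache_line(uint64_t line_size, uint64_t entry_size)
{
    assert(entry_size > 0);
    return line_size / entry_size;
}
```

Under these assumptions, 1 GiB of swap holds 1024 clusters, so growing each entry from 4 to 8 bytes costs 4096 bytes, which matches the "additional 4k of RAM for every 1G" figure. With 64-byte cache lines, eight 8-byte entries share a line, which is why the free list is reordered for the first allocation round.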
2017-02-22  powerpc: do not make the entire heap executable  Denys Vlasenko  1  -0/+1

On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt, or with a toolchain which defaults to it) look like this:

    [17] .sbss  NOBITS  0002aff8 01aff8 000014 00  WA  0  0  4
    [18] .plt   NOBITS  0002b00c 01aff8 000084 00 WAX  0  0  4
    [19] .bss   NOBITS  0002b090 01aff8 0000a4 00  WA  0  0  4

Which results in an ELF load header:

    Type  Offset    VirtAddr    PhysAddr    FileSiz  MemSiz   Flg  Align
    LOAD  0x019c70  0x00029c70  0x00029c70  0x01388  0x014c4  RWE  0x10000

This is all correct: the load region containing the PLT is marked as executable. Note that the PLT starts at 0002b00c but the file mapping ends at 0002aff8, so the PLT falls in the 0 fill section described by the load header, and after a page boundary.

Unfortunately the generic ELF loader ignores the X bit in the load headers when it creates the 0 filled non-file backed mappings. It assumes all of these mappings are RW BSS sections, which is not the case for PPC.

gcc/ld has an option (--secure-plt) to not do this; it is said to incur a small performance penalty.

Currently, to support 32-bit binaries with the PLT in BSS, the kernel maps the *entire brk area* with executable rights for all binaries, even --secure-plt ones.

Stop doing that. Teach the ELF loader to check the X bit in the relevant load header and create 0 filled anonymous mappings that are executable if the load header requests that.

Test program showing the difference in /proc/$PID/maps:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
        char buf[16*1024];
        char *p = malloc(123); /* make "[heap]" mapping appear */
        int fd = open("/proc/self/maps", O_RDONLY);
        int len = read(fd, buf, sizeof(buf));
        write(1, buf, len);
        printf("%p\n", p);
        return 0;
    }

Compiled using: gcc -mbss-plt -m32 -Os test.c -otest

Unpatched ppc64 kernel:

    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10690000-106c0000 rwxp 00000000 00:00 0 [heap]
    f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
    0x10690008

Patched ppc64 kernel:

    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10180000-101b0000 rw-p 00000000 00:00 0 [heap]
                      ^^^^ this has changed
    f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
    0x10180008

The patch was originally posted in 2012 by Jason Gunthorpe and apparently ignored: https://lkml.org/lkml/2012/9/30/138

Lightly run-tested.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Denys Vlasenko <[email protected]> Acked-by: Kees Cook <[email protected]> Acked-by: Michael Ellerman <[email protected]> Tested-by: Jason Gunthorpe <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Florian Weimer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
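The idea of honouring the X bit of the load header for the zero-filled part of a segment can be illustrated in user space. This is a minimal sketch, not the kernel change itself: the helper names are hypothetical, and it only shows how the ELF `p_flags` bits (`PF_R`, `PF_W`, `PF_X` from `<elf.h>`) would map onto `mmap()` protection bits.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <sys/mman.h>

/* Translate ELF segment permission flags into mmap()/mprotect()
 * protection bits. */
int elf_flags_to_prot(unsigned int p_flags)
{
    int prot = PROT_NONE;

    if (p_flags & PF_R)
        prot |= PROT_READ;
    if (p_flags & PF_W)
        prot |= PROT_WRITE;
    if (p_flags & PF_X)
        prot |= PROT_EXEC;
    return prot;
}

/* A PT_LOAD segment has a zero-filled tail when it occupies more
 * memory than it has bytes in the file. */
int has_zero_fill(const Elf32_Phdr *ph)
{
    assert(ph != NULL);
    return ph->p_type == PT_LOAD && ph->p_memsz > ph->p_filesz;
}

/* Protection for the zero-filled tail: taken from the load header
 * instead of being assumed to be read/write only. */
int zero_fill_prot(const Elf32_Phdr *ph)
{
    assert(ph != NULL);
    return elf_flags_to_prot(ph->p_flags);
}
```

For the load header shown in the commit message (FileSiz 0x01388, MemSiz 0x014c4, flags RWE), `has_zero_fill()` is true and `zero_fill_prot()` yields read, write and execute, whereas a plain RW header yields no execute permission.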
2017-02-22  mm: page_alloc: skip over regions of invalid pfns where possible  Paul Burton  1  -0/+1
When using a sparse memory model memmap_init_zone() when invoked with the MEMMAP_EARLY context will skip over pages which aren't valid - ie. which aren't in a populated region of the sparse memory map. However if the memory map is extremely sparse then it can spend a long time linearly checking each PFN in a large non-populated region of the memory map & skipping it in turn. When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient information to quickly discover the next valid PFN given an invalid one by searching through the list of memory regions & skipping forwards to the first PFN covered by the memory region to the right of the non-populated region. Implement this in order to speed up memmap_init_zone() for systems with extremely sparse memory maps. James said "I have tested this patch on a virtual model of a Samurai CPU with a sparse memory map. The kernel boot time drops from 109 to 62 seconds. " Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Paul Burton <[email protected]> Tested-by: James Hartley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
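The skip-ahead idea can be sketched outside the kernel as a lookup over a sorted list of populated PFN ranges. Everything here is illustrative: `struct pfn_region`, `next_valid_pfn()` and `NO_VALID_PFN` are hypothetical names, and the use of binary search is this sketch's choice rather than something the commit message specifies.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One populated range of page frame numbers: [start_pfn, end_pfn). */
struct pfn_region {
    unsigned long start_pfn;
    unsigned long end_pfn; /* exclusive */
};

#define NO_VALID_PFN ULONG_MAX

/* Given regions sorted by start_pfn and not overlapping, return pfn
 * itself if it lies inside a region, otherwise the first PFN of the
 * next region to the right, or NO_VALID_PFN if there is none. */
unsigned long next_valid_pfn(const struct pfn_region *regions, size_t n,
                             unsigned long pfn)
{
    size_t lo = 0, hi = n;

    assert(regions != NULL || n == 0);

    /* Binary search for the first region that ends above pfn. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (regions[mid].end_pfn <= pfn)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == n)
        return NO_VALID_PFN;
    return pfn >= regions[lo].start_pfn ? pfn : regions[lo].start_pfn;
}
```

An initialisation loop that hits an invalid PFN can then jump straight to the returned value instead of testing every PFN in the hole one by one, which is where the boot time saving on very sparse maps would come from.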
2017-02-22  mm, compaction: add vmstats for kcompactd work  David Rientjes  1  -0/+1
A "compact_daemon_wake" vmstat exists that represents the number of times kcompactd has woken up. This doesn't represent how much work it actually did, though. It's useful to understand how much compaction work is being done by kcompactd versus other methods such as direct compaction and explicitly triggered per-node (or system) compaction. This adds two new vmstats: "compact_daemon_migrate_scanned" and "compact_daemon_free_scanned" to represent the number of pages kcompactd has scanned as part of its migration scanner and freeing scanner, respectively. These values are still accounted for in the general "compact_migrate_scanned" and "compact_free_scanned" for compatibility. It could be argued that explicitly triggered compaction could also be tracked separately, and that could be added if others find it useful. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
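To compare kcompactd's share of the scanning work against the general counters, the new vmstats can be read from /proc/vmstat, which lists one "name value" pair per line. The parser below is a small sketch that works on text already read into memory; `vmstat_value()` is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up counter `name` in /proc/vmstat-style text, where every line
 * has the form "name value". Returns the value, or -1 if the counter
 * is not present (for example on a kernel without this patch). */
long long vmstat_value(const char *text, const char *name)
{
    size_t len = strlen(name);
    const char *line = text;

    assert(len > 0);
    while (line != NULL && *line != '\0') {
        /* Match the whole name, not just a prefix of a longer one. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ') {
            long long value;

            return sscanf(line + len, "%lld", &value) == 1 ? value : -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Since the daemon counters are still included in the general ones, subtracting `compact_daemon_migrate_scanned` from `compact_migrate_scanned` gives the pages scanned by the other compaction paths.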
2017-02-22  mm: un-export wake_up_page functions  Nicholas Piggin  1  -10/+2
These are no longer used outside mm/filemap.c, so un-export them and make them static where possible. These were exported specifically for NFS use in commit a4796e37c12e ("MM: export page_wakeup functions"). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Anna Schumaker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: add mm_vmscan_inactive_list_is_low tracepoint  Michal Hocko  1  -0/+40

Currently we have tracepoints for both active and inactive LRU lists reclaim but we do not have any which would tell us why we decided to age the active list. Without that it is quite hard to diagnose active/inactive lists balancing. Add the mm_vmscan_inactive_list_is_low tracepoint to tell us this information.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepoint  Michal Hocko  1  -3/+26

mm_vmscan_lru_shrink_inactive currently reports the number of scanned and reclaimed pages. Apart from the overall effectiveness, this doesn't give us an idea of how the reclaim went. Export and show other counters which tell us why we couldn't reclaim some pages.

- nr_dirty, nr_writeback, nr_congested and nr_immediate tell us how many pages are blocked due to IO
- nr_activate tells us how many pages were moved to the active list
- nr_ref_keep reports how many pages are kept on the LRU due to references (mostly for the file pages which are about to go for another round through the inactive list)
- nr_unmap_fail reports how many pages failed to unmap

All these are rather low level so they might change in the future, but the tracepoint is already implementation specific so no tools should be depending on its stability.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint  Michal Hocko  2  -6/+14
mm_vmscan_lru_isolate currently prints only whether the LRU we isolate from is file or anonymous but we do not know which LRU this is. It is useful to know whether the list is active or inactive, since we are using the same function to isolate pages from both of them and it's hard to distinguish otherwise. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: show the number of skipped pages in mm_vmscan_lru_isolate  Michal Hocko  1  -2/+6

mm_vmscan_lru_isolate shows the number of requested, scanned and taken pages. This is mostly OK but on 32b systems the number of scanned pages is quite misleading because it includes both the scanned and skipped pages. Moreover the skipped part is scaled based on the number of taken pages. Let's report the exact numbers without any additional logic and add the number of skipped pages. This should make the reported data much easier to interpret.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: add active list aging tracepoint  Michal Hocko  1  -0/+36

Our reclaim process has several tracepoints to tell us more about how things are progressing. We are, however, missing a tracepoint to track active list aging. Introduce mm_vmscan_lru_shrink_active, which reports:

- nr_taken, the number of pages isolated from the active list
- nr_referenced, which tells us that we are hitting referenced pages which are deactivated. If this is a large part of the reported nr_deactivated pages then we might be hitting into the active list too early because they might still be part of the working set. This might help to debug performance issues.
- nr_active, which tells us how many pages are kept on the active list, mostly exec file backed pages. A high number can indicate that we might be thrashing on executables.

[[email protected]: update] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  mm, vmscan: remove unused mm_vmscan_memcg_isolate  Michal Hocko  1  -30/+1

Patch series "vm, vmscan: enhance vmscan tracepoints", v2.

While debugging [2] I've realized that there is some room for improvement in the set of tracepoints we currently offer. I had a hard time drawing any conclusion from the existing ones. The resulting problem turned out to be active list aging [3] and we are missing at least two tracepoints to debug such a problem.

Some existing tracepoints could export more information to show _why_ reclaim progress cannot be made, not only _how much_ we could reclaim. The latter can already be seen quite reasonably from the vmstat counters. It can be argued that we are showing too many implementation details in those tracepoints but I consider them way too low-level already to be usable by any kernel independent userspace. I would be _really_ surprised if anything but debugging tools have used them.

Any feedback is highly appreciated.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]

This patch (of 8):

The trace point is not used since 925b7673cce3 ("mm: make per-memcg LRU lists exclusive") so it can be removed.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_SHMEM  Andrea Arcangeli  1  -1/+7
Userland developers asked to be notified immediately by the UFFDIO_API ioctl if shmem missing mode is supported by userfaultfd in the running kernel. This avoids the need to run UFFDIO_REGISTER on a shmem virtual memory range to find out. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: shmem: allow registration of shared memory ranges  Mike Rapoport  1  -1/+1
Expand the userfaultfd_register/unregister routines to allow shared memory VMAs. Currently, there is no UFFDIO_ZEROPAGE and write-protection support for shared memory VMAs, which is reflected in ioctl methods supported by uffdio_register. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: shmem: introduce vma_is_shmem  Mike Rapoport  1  -0/+10

Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to ensure compatibility of a VMA with userfault. Introduction of vma_is_shmem allows detection of tmpfs backed VMAs, so that they may be used with userfaultfd. The current implementation presumes that vma_is_shmem is used only by slow path routines in userfaultfd, therefore vma_is_shmem is not made inline, which leaves the few remaining free bits in vm_flags untouched.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support  Mike Rapoport  1  -0/+11
shmem_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for shared memory pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_HUGETLBFS  Andrea Arcangeli  1  -3/+25
Userland developers asked to be notified immediately by the UFFDIO_API ioctl if hugetlbfs missing mode is supported by userfaultfd in the running kernel. This avoids the need to run UFFDIO_REGISTER on a hugetlbfs virtual memory range to find out. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY  Andrea Arcangeli  1  -2/+3
Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features will work on hugetlbfs. This is required for fully functional userfaultfd non-present support on hugetlbfs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: allow registration of ranges containing huge pages  Mike Kravetz  1  -0/+3
Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB vmas. huge page alignment checking is performed after a VM_HUGETLB vma is encountered. Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not return that as a valid ioctl method for huge page ranges. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing  Mike Kravetz  1  -1/+2

The new routine copy_huge_page_from_user() uses kmap_atomic() to map PAGE_SIZE pages. However, this prevents page faults in the subsequent call to copy_from_user(). This is OK in the case where the routine is called with mmap_sem held. However, in another case we want to allow page faults. So, add a new argument allow_pagefault to indicate if the routine should allow page faults.

[[email protected]: unmap the correct pointer] Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda [[email protected]: kunmap() takes a page*, per Hugh] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support  Mike Kravetz  1  -0/+7
hugetlb_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for huge pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support  Mike Kravetz  1  -0/+3

userfaultfd UFFDIO_COPY allows user level code to copy data to a page at fault time. The data is copied from user space to a newly allocated huge page. The new routine copy_huge_page_from_user performs this copy.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request  Pavel Emelyanov  2  -1/+21
If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Emelyanov <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete()  Andrea Arcangeli  1  -2/+2
Optimize the mremap_userfaultfd_complete() interface to pass only the vm_userfaultfd_ctx pointer through the stack as a microoptimization. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: Hillf Danton <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: non-cooperative: add mremap() event  Pavel Emelyanov  2  -1/+27

The event denotes that an area [start:end] moves to a different location. A length change isn't reported: if "new" addresses appear on the uffd reader side they will not contain any data, and the reader can just zeromap them.

Waiting for the event ACK is also done outside of mmap sem, as for the fork event.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Emelyanov <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: non-cooperative: Add fork() event  Pavel Emelyanov  2  -10/+18
When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. [[email protected]: build warning fix] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Emelyanov <[email protected]> Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22  userfaultfd: document _IOR/_IOW  Andrea Arcangeli  2  -1/+15

Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2

These userfaultfd features are finished and are ready for larger exposure in -mm and upstream merging.

1) tmpfs non present userfault
2) hugetlbfs non present userfault
3) non cooperative userfault for fork/madvise/mremap

qemu development code is already exercising 2) and container postcopy live migration needs 3).

1) is not currently used but there's a self test, and we know some qemu users use tmpfs as backing for KVM for various reasons, so they will need it too in order to use postcopy live migration with tmpfs memory.

All review feedback from the previous submission has been handled and the fixes are included. There's no outstanding issue AFAIK.

Upstream code just did a s/fe/vmf/ conversion in the page faults and this has been converted as well incrementally.

In addition to the previous submissions, this also wakes up stuck userfaults during UFFDIO_UNREGISTER. The non cooperative testcase actually reproduced this problem by getting stuck instead of quitting cleanly in some rare cases, as it could call UFFDIO_UNREGISTER while some userfault could still be in flight. The other option would have been to keep leaving it up to userland to serialize itself and to patch the testcase instead, but the wakeup during unregister I think is preferable.

David also asked for the UFFD_FEATURE_MISSING_HUGETLBFS and UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid probing whether the hugetlbfs/shmem missing support is available by calling UFFDIO_REGISTER. QEMU already checks HUGETLBFS_MAGIC with fstatfs so if UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows UFFDIO_REGISTER will succeed (or if it fails, it's for some other more concerning reason). There's no reason to worry about adding too many feature flags. There are 64 available and in the worst case we have to bump the API if someday we're really going to run out of them.

The round-trip network latency of hugetlbfs userfaults during postcopy live migration is still on the order of a dozen milliseconds on 10GBit at 2MB hugepage granularity, so it's working perfectly and it should provide for higher bandwidth or lower CPU usage (which makes it interesting to add an option in the future to support THP granularity too for anonymous memory; UFFDIO_COPY would then have to create THP if alignment/len allows for it).

1GB hugetlbfs granularity will require big changes in hugetlbfs to work so it's deferred for later.

This patch (of 42):

This adds proper documentation (inline) to avoid the risk of further misunderstandings about the semantics of _IOW/_IOR, and it also reminds whoever bumps the UFFDIO_API in the future to change the two ioctls to _IOW.

This was found while implementing strace support for those ioctls; otherwise we could never have found it by just reviewing kernel code and testing it.

_IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so it's only worth fixing if the UFFDIO_API is bumped someday.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: "Dmitry V. Levin" <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
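The remark that _IOC_READ or _IOC_WRITE alters nothing but the ioctl number can be seen by decoding request numbers with the standard `_IOC_DIR`, `_IOC_TYPE`, `_IOC_NR` and `_IOC_SIZE` macros. The sketch below uses a made-up request (type `'x'`, number 1) rather than the real userfaultfd ioctls, and the helper names are hypothetical.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Human-readable name of the direction encoded in an ioctl request. */
const char *ioctl_dir_name(unsigned long request)
{
    unsigned int dir = _IOC_DIR(request);

    if (dir == _IOC_NONE)
        return "none";
    if (dir == _IOC_READ)
        return "read";
    if (dir == _IOC_WRITE)
        return "write";
    if (dir == (_IOC_READ | _IOC_WRITE))
        return "read/write";
    return "unknown";
}

/* Nonzero when two requests share type, number and argument size, so
 * that any difference between them lies in the direction bits only. */
int ioctl_same_except_dir(unsigned long a, unsigned long b)
{
    return _IOC_TYPE(a) == _IOC_TYPE(b) &&
           _IOC_NR(a) == _IOC_NR(b) &&
           _IOC_SIZE(a) == _IOC_SIZE(b);
}

/* Self-check on a hypothetical request: switching between _IOR and
 * _IOW changes the request value, but only through the direction. */
void ioctl_dir_selfcheck(void)
{
    unsigned long r = _IOR('x', 1, int);
    unsigned long w = _IOW('x', 1, int);

    assert(r != w);
    assert(ioctl_same_except_dir(r, w));
}
```

This is also why a tool such as strace notices the discrepancy while the kernel does not: the direction is part of the number a decoder interprets, whereas a handler simply matches on the full request value.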
2017-02-22oom, trace: add compaction retry tracepointMichal Hocko2-0/+65
Debugging OOMs for higher order requests is currently quite hard. We do have some compaction points which can tell us how the compaction is operating but there is no trace point to tell us about compaction retry logic. This patch adds one, which has the following format: bash-3126 [001] .... 1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0 We can see that the order 9 request is not retried even though we are in the highest compaction priority mode because the last compaction attempt was withdrawn. This means that compaction_zonelist_suitable must have returned false and there is no suitable zone to compact for this request and so no need to retry further. Another example would be: <...>-3137 [001] .... 81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0 In this case the order-9 compaction failed to find any suitable block. We do not retry anymore because this is a costly request and those do not go below COMPACT_PRIO_SYNC_LIGHT priority. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22oom, trace: add oom detection tracepointsMichal Hocko1-0/+42
should_reclaim_retry is the central decision point for declaring the OOM. It might be really useful to expose data used for this decision making when debugging unexpected OOM situations. Say we have an OOM report: [ 52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0 [ 52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G W 4.8.0-oomtrace3-00006-gb21338b386d2 #1024 Now we can check the tracepoint data to see how we have ended up in this situation: mem_eater-3148 [003] .... 52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1 mem_eater-3148 [003] .... 52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1 mem_eater-3148 [003] .... 52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1 mem_eater-3148 [003] .... 52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1 mem_eater-3148 [003] .... 52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1 mem_eater-3148 [003] .... 52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1 mem_eater-3148 [003] .... 52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1 mem_eater-3148 [003] .... 52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0 mem_eater-3148 [003] .... 
52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0 The above shows that we can quickly deduce that the reclaim stopped making any progress (see no_progress_loops increasing in each round) and while there were still some 51 reclaimable pages they couldn't be dropped for some reason (vmscan trace points would tell us more about that part). available represents reclaimable + free_pages scaled down per no_progress_loops factor. This is essentially an optimistic estimate of how much memory we would have when reclaiming everything. This can be compared to min_wmark to get a rough idea but the wmark_check tells the result of the watermark check which is more precise (includes lowmem reserves, considers the order etc.). As we can see no zone is eligible in the end and that is why we have triggered the oom in this situation. Please note that higher order requests might fail on the wmark_check even when there is much more memory available than min_wmark - e.g. when the memory is fragmented. A follow up tracepoint will help to debug those situations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
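A rough sketch of how an "available" estimate of this kind could be computed is shown below. The message only says that the value is reclaimable + free_pages scaled down by a no_progress_loops factor; the linear scaling, the retry cap of 16 and the function names here are assumptions made for illustration, not a statement of what the kernel does.

```c
/* Assumed cap on no-progress rounds; hypothetical value for this sketch. */
#define DEMO_MAX_RECLAIM_RETRIES 16UL

/* Integer division rounding up. */
static unsigned long demo_div_round_up(unsigned long n, unsigned long d)
{
    return (n + d - 1) / d;
}

/*
 * Optimistic estimate of usable pages in a zone: discount the reclaimable
 * pages linearly with each round that made no progress, then add the pages
 * that are already free.
 */
static unsigned long demo_estimate_available(unsigned long reclaimable,
                                             unsigned long free_pages,
                                             unsigned long no_progress_loops)
{
    unsigned long discount =
        demo_div_round_up(no_progress_loops * reclaimable,
                          DEMO_MAX_RECLAIM_RETRIES);

    /* Never discount more than what is reclaimable. */
    if (discount > reclaimable)
        discount = reclaimable;

    return reclaimable - discount + free_pages;
}
```

Under these assumptions the estimate shrinks toward the bare free page count as no_progress_loops grows, which matches the shape of the trace above: reclaimable stays at 51 while available drifts down to min_wmark.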
2017-02-22mm, trace: extract COMPACTION_STATUS and ZONE_TYPE to a common headerMichal Hocko2-57/+67
COMPACTION_STATUS resp. ZONE_TYPE are currently used to translate enum compact_result resp. struct zone index into their symbolic names for an easier post processing. The follow up patch would like to reuse this as well. The code involves some preprocessor black magic which is better not duplicated elsewhere so move it to a common mm tracing related header. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22slab: use memcg_kmem_cache_wq for slab destruction operationsTejun Heo1-0/+1
If there's contention on slab_mutex, queueing the per-cache destruction work item on the system_wq can unnecessarily create and tie up a lot of kworkers. Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it global and use that workqueue for the destruction work items too. While at it, convert the workqueue from an unbound workqueue to a per-cpu one with concurrency limited to 1. It's generally preferable to use per-cpu workqueues and concurrency limit of 1 is safe enough. This is suggested by Joonsoo Kim. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jay Vana <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22slab: remove synchronous synchronize_sched() from memcg cache deactivation pathTejun Heo1-0/+6
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slub uses synchronize_sched() to deactivate a memcg cache. synchronize_sched() is an expensive and slow operation and doesn't scale when a huge number of caches are destroyed back-to-back. While there used to be a simple batching mechanism, the batching was too restricted to be helpful. This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub can use to schedule sched RCU callback instead of performing synchronize_sched() synchronously while holding cgroup_mutex. While this adds online cpus, mems and slab_mutex operations, operating on these locks back-to-back from the same kworker, which is what's gonna happen when there are many to deactivate, isn't expensive at all and this gets rid of the scalability problem completely. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jay Vana <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22slab: implement slab_root_caches listTejun Heo1-0/+3
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slab_caches currently lists all caches including root and memcg ones. This is the only data structure which lists the root caches and iterating root caches can only be done by walking the list while skipping over memcg caches. As there can be a huge number of memcg caches, this can become very expensive. This also can make /proc/slabinfo behave very badly. seq_file processes reads in 4k chunks and seeks to the previous Nth position on slab_caches list to resume after each chunk. With a lot of memcg cache churns on the list, reading /proc/slabinfo can become very slow and its content often ends up with duplicate and/or missing entries. This patch adds a new list slab_root_caches which lists only the root caches. When memcg is not enabled, it becomes just an alias of slab_caches. memcg specific list operations are collected into memcg_[un]link_cache(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jay Vana <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22slab: link memcg kmem_caches on their associated memory cgroupTejun Heo2-0/+4
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. While a memcg kmem_cache is listed on its root cache's ->children list, there is no direct way to iterate all kmem_caches which are associated with a memory cgroup. The only way to iterate them is walking all caches while filtering out caches which don't match, which would be most of them. This makes memcg destruction operations O(N^2) where N is the total number of slab caches which can be huge. This combined with the synchronous RCU operations can tie up a CPU and affect the whole machine for many hours when memory reclaim triggers offlining and destruction of the stale memcgs. This patch adds mem_cgroup->kmem_caches list which goes through memcg_cache_params->kmem_caches_node of all kmem_caches which are associated with the memcg. All memcg specific iterations, including stat file access, are updated to use the new list instead. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jay Vana <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
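The difference between filtering a global list and walking a per-memcg list can be sketched in plain userspace C. The structures and function names below are hypothetical stand-ins, not the kernel's kmem_cache, mem_cgroup or list_head types; they only illustrate why a dedicated list removes the scan over unrelated caches.

```c
#include <stddef.h>

struct demo_memcg;

struct demo_cache {
    struct demo_memcg *owner;          /* NULL for a root cache */
    struct demo_cache *next_global;    /* link on the list of all caches */
    struct demo_cache *next_in_memcg;  /* link on the owner's own list */
};

struct demo_memcg {
    struct demo_cache *caches;         /* head of this memcg's caches */
};

/* Register a cache on the global list and, if owned, on its memcg's list. */
static void demo_link_cache(struct demo_cache **all, struct demo_memcg *cg,
                            struct demo_cache *c)
{
    c->owner = cg;
    c->next_global = *all;
    *all = c;
    c->next_in_memcg = NULL;
    if (cg) {
        c->next_in_memcg = cg->caches;
        cg->caches = c;
    }
}

/* Old approach: visit every cache and filter by owner, O(total caches). */
static size_t demo_count_by_scan(const struct demo_cache *all,
                                 const struct demo_memcg *cg)
{
    size_t n = 0;

    for (const struct demo_cache *c = all; c; c = c->next_global)
        if (c->owner == cg)
            n++;
    return n;
}

/* New approach: visit only the caches of this memcg. */
static size_t demo_count_by_list(const struct demo_memcg *cg)
{
    size_t n = 0;

    for (const struct demo_cache *c = cg->caches; c; c = c->next_in_memcg)
        n++;
    return n;
}
```

Both counting functions return the same result, but the second touches only the caches that belong to the memcg. When every memcg being destroyed has to find its own caches, the first approach repeats a full scan per memcg, which is where the quadratic cost described above comes from.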
2017-02-22slab: reorganize memcg_cache_paramsTejun Heo1-9/+24
We're going to change how memcg caches are iterated. In preparation, clean up and reorganize memcg_cache_params. * The shared ->list is replaced by ->children in root and ->children_node in children. * ->is_root_cache is removed. Instead ->root_cache is moved out of the child union and now used by both root and children. NULL indicates root cache. Non-NULL a memcg one. This patch doesn't cause any observable behavior changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
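The "NULL indicates root cache" convention can be illustrated with a minimal sketch. The types and helper names are hypothetical and much simpler than the real memcg_cache_params; the only point is that a single pointer can replace a separate ->is_root_cache flag.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_kmem_cache;

/* Simplified stand-in for the per-cache memcg parameters. */
struct demo_memcg_params {
    struct demo_kmem_cache *root_cache;  /* NULL: this cache is a root */
};

struct demo_kmem_cache {
    const char *name;
    struct demo_memcg_params memcg_params;
};

/* A cache is a root cache exactly when it has no root_cache pointer. */
static bool demo_is_root_cache(const struct demo_kmem_cache *s)
{
    return s->memcg_params.root_cache == NULL;
}

/* Map any cache to its root: roots map to themselves. */
static const struct demo_kmem_cache *
demo_root_of(const struct demo_kmem_cache *s)
{
    return demo_is_root_cache(s) ? s : s->memcg_params.root_cache;
}
```

With this layout the root/child distinction cannot drift out of sync with the pointer, since there is no second field to update.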
2017-02-22slub: separate out sysfs_slab_release() from sysfs_slab_remove()Tejun Heo1-2/+2
Separate out slub sysfs removal and release, and call the former earlier from __kmem_cache_shutdown(). There's no reason to defer sysfs removal through RCU and this will later allow us to remove sysfs files way earlier during memory cgroup offline instead of release. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22mm, dax: change pmd_fault() to take only vmf parameterDave Jiang3-34/+29
pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22mm, dax: make pmd_fault() and friends be the same as fault()Dave Jiang3-14/+11
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [[email protected]: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22dax: add tracepoints to dax_pmd_insert_mapping()Ross Zwisler2-0/+57
Add tracepoints to dax_pmd_insert_mapping(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: big-1504 [001] .... 326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1504 [001] .... 326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1504 [001] .... 326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV|MAP radix_entry 0xc000e big-1504 [001] .... 326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ross Zwisler <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22dax: add tracepoints to dax_pmd_load_hole()Ross Zwisler1-0/+42
Add tracepoints to dax_pmd_load_hole(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: read_big-1478 [004] .... 238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 read_big-1478 [004] .... 238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 read_big-1478 [004] .... 238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page ffffea0002c20000 radix_entry 0x1e read_big-1478 [004] .... 238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ross Zwisler <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22dax: add tracepoint infrastructure, PMD tracingRoss Zwisler2-0/+93
Tracepoints are the standard way to capture debugging and tracing information in many parts of the kernel, including the XFS and ext4 filesystems. Create a tracepoint header for FS DAX and add the first DAX tracepoints to the PMD fault handler. This allows the tracing for DAX to be done in the same way as the filesystem tracing so that developers can look at them together and get a coherent idea of what the system is doing. I added both an entry and exit tracepoint because future patches will add tracepoints to child functions of dax_iomap_pmd_fault() like dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to be wrapped by the parent function tracepoints so the code flow is more easily understood. Having entry and exit tracepoints for faults also allows us to easily see what filesystems functions were called during the fault. These filesystem functions get executed via iomap_begin() and iomap_end() calls, for example, and will have their own tracepoints. For PMD faults we primarily want to understand the type of mapping, the fault flags, the faulting address and whether it fell back to 4k faults. If it fell back to 4k faults the tracepoints should let us understand why. I named the new tracepoint header file "fs_dax.h" to allow for device DAX to have its own separate tracing header in the same directory at some point. Here is an example output for these events from a successful PMD fault: big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1441 [005] .... 
32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ross Zwisler <[email protected]> Suggested-by: Dave Chinner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-02-22tracing: add __print_flags_u64()Ross Zwisler2-0/+15
Patch series "DAX tracepoints, mm argument simplification", v4. This contains both my DAX tracepoint code and Dave Jiang's MM argument simplifications. Dave's code was written with my tracepoint code as a baseline, so it seemed simplest to keep them together in a single series. This patch (of 7): Add __print_flags_u64() and the helper trace_print_flags_seq_u64() in the same spirit as __print_symbolic_u64() and trace_print_symbols_seq_u64(). These functions allow us to print symbols associated with flags that are 64 bits wide even on 32 bit machines. These will be used by the DAX code so that we can print the flags set in a pfn_t such as PFN_SG_CHAIN, PFN_SG_LAST, PFN_DEV and PFN_MAP. Without this new function I was getting errors like the following when compiling for i386: include/linux/pfn_t.h:13:22: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1)) ^ Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ross Zwisler <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
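The need for a 64-bit flag printer can be sketched in userspace C. This is not the kernel's __print_flags_u64() or trace_print_flags_seq_u64(); the function, struct and table names are hypothetical. Only the top-bit position of PFN_SG_CHAIN is taken from the warning quoted above, and the other bit positions are assumptions. On a 32-bit machine a mask such as 1ULL << 63 does not fit in unsigned long, which is why the masks are held in a 64-bit type here.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct demo_flag_name_u64 {
    uint64_t mask;      /* always 64 bits wide, even on 32-bit targets */
    const char *name;
};

/* Illustrative table; bit positions other than bit 63 are assumptions. */
static const struct demo_flag_name_u64 demo_pfn_flags[] = {
    { 1ULL << 63, "SG_CHAIN" },
    { 1ULL << 62, "SG_LAST" },
    { 1ULL << 61, "DEV" },
    { 1ULL << 60, "MAP" },
};
#define DEMO_PFN_FLAG_COUNT (sizeof(demo_pfn_flags) / sizeof(demo_pfn_flags[0]))

/*
 * Write the names of all set flags into buf, separated by '|'. Bits with no
 * name in the table are appended as one hex value. Returns the string length.
 */
static size_t demo_format_flags_u64(char *buf, size_t len, uint64_t flags,
                                    const struct demo_flag_name_u64 *table,
                                    size_t n)
{
    size_t used = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < n && flags; i++) {
        if ((flags & table[i].mask) != table[i].mask)
            continue;
        flags &= ~table[i].mask;
        int w = snprintf(buf + used, len - used, "%s%s",
                         used ? "|" : "", table[i].name);
        if (w < 0 || (size_t)w >= len - used)
            return strlen(buf);          /* output was truncated */
        used += (size_t)w;
    }

    if (flags)                           /* leftover, unnamed bits */
        snprintf(buf + used, len - used, "%s0x%llx",
                 used ? "|" : "", (unsigned long long)flags);

    return strlen(buf);
}
```

Formatting (1ULL << 61) | (1ULL << 60) with the demo table yields the string DEV|MAP, the same shape as the flag field in the dax_pmd_insert_mapping trace line.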
2017-02-22uapi: fix linux/if.h userspace compilation errorsDmitry V. Levin1-0/+4
Include <sys/socket.h> (guarded by ifndef __KERNEL__) to fix the following linux/if.h userspace compilation errors: /usr/include/linux/if.h:234:19: error: field 'ifru_addr' has incomplete type struct sockaddr ifru_addr; /usr/include/linux/if.h:235:19: error: field 'ifru_dstaddr' has incomplete type struct sockaddr ifru_dstaddr; /usr/include/linux/if.h:236:19: error: field 'ifru_broadaddr' has incomplete type struct sockaddr ifru_broadaddr; /usr/include/linux/if.h:237:19: error: field 'ifru_netmask' has incomplete type struct sockaddr ifru_netmask; /usr/include/linux/if.h:238:20: error: field 'ifru_hwaddr' has incomplete type struct sockaddr ifru_hwaddr; This also fixes userspace compilation of the following uapi headers: linux/atmbr2684.h linux/gsmmux.h linux/if_arp.h linux/if_bonding.h linux/if_frad.h linux/if_pppox.h linux/if_tunnel.h linux/netdevice.h linux/route.h linux/wireless.h As no uapi header provides a definition of struct sockaddr, inclusion of <sys/socket.h> seems to be the most conservative and the only safe fix available. All current users of <linux/if.h> are very likely to be including <sys/socket.h> already because the latter is the sole provider of struct sockaddr definition in libc, so adding a uapi header with a definition of struct sockaddr would create a potential conflict with <sys/socket.h>. Replacing struct sockaddr in the definition of struct ifreq with a different type would create a potential incompatibility with current users of struct ifreq who might rely on ifru_addr et al members being of type struct sockaddr. Signed-off-by: Dmitry V. Levin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-02-22Merge tag 'tty-4.11-rc1' of ↵Linus Torvalds7-37/+287
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here is the big tty/serial driver patchset for 4.11-rc1. Not much here, but a lot of little fixes and individual serial driver updates all over the subsystem. Majority are for the sh-sci driver and platform (the arch-specific changes have acks from the maintainer). The start of the "serial bus" code is here as well, but nothing is converted to use it yet. That work is still ongoing, hopefully will start to show up across different subsystems for 4.12 (bluetooth is one major place that will be used.) All of these have been in linux-next for a while with no reported issues" * tag 'tty-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (109 commits) tty: pl011: Work around QDF2400 E44 stuck BUSY bit atmel_serial: Use the fractional divider when possible tty: Remove extra include in HVC console tty framework serial: exar: Enable MSI support serial: exar: Move register defines from uapi header to consumer site serial: pci: Remove unused pci_boards entries serial: exar: Move Commtech adapters to 8250_exar as well serial: exar: Fix feature control register constants serial: exar: Fix initialization of EXAR registers for ports > 0 serial: exar: Fix mapping of port I/O resources serial: sh-sci: fix hardware RX trigger level setting tty/serial: atmel: ensure state is restored after suspending serial: 8250_dw: Avoid "too much work" from bogus rx timeout interrupt serdev: ttyport: check whether tty_init_dev() fails serial: 8250_pci: make pciserial_detach_ports() static ARM: dts: STiH410-b2260: Enable HW flow-control ARM: dts: STiH407-family: Use new Pinctrl groups ARM: dts: STiH407-pinctrl: Add Pinctrl group for HW flow-control ARM: dts: STiH410-b2260: Identify the UART RTS line dt-bindings: serial: Update 'uart-has-rtscts' description ...
2017-02-22Merge tag 'staging-4.11-rc1' of ↵Linus Torvalds10-158/+323
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging/iio driver updates from Greg KH: "Here is the big staging and iio driver patchsets for 4.11-rc1. We almost broke even this time around, with only a few thousand lines added overall, as we removed the old and obsolete i4l code, but added some new drivers for the RPi platform, as well as adding some new IIO drivers. All of these have been in linux-next for a while with no reported issues" * tag 'staging-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (669 commits) Staging: vc04_services: Fix the "space prohibited" code style errors Staging: vc04_services: Fix the "wrong indent" code style errors staging: octeon: Use net_device_stats from struct net_device Staging: rtl8192u: ieee80211: ieee80211.h - style fix Staging: rtl8192u: ieee80211: ieee80211_tx.c - style fix Staging: rtl8192u: ieee80211: rtl819x_BAProc.c - style fix Staging: rtl8192u: ieee80211: ieee80211_module.c - style fix Staging: rtl8192u: ieee80211: rtl819x_TSProc.c - style fix Staging: rtl8192u: r8192U.h - style fix Staging: rtl8192u: r8192U_core.c - style fix Staging: rtl8192u: r819xU_cmdpkt.c - style fix staging: rtl8192u: blank lines aren't necessary before a close brace '}' staging: rtl8192u: Adding space after enum and struct definition staging: rtl8192u: Adding space after struct definition Staging: ks7010: Add required and preferred spaces around operators Staging: ks7010: ks*: Remove redundant blank lines Staging: ks7010: ks*: Add missing blank lines after declarations staging: visorbus, replace init_timer with setup_timer staging: vt6656: rxtx.c Removed multiple dereferencing staging: vt6656: Alignment match open parenthesis ...
2017-02-22Merge tag 'driver-core-4.11-rc1' of ↵Linus Torvalds3-19/+11
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the "small" driver core patches for 4.11-rc1. Not much here, some firmware documentation and self-test updates, a debugfs code formatting issue, and a new feature for call_usermodehelper to make it more robust on systems that want to lock it down in a more secure way. All of these have been in linux-next for a while now with no reported issues" * tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: handle null pointers while printing node name and path Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper() Make static usermode helper binaries constant kmod: make usermodehelper path a const string firmware: revamp firmware documentation selftests: firmware: send expected errors to /dev/null selftests: firmware: only modprobe if driver is missing platform: Print the resource range if device failed to claim kref: prefer atomic_inc_not_zero to atomic_add_unless debugfs: improve formatting of debugfs_real_fops()
2017-02-22Merge tag 'char-misc-4.11-rc1' of ↵Linus Torvalds11-134/+285
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big char/misc driver patchset for 4.11-rc1. Lots of different driver subsystems updated here: rework for the hyperv subsystem to handle new platforms better, mei and w1 and extcon driver updates, as well as a number of other "minor" driver updates. All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (169 commits) goldfish: Sanitize the broken interrupt handler x86/platform/goldfish: Prevent unconditional loading vmbus: replace modulus operation with subtraction vmbus: constify parameters where possible vmbus: expose hv_begin/end_read vmbus: remove conditional locking of vmbus_write vmbus: add direct isr callback mode vmbus: change to per channel tasklet vmbus: put related per-cpu variable together vmbus: callback is in softirq not workqueue binder: Add support for file-descriptor arrays binder: Add support for scatter-gather binder: Add extra size to allocator binder: Refactor binder_transact() binder: Support multiple /dev instances binder: Deal with contexts in debugfs binder: Support multiple context managers binder: Split flat_binder_object auxdisplay: ht16k33: remove private workqueue auxdisplay: ht16k33: rework input device initialization ...
2017-02-22Merge tag 'usb-4.11-rc1' of ↵Linus Torvalds2-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB/PHY updates from Greg KH: "Here is the big USB and PHY driver updates for 4.11-rc1. Nothing major, just the normal amount of churn in the usb gadget and dwc and xhci controllers, new device ids, new phy drivers, a new usb-serial driver, and a few other minor changes in different USB drivers. All have been in linux-next for a long time with no reported issues" * tag 'usb-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (265 commits) usb: cdc-wdm: remove logically dead code USB: serial: keyspan: drop header file USB: serial: io_edgeport: drop io-tables header file usb: musb: add code comment for clarification usb: misc: add USB251xB/xBi Hi-Speed Hub Controller Driver usb: misc: usbtest: remove redundant check on retval < 0 USB: serial: upd78f0730: sort device ids USB: serial: upd78f0730: add ID for EVAL-ADXL362Z ohci-hub: fix typo in dbg_port macro usb: musb: dsps: Manage CPPI 4.1 DMA interrupt in DSPS usb: musb: tusb6010: Clean up tusb_omap_dma structure usb: musb: cppi_dma: Clean up cppi41_dma_controller structure usb: musb: cppi_dma: Clean up cppi structure usb: musb: cppi41: Detect aborted transfers in cppi41_dma_callback() usb: musb: dma: Add a DMA completion platform callback drivers: usb: usbip: Add missing break statement to switch usb: mtu3: remove redundant dev_err call in get_ssusb_rscs() USB: serial: mos7840: fix another NULL-deref at open USB: serial: console: clean up sanity checks USB: serial: console: fix uninitialised spinlock ...
2017-02-22Merge tag 'arm64-upstream' of ↵Linus Torvalds4-11/+47
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: - Errata workarounds for Qualcomm's Falkor CPU - Qualcomm L2 Cache PMU driver - Qualcomm SMCCC firmware quirk - Support for DEBUG_VIRTUAL - CPU feature detection for userspace via MRS emulation - Preliminary work for the Statistical Profiling Extension - Misc cleanups and non-critical fixes * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits) arm64/kprobes: consistently handle MRS/MSR with XZR arm64: cpufeature: correctly handle MRS to XZR arm64: traps: correctly handle MRS/MSR with XZR arm64: ptrace: add XZR-safe regs accessors arm64: include asm/assembler.h in entry-ftrace.S arm64: fix warning about swapper_pg_dir overflow arm64: Work around Falkor erratum 1003 arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2 arm64: arch_timer: document Hisilicon erratum 161010101 arm64: use is_vmalloc_addr arm64: use linux/sizes.h for constants arm64: uaccess: consistently check object sizes perf: add qcom l2 cache perf events driver arm64: remove wrong CONFIG_PROC_SYSCTL ifdef ARM: smccc: Update HVC comment to describe new quirk parameter arm64: do not trace atomic operations ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device() ACPI/IORT: Fix iort_node_get_id() mapping entries indexing arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA perf: xgene: Include module.h ...
2017-02-22 | Merge tag 'arc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc | Linus Torvalds | 1 | -0/+17

Pull ARC updates from Vineet Gupta:
 - Intc improvements [Yuriy]
 - VDK platform updates [Alexey]

* tag 'arc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: [plat-*] ARC_HAS_COH_CACHES no longer relevant
  ARCv2: intc: Delete useless comments in Device Trees
  ARCv2: IDU-intc: Delete deprecated parameters in Device Trees
  ARCv2: IDU-intc: mask all common interrupts by default
  ARCv2: IDU-intc: Use build registers for getting numbers of interrupts
  ARCv2: intc: Set default priority for all core interrupts
  ARCv2: intc: Use runtime value of irq count for setting up intc
  ARCv2: intc: Rework the build time irq count information
  ARC: [intc-*]: confine NR_CPU_IRQS to intc code
  ARCv2: intc: Use ARC_REG_STATUS32 for addressing STATUS32 reg
  arc: vdk: Add support of UIO
  arc: vdk: Add support of MMC controller
  arc: vdk: Disable halt on reset
2017-02-22 | Merge tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux | Linus Torvalds | 2 | -0/+7

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for direct mapped LPC on POWER9, giving Linux direct access to devices that may be on there such as a UART.

   - Memory hotplug support for the Power9 Radix MMU.

   - Add new AUX vectors describing the processor's cache geometry, to be used by glibc.

   - The ability for a guest to ask the hypervisor to resize the guest's hash table, and in addition support for doing so automatically when memory is hotplugged into/out-of the guest. This allows the hash table to be sized based on the current memory usage of the guest, rather than the maximum possible memory usage.

   - Implementation of optprobes (kprobe optimisation) for powerpc.

  In addition there's the topic branch shared with the KVM tree, which includes support for guests to use the Radix MMU on Power9.

  Thanks to: Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun"

* tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits)
  powerpc/mm/radix: Skip ptesync in pte update helpers
  powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm
  powerpc/mm/radix: Update pte update sequence for pte clear case
  powerpc/mm: Update PROTFAULT handling in the page fault path
  powerpc/xmon: Fix data-breakpoint
  powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y
  powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
  powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n
  powerpc/pseries: Fix typo in parameter description
  powerpc/kprobes: Remove kprobe_exceptions_notify()
  kprobes: Introduce weak variant of kprobe_exceptions_notify()
  powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL
  powerpc/powernv: Fix opal_exit tracepoint opcode
  powerpc: Add a prototype for mcount() so it can be versioned
  powerpc: Drop GPL from of_node_to_nid() export to match other arches
  powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()
  powerpc/kprobes: Implement Optprobes
  powerpc/kprobes: Fixes for kprobe_lookup_name() on BE
  powerpc: Add helper to check if offset is within relative branch range
  powerpc/bpf: Introduce __PPC_SH64()
  ...