aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-09-08fs, proc: unconditional cond_resched when reading smapsDavid Rientjes1-2/+3
If there are large numbers of hugepages to iterate while reading /proc/pid/smaps, the page walk never does cond_resched(). On archs without split pmd locks, there can be significant and observable contention on mm->page_table_lock which cause lengthy delays without rescheduling. Always reschedule in smaps_pte_range() if necessary since the pagewalk iteration can be expensive. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08proc: uninline proc_create()Alexey Dobriyan2-7/+9
Save some code from ~320 invocations all clearing last argument. add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657) function old new delta proc_create - 17 +17 __ksymtab_proc_create - 16 +16 __kstrtab_proc_create - 12 +12 yam_init_driver 301 298 -3 ... cifs_proc_init 249 228 -21 via_fb_pci_probe 2304 2280 -24 Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08fs, proc: remove priv argument from is_stackMichal Hocko2-8/+5
Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks") removed the priv parameter user in is_stack so the argument is redundant. Drop it. [[email protected]: remove unused variable] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced()Anshuman Khandual1-5/+0
VMA and its address bounds checks are too late in this function. They must have been verified earlier in the page fault sequence. Hence just remove them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/swapfile.c: fix swapon frontswap_map memory leak on errorDavid Rientjes1-0/+1
Free frontswap_map if an error is encountered before enable_swap_info(). Signed-off-by: David Rientjes <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: kvfree the swap cluster info if the swap file is unsatisfactoryDarrick J. Wong1-1/+1
If initializing a small swap file fails because the swap file has a problem (holes, etc.) then we need to free the cluster info as part of cleanup. Unfortunately a previous patch changed the code to use kvzalloc but did not change all the vfree calls to use kvfree. Found by running generic/357 from xfstests. Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attemptTetsuo Handa1-1/+2
We are by error initializing alloc_flags before gfp_allowed_mask is applied. This could cause problems after pm_restrict_gfp_mask() is called during suspend operation. Apply gfp_allowed_mask before initializing alloc_flags so that the first allocation attempt uses correct flags. Link: http://lkml.kernel.org/r/[email protected] Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath") Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08tools/testing/selftests/kcmp/kcmp_test.c: add KCMP_EPOLL_TFD testingCyrill Gorcunov1-2/+58
KCMP's KCMP_EPOLL_TFD mode merged in commit 0791e3644e5ef2 ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files") we've had no selftest for it yet (except in criu development list). Thus add it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Cyrill Gorcunov <[email protected]> Cc: Andrey Vagin <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/sparse.c: fix typo in online_mem_sectionsMichal Hocko1-1/+1
online_mem_sections() accidentally marks online only the first section in the given range. This is a typo which hasn't been noticed because I haven't tested large 2GB blocks previously. All users of pfn_to_online_page would get confused on the the rest of the pfn range in the block. All we need to fix this is to use iterator (pfn) rather than start_pfn. Link: http://lkml.kernel.org/r/[email protected] Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes") Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/memory.c: fix mem_cgroup_oom_disable() call missingLaurent Dufour1-5/+5
Seen while reading the code, in handle_mm_fault(), in the case arch_vma_access_permitted() is failing the call to mem_cgroup_oom_disable() is not made. To fix that, move the call to mem_cgroup_oom_enable() after calling arch_vma_access_permitted() as it should not have entered the memcg OOM. Link: http://lkml.kernel.org/r/[email protected] Fixes: bae473a423f6 ("mm: introduce fault_env") Signed-off-by: Laurent Dufour <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: memcontrol: use per-cpu stocks for socket memory unchargingRoman Gushchin1-2/+4
We've noticed a quite noticeable performance overhead on some hosts with significant network traffic when socket memory accounting is enabled. Perf top shows that socket memory uncharging path is hot: 2.13% [kernel] [k] page_counter_cancel 1.14% [kernel] [k] __sk_mem_reduce_allocated 1.14% [kernel] [k] _raw_spin_lock 0.87% [kernel] [k] _raw_spin_lock_irqsave 0.84% [kernel] [k] tcp_ack 0.84% [kernel] [k] ixgbe_poll 0.83% < workload > 0.82% [kernel] [k] enqueue_entity 0.68% [kernel] [k] __fget 0.68% [kernel] [k] tcp_delack_timer_handler 0.67% [kernel] [k] __schedule 0.60% < workload > 0.59% [kernel] [k] __inet6_lookup_established 0.55% [kernel] [k] __switch_to 0.55% [kernel] [k] menu_select 0.54% libc-2.20.so [.] __memcpy_avx_unaligned To address this issue, the existing per-cpu stock infrastructure can be used. refill_stock() can be called from mem_cgroup_uncharge_skmem() to move charge to a per-cpu stock instead of calling atomic page_counter_uncharge(). To prevent the uncontrolled growth of per-cpu stocks, refill_stock() will explicitly drain the cached charge, if the cached value exceeds CHARGE_BATCH. This allows significantly optimize the load: 1.21% [kernel] [k] _raw_spin_lock 1.01% [kernel] [k] ixgbe_poll 0.92% [kernel] [k] _raw_spin_lock_irqsave 0.90% [kernel] [k] enqueue_entity 0.86% [kernel] [k] tcp_ack 0.85% < workload > 0.74% perf-11120.map [.] 0x000000000061bf24 0.73% [kernel] [k] __schedule 0.67% [kernel] [k] __fget 0.63% [kernel] [k] __inet6_lookup_established 0.62% [kernel] [k] menu_select 0.59% < workload > 0.59% [kernel] [k] __switch_to 0.57% libc-2.20.so [.] __memcpy_avx_unaligned Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: fadvise: avoid fadvise for fs without backing deviceShakeel Butt1-3/+3
The fadvise() manpage is silent on fadvise()'s effect on memory-based filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs, sysfs, kernfs). The current implementaion of fadvise is mostly a noop for such filesystems except for FADV_DONTNEED which will trigger expensive remote LRU cache draining. This patch makes the noop of fadvise() on such file systems very explicit. However this change has two side effects for ramfs and one for tmpfs. First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed pages of ramfs (allocated through read, readahead & read fault) and tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could create such clean zero'ed pages for ramfs. This change removes those possibilities. One of our generic libraries does fadvise(FADV_DONTNEED). Recently we observed high latency in fadvise() and noticed that the users have started using tmpfs files and the latency was due to expensive remote LRU cache draining. For normal tmpfs files (have data written on them), fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache draining. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/zsmalloc.c: change stat type parameter to intMatthias Kaehlcke1-3/+6
zs_stat_inc/dec/get() uses enum zs_stat_type for the stat type, however some callers pass an enum fullness_group value. Change the type to int to reflect the actual use of the functions and get rid of 'enum-conversion' warnings Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthias Kaehlcke <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Doug Anderson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/mlock.c: use page_zone() instead of page_zone_id()Joonsoo Kim1-6/+4
page_zone_id() is a specialized function to compare the zone for the pages that are within the section range. If the section of the pages are different, page_zone_id() can be different even if their zone is the same. This wrong usage doesn't cause any actual problem since __munlock_pagevec_fill() would be called again with failed index. However, it's better to use more appropriate function here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: consider the number in local CPUs when reading NUMA statsKemi Wang2-3/+12
To avoid deviation, the per cpu number of NUMA stats in vm_numa_stat_diff[] is included when a user *reads* the NUMA stats. Since NUMA stats does not be read by users frequently, and kernel does not need it to make a decision, it will not be a problem to make the readers more expensive. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Cc: Ying Huang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: update NUMA counter threshold sizeKemi Wang2-20/+11
There is significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (suggested by Dave Hansen). This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter(suggested by Ying Huang). The rationality is that these statistics counters don't affect the kernel's decision, unlike other VM counters, so it's not a problem to use a large threshold. With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default (base) 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Suggested-by: Dave Hansen <[email protected]> Suggested-by: Ying Huang <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: change the call sites of numa statistics itemsKemi Wang5-27/+220
Patch series "Separate NUMA statistics from zone statistics", v2. Each page allocation updates a set of per-zone statistics with a call to zone_statistics(). As discussed in 2017 MM summit, these are a substantial source of overhead in the page allocator and are very rarely consumed. This significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (pointed out by Dave Hansen). A link to the MM summit slides: http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf To mitigate this overhead, this patchset separates NUMA statistics from zone statistics framework, and update NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter (suggested by Ying Huang). The rationality is that these statistics counters don't need to be read often, unlike other VM counters, so it's not a problem to use a large threshold and make readers more expensive. With this patchset, we see 31.3% drop of CPU cycles(537-->369, see below) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Meanwhile, this patchset keeps the same style of virtual memory statistics with little end-user-visible effects (only move the numa stats to show behind zone page stats, see the first patch for details). I did an experiment of single page allocation and reclaim concurrently using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88 processors with 126G memory) with different size of threshold of pcp counter. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics This patch (of 3): In this patch, NUMA statistics is separated from zone statistics framework, all the call sites of NUMA stats are changed to use numa-stats-specific functions, it does not have any functionality change except that the number of NUMA stats is shown behind zone page stats when users *read* the zone info. E.g. cat /proc/zoneinfo ***Base*** ***With this patch*** nr_free_pages 3976 nr_free_pages 3976 nr_zone_inactive_anon 0 nr_zone_inactive_anon 0 nr_zone_active_anon 0 nr_zone_active_anon 0 nr_zone_inactive_file 0 nr_zone_inactive_file 0 nr_zone_active_file 0 nr_zone_active_file 0 nr_zone_unevictable 0 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_zone_write_pending 0 nr_mlock 0 nr_mlock 0 nr_page_table_pages 0 nr_page_table_pages 0 nr_kernel_stack 0 nr_kernel_stack 0 nr_bounce 0 nr_bounce 0 nr_zspages 0 nr_zspages 0 numa_hit 0 *nr_free_cma 0* numa_miss 0 numa_hit 0 numa_foreign 0 numa_miss 0 numa_interleave 0 numa_foreign 0 numa_local 0 numa_interleave 0 numa_other 0 numa_local 0 *nr_free_cma 0* numa_other 0 ... ... vm stats threshold: 10 vm stats threshold: 10 ... ... The next patch updates the numa stats counter size and threshold. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ying Huang <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Tim Chen <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/memory.c: remove reduntant check for write accessAnshuman Khandual1-3/+2
Flags argument has been copied into vmf.flags and it is not changed in between. Hence a single write access check can be used for both PUD and PMD. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08userfaultfd: non-cooperative: closing the uffd without triggering SIGBUSAndrea Arcangeli1-1/+19
This is an enhancement to avoid a non cooperative userfaultfd manager having to unregister all regions before it can close the uffd after all userfaultfd activity completed. The UFFDIO_UNREGISTER would serialize against the handle_userfault by taking the mmap_sem for writing, but we can simply repeat the page fault if we detect the uffd was closed and so the regular page fault paths should takeover. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Acked-by: Mike Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: remove useless vma parameter to offset_il_nodeLaurent Dufour1-4/+3
While reading the code I found that offset_il_node() has a vm_area_struct pointer parameter which is unused. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Laurent Dufour <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm: fix build when HMM is disabledJérôme Glisse1-2/+1
Combinatorial Kconfig is painfull. Withi this patch all below combination build. 1) 2) CONFIG_HMM_MIRROR=y 3) CONFIG_DEVICE_PRIVATE=y 4) CONFIG_DEVICE_PUBLIC=y 5) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PUBLIC=y 6) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PRIVATE=y 7) CONFIG_DEVICE_PRIVATE=y CONFIG_DEVICE_PUBLIC=y 8) CONFIG_HMM_MIRROR=y CONFIG_DEVICE_PRIVATE=y CONFIG_DEVICE_PUBLIC=y Link: http://lkml.kernel.org/r/[email protected] Reported-by: Randy Dunlap <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm: avoid bloating arch that do not make use of HMMJérôme Glisse8-32/+54
This moves all new code including new page migration helper behind kernel Kconfig option so that there is no codee bloat for arch or user that do not want to use HMM or any of its associated features. arm allyesconfig (without all the patchset, then with and this patch): text data bss dec hex filename 83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux 83722364 46511131 27582964 157816459 968168b vmlinux [[email protected]: struct hmm is only use by HMM mirror functionality] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix build (arm multi_v7_defconfig)] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Cc: Dan Williams <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm: add new helper to hotplug CDM memory regionJérôme Glisse2-5/+86
Unlike unaddressable memory, coherent device memory has a real resource associated with it on the system (as CPU can address it). Add a new helper to hotplug such memory within the HMM framework. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Reviewed-by: Balbir Singh <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/device-public-memory: device memory cache coherent with CPUJérôme Glisse14-47/+159
Platform with advance system bus (like CAPI or CCIX) allow device memory to be accessible from CPU in a cache coherent fashion. Add a new type of ZONE_DEVICE to represent such memory. The use case are the same as for the un-addressable device memory but without all the corners cases. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Balbir Singh <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/migrate: allow migrate_vma() to alloc new page on empty entryJérôme Glisse2-9/+205
This allows callers of migrate_vma() to allocate new page for empty CPU page table entry (pte_none or back by zero page). This is only for anonymous memory and it won't allow new page to be instanced if the userfaultfd is armed. This is useful to device driver that want to migrate a range of virtual address and would rather allocate new memory than having to fault later on. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/migrate: support un-addressable ZONE_DEVICE page in migrationJérôme Glisse4-30/+165
Allow to unmap and restore special swap entry of un-addressable ZONE_DEVICE memory. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/migrate: migrate_vma() unmap page from vma while collecting pagesJérôme Glisse1-29/+112
Common case for migration of virtual address range is page are map only once inside the vma in which migration is taking place. Because we already walk the CPU page table for that range we can directly do the unmap there and setup special migration swap entry. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/migrate: new memory migration helper for use with device memoryJérôme Glisse2-0/+596
This patch add a new memory migration helpers, which migrate memory backing a range of virtual address of a process to different memory (which can be allocated through special allocator). It differs from numa migration by working on a range of virtual address and thus by doing migration in chunk that can be large enough to use DMA engine or special copy offloading engine. Expected users are any one with heterogeneous memory where different memory have different characteristics (latency, bandwidth, ...). As an example IBM platform with CAPI bus can make use of this feature to migrate between regular memory and CAPI device memory. New CPU architecture with a pool of high performance memory not manage as cache but presented as regular memory (while being faster and with lower latency than DDR) will also be prime user of this patch. Migration to private device memory will be useful for device that have large pool of such like GPU, NVidia plans to use HMM for that. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPYJérôme Glisse9-15/+86
Introduce a new migration mode that allow to offload the copy to a device DMA engine. This changes the workflow of migration and not all address_space migratepage callback can support this. This is intended to be use by migrate_vma() which itself is use for thing like HMM (see include/linux/hmm.h). No additional per-filesystem migratepage testing is needed. I disables MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i added comment in those to explain why (part of this patch). The commit message is unclear it should say that any callback that wish to support this new mode need to be aware of the difference in the migration flow from other mode. Some of these callbacks do extra locking while copying (aio, zsmalloc, balloon, ...) and for DMA to be effective you want to copy multiple pages in one DMA operations. But in the problematic case you can not easily hold the extra lock accross multiple call to this callback. Usual flow is: For each page { 1 - lock page 2 - call migratepage() callback 3 - (extra locking in some migratepage() callback) 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) 5 - copy page 6 - (unlock any extra lock of migratepage() callback) 7 - return from migratepage() callback 8 - unlock page } The new mode MIGRATE_SYNC_NO_COPY: 1 - lock multiple pages For each page { 2 - call migratepage() callback 3 - abort in all problematic migratepage() callback 4 - migrate page state (freeze refcount, update page cache, buffer head, ...) } // finished all calls to migratepage() callback 5 - DMA copy multiple pages 6 - unlock all the pages To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a new callback migratepages() (for instance) that deals with multiple pages in one transaction. Because the problematic cases are not important for current usage I did not wanted to complexify this patchset even more for no good reason. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memoryJérôme Glisse2-1/+102
This introduce a dummy HMM device class so device driver can use it to create hmm_device for the sole purpose of registering device memory. It is useful to device driver that want to manage multiple physical device memory under same struct device umbrella. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm/devmem: device memory hotplug using ZONE_DEVICEJérôme Glisse2-1/+533
This introduce a simple struct and associated helpers for device driver to use when hotpluging un-addressable device memory as ZONE_DEVICE. It will find a unuse physical address range and trigger memory hotplug for it which allocates and initialize struct page for the device memory. Device driver should use this helper during device initialization to hotplug the device memory. It should only need to remove the memory once the device is going offline (shutdown or hotremove). There should not be any userspace API to hotplug memory expect maybe for host device driver to allow to add more memory to a guest device driver. Device's memory is manage by the device driver and HMM only provides helpers to that effect. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Signed-off-by: Balbir Singh <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/memcontrol: support MEMORY_DEVICE_PRIVATEJérôme Glisse2-4/+49
HMM pages (private or public device pages) are ZONE_DEVICE page and thus need special handling when it comes to lru or refcount. This patch make sure that memcontrol properly handle those when it face them. Those pages are use like regular pages in a process address space either as anonymous page or as file back page. So from memcg point of view we want to handle them like regular page for now at least. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/memcontrol: allow to uncharge page without using page->lru fieldJérôme Glisse1-76/+92
HMM pages (private or public device pages) are ZONE_DEVICE page and thus you can not use page->lru fields of those pages. This patch re-arrange the uncharge to allow single page to be uncharge without modifying the lru field of the struct page. There is no change to memcontrol logic, it is the same as it was before this patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/ZONE_DEVICE: special case put_page() for device private pagesJérôme Glisse4-10/+67
A ZONE_DEVICE page that reach a refcount of 1 is free ie no longer have any user. For device private pages this is important to catch and thus we need to special case put_page() for this. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memoryJérôme Glisse11-6/+309
HMM (heterogeneous memory management) need struct page to support migration from system main memory to device memory. Reasons for HMM and migration to device memory is explained with HMM core patch. This patch deals with device memory that is un-addressable memory (ie CPU can not access it). Hence we do not want those struct page to be manage like regular memory. That is why we extend ZONE_DEVICE to support different types of memory. A persistent memory type is define for existing user of ZONE_DEVICE and a new device un-addressable type is added for the un-addressable memory type. There is a clear separation between what is expected from each memory type and existing user of ZONE_DEVICE are un-affected by new requirement and new use of the un-addressable type. All specific code path are protect with test against the memory type. Because memory is un-addressable we use a new special swap type for when a page is migrated to device memory (this reduces the number of maximum swap file). The main two additions beside memory type to ZONE_DEVICE is two callbacks. First one, page_free() is call whenever page refcount reach 1 (which means the page is free as ZONE_DEVICE page never reach a refcount of 0). This allow device driver to manage its memory and associated struct page. The second callback page_fault() happens when there is a CPU access to an address that is back by a device page (which are un-addressable by the CPU). This callback is responsible to migrate the page back to system main memory. Device driver can not block migration back to system memory, HMM make sure that such page can not be pin into device memory. If device is in some error condition and can not migrate memory back then a CPU page fault to device memory should end with SIGBUS. [[email protected]: fix warning] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/memory_hotplug: introduce add_pagesMichal Hocko3-7/+30
There are new users of memory hotplug emerging. Some of them require different subset of arch_add_memory. There are some which only require allocation of struct pages without mapping those pages to the kernel address space. We currently have __add_pages for that purpose. But this is rather lowlevel and not very suitable for the code outside of the memory hotplug. E.g. x86_64 wants to update max_pfn which should be done by the caller. Introduce add_pages() which should care about those details if they are needed. Each architecture should define its implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the currently existing __add_pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm/mirror: device page fault handlerJérôme Glisse2-12/+271
This handles page fault on behalf of device driver, unlike handle_mm_fault() it does not trigger migration back to system memory for device memory. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm/mirror: helper to snapshot CPU page tableJérôme Glisse2-2/+338
This does not use existing page table walker because we want to share same code for our page fault handler. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm/mirror: mirror process address space on device with HMM helpersJérôme Glisse3-15/+260
This is a heterogeneous memory management (HMM) process address space mirroring. In a nutshell this provide an API to mirror process address space on a device. This boils down to keeping CPU and device page table synchronize (we assume that both device and CPU are cache coherent like PCIe device can be). This patch provide a simple API for device driver to achieve address space mirroring thus avoiding each device driver to grow its own CPU page table walker and its own CPU page table synchronization mechanism. This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more hardware in the future. [[email protected]: fix hmm for "mmu_notifier kill invalidate_page callback"] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm/hmm: heterogeneous memory management (HMM for short)Jérôme Glisse6-1/+249
HMM provides 3 separate types of functionality: - Mirroring: synchronize CPU page table and device page table - Device memory: allocating struct page for device memory - Migration: migrating regular memory to device memory This patch introduces some common helpers and definitions to all of those 3 functionality. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Signed-off-by: Evgeny Baskakov <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Mark Hairgrove <[email protected]> Signed-off-by: Sherry Cheung <[email protected]> Signed-off-by: Subhash Gutti <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08hmm: heterogeneous memory management documentationJérôme Glisse2-0/+391
Patch series "HMM (Heterogeneous Memory Management)", v25. Heterogeneous Memory Management (HMM) (description and justification) Today device driver expose dedicated memory allocation API through their device file, often relying on a combination of IOCTL and mmap calls. The device can only access and use memory allocated through this API. This effectively split the program address space into object allocated for the device and useable by the device and other regular memory (malloc, mmap of a file, share memory, â) only accessible by CPU (or in a very limited way by a device by pinning memory). Allowing different isolated component of a program to use a device thus require duplication of the input data structure using device memory allocator. This is reasonable for simple data structure (array, grid, image, â) but this get extremely complex with advance data structure (list, tree, graph, â) that rely on a web of memory pointers. This is becoming a serious limitation on the kind of work load that can be offloaded to device like GPU. New industry standard like C++, OpenCL or CUDA are pushing to remove this barrier. This require a shared address space between GPU device and CPU so that GPU can access any memory of a process (while still obeying memory protection like read only). This kind of feature is also appearing in various other operating systems. HMM is a set of helpers to facilitate several aspects of address space sharing and device memory management. Unlike existing sharing mechanism that rely on pining pages use by a device, HMM relies on mmu_notifier to propagate CPU page table update to device page table. Duplicating CPU page table is only one aspect necessary for efficiently using device like GPU. GPU local memory have bandwidth in the TeraBytes/ second range but they are connected to main memory through a system bus like PCIE that is limited to 32GigaBytes/second (PCIE 4.0 16x). Thus it is necessary to allow migration of process memory from main system memory to device memory. Issue is that on platform that only have PCIE the device memory is not accessible by the CPU with the same properties as main memory (cache coherency, atomic operations, ...). To allow migration from main memory to device memory HMM provides a set of helper to hotplug device memory as a new type of ZONE_DEVICE memory which is un-addressable by CPU but still has struct page representing it. This allow most of the core kernel logic that deals with a process memory to stay oblivious of the peculiarity of device memory. When page backing an address of a process is migrated to device memory the CPU page table entry is set to a new specific swap entry. CPU access to such address triggers a migration back to system memory, just like if the page was swap on disk. HMM also blocks any one from pinning a ZONE_DEVICE page so that it can always be migrated back to system memory if CPU access it. Conversely HMM does not migrate to device memory any page that is pin in system memory. To allow efficient migration between device memory and main memory a new migrate_vma() helpers is added with this patchset. It allows to leverage device DMA engine to perform the copy operation. This feature will be use by upstream driver like nouveau mlx5 and probably other in the future (amdgpu is next suspect in line). We are actively working on nouveau and mlx5 support. To test this patchset we also worked with NVidia close source driver team, they have more resources than us to test this kind of infrastructure and also a bigger and better userspace eco-system with various real industry workload they can be use to test and profile HMM. The expected workload is a program builds a data set on the CPU (from disk, from network, from sensors, â). Program uses GPU API (OpenCL, CUDA, ...) to give hint on memory placement for the input data and also for the output buffer. Program call GPU API to schedule a GPU job, this happens using device driver specific ioctl. All this is hidden from programmer point of view in case of C++ compiler that transparently offload some part of a program to GPU. Program can keep doing other stuff on the CPU while the GPU is crunching numbers. It is expected that CPU will not access the same data set as the GPU while GPU is working on it, but this is not mandatory. In fact we expect some small memory object to be actively access by both GPU and CPU concurrently as synchronization channel and/or for monitoring purposes. Such object will stay in system memory and should not be bottlenecked by system bus bandwidth (rare write and read access from both CPU and GPU). As we are relying on device driver API, HMM does not introduce any new syscall nor does it modify any existing ones. It does not change any POSIX semantics or behaviors. For instance the child after a fork of a process that is using HMM will not be impacted in anyway, nor is there any data hazard between child COW or parent COW of memory that was migrated to device prior to fork. HMM assume a numbers of hardware features. Device must allow device page table to be updated at any time (ie device job must be preemptable). Device page table must provides memory protection such as read only. Device must track write access (dirty bit). Device must have a minimum granularity that match PAGE_SIZE (ie 4k). Reviewer (just hint): Patch 1 HMM documentation Patch 2 introduce core infrastructure and definition of HMM, pretty small patch and easy to review Patch 3 introduce the mirror functionality of HMM, it relies on mmu_notifier and thus someone familiar with that part would be in better position to review Patch 4 is an helper to snapshot CPU page table while synchronizing with concurrent page table update. Understanding mmu_notifier makes review easier. Patch 5 is mostly a wrapper around handle_mm_fault() Patch 6 add new add_pages() helper to avoid modifying each arch memory hot plug function Patch 7 add a new memory type for ZONE_DEVICE and also add all the logic in various core mm to support this new type. Dan Williams and any core mm contributor are best people to review each half of this patchset Patch 8 special case HMM ZONE_DEVICE pages inside put_page() Kirill and Dan Williams are best person to review this Patch 9 allow to uncharge a page from memory group without using the lru list field of struct page (best reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko) Patch 10 Add support to uncharge ZONE_DEVICE page from a memory cgroup (best reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko) Patch 11 add helper to hotplug un-addressable device memory as new type of ZONE_DEVICE memory (new type introducted in patch 3 of this serie). This is boiler plate code around memory hotplug and it also pick a free range of physical address for the device memory. Note that the physical address do not point to anything (at least as far as the kernel knows). Patch 12 introduce a new hmm_device class as an helper for device driver that want to expose multiple device memory under a common fake device driver. This is usefull for multi-gpu configuration. Anyone familiar with device driver infrastructure can review this. Boiler plate code really. Patch 13 add a new migrate mode. Any one familiar with page migration is welcome to review. Patch 14 introduce a new migration helper (migrate_vma()) that allow to migrate a range of virtual address of a process using device DMA engine to perform the copy. It is not limited to do copy from and to device but can also do copy between any kind of source and destination memory. Again anyone familiar with migration code should be able to verify the logic. Patch 15 optimize the new migrate_vma() by unmapping pages while we are collecting them. This can be review by any mm folks. Patch 16 add unaddressable memory migration to helper introduced in patch 7, this can be review by anyone familiar with migration code Patch 17 add a feature that allow device to allocate non-present page on the GPU when migrating a range of address to device memory. This is an helper for device driver to avoid having to first allocate system memory before migration to device memory Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device memory (CDM) Patch 19 add an helper to hotplug CDM memory Previous patchset posting : v1 http://lwn.net/Articles/597289/ v2 https://lkml.org/lkml/2014/6/12/559 v3 https://lkml.org/lkml/2014/6/13/633 v4 https://lkml.org/lkml/2014/8/29/423 v5 https://lkml.org/lkml/2014/11/3/759 v6 http://lwn.net/Articles/619737/ v7 http://lwn.net/Articles/627316/ v8 https://lwn.net/Articles/645515/ v9 https://lwn.net/Articles/651553/ v10 https://lwn.net/Articles/654430/ v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424 v12 http://www.kernelhub.org/?msg=972982&p=2 v13 https://lwn.net/Articles/706856/ v14 https://lkml.org/lkml/2016/12/8/344 v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html v16 http://www.spinics.net/lists/linux-mm/msg119814.html v17 https://lkml.org/lkml/2017/1/27/847 v18 https://lkml.org/lkml/2017/3/16/596 v19 https://lkml.org/lkml/2017/4/5/831 v20 https://lwn.net/Articles/720715/ v21 https://lkml.org/lkml/2017/4/24/747 v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html v23 https://www.mail-archive.com/[email protected]/msg1404788.html v24 https://lwn.net/Articles/726691/ This patch (of 19): This adds documentation for HMM (Heterogeneous Memory Management). It presents the motivation behind it, the features necessary for it to be useful and and gives an overview of how this is implemented. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérôme Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Nellans <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Evgeny Baskakov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mark Hairgrove <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sherry Cheung <[email protected]> Cc: Subhash Gutti <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: memory_hotplug: memory hotremove supports thp migrationNaoya Horiguchi2-2/+17
This patch enables thp migration for memory hotremove. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: migrate: move_pages() supports thp migrationNaoya Horiguchi1-13/+32
This patch enables thp migration for move_pages(2). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: mempolicy: mbind and migrate_pages support thp migrationNaoya Horiguchi1-29/+79
This patch enables thp migration for mbind(2) and migrate_pages(2). Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: soft-dirty: keep soft-dirty bits over thp migrationNaoya Horiguchi5-15/+92
Soft dirty bit is designed to keep tracked over page migration. This patch makes it work in the same manner for thp migration too. Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: thp: check pmd migration entry in common pathZi Yan9-27/+147
When THP migration is being used, memory management code needs to handle pmd migration entries properly. This patch uses !pmd_present() or is_swap_pmd() (depending on whether pmd_none() needs separate code or not) to check pmd migration entries at the places where a pmd entry is present. Since pmd-related code uses split_huge_page(), split_huge_pmd(), pmd_trans_huge(), pmd_trans_unstable(), or pmd_none_or_trans_huge_or_clear_bad(), this patch: 1. adds pmd migration entry split code in split_huge_pmd(), 2. takes care of pmd migration entries whenever pmd_trans_huge() is present, 3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware. Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable() is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change them. Until this commit, a pmd entry should be: 1. pointing to a pte page, 2. is_swap_pmd(), 3. pmd_trans_huge(), 4. pmd_devmap(), or 5. pmd_none(). Signed-off-by: Zi Yan <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: thp: enable thp migration in generic pathZi Yan7-13/+212
Add thp migration's core code, including conversions between a PMD entry and a swap entry, setting PMD migration entry, removing PMD migration entry, and waiting on PMD migration entries. This patch makes it possible to support thp migration. If you fail to allocate a destination page as a thp, you just split the source thp as we do now, and then enter the normal page migration. If you succeed to allocate destination thp, you enter thp migration. Subsequent patches actually enable thp migration for each caller of page migration by allowing its get_new_page() callback to allocate thps. [[email protected]: fix gcc-4.9.0 -Wmissing-braces warning] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix x86_64 allnoconfig warning] Signed-off-by: Zi Yan <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATIONNaoya Horiguchi3-0/+17
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration functionality to x86_64, which should be safer at the first step. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: thp: introduce separate TTU flag for thp freezingNaoya Horiguchi3-5/+7
TTU_MIGRATION is used to convert pte into migration entry until thp split completes. This behavior conflicts with thp migration added later patches, so let's introduce a new TTU flag specifically for freezing. try_to_unmap() is used both for thp split (via freeze_page()) and page migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for head page is like below (assuming anonymous thp): (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION | TTU_SPLIT_HUGE_PMD) and ttu_flag given for tail pages is: (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \ TTU_MIGRATION) __unmap_and_move() calls try_to_unmap() with ttu_flag: (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS) Now I'm trying to insert a branch for thp migration at the top of try_to_unmap_one() like below static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { ... /* PMD-mapped THP migration entry */ if (!pvmw.pte && (flags & TTU_MIGRATION)) { if (!PageAnon(page)) continue; set_pmd_migration_entry(&pvmw, page); continue; } ... } so try_to_unmap() for tail pages called by thp split can go into thp migration code path (which converts *pmd* into migration entry), while the expectation is to freeze thp (which converts *pte* into migration entry.) I detected this failure as a "bad page state" error in a testcase where split_huge_page() is called from queue_pages_pte_range(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1Naoya Horiguchi2-8/+14
_PAGE_PSE is used to distinguish between a truly non-present (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and should be treated as present. But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would cause confusion between one of those PMDs undergoing a THP split, and a soft-dirty PMD. Dropping _PAGE_PSE check in pmd_present() does not work well, because it can hurt optimization of tlb handling in thp split. Thus, we need to move the bit. In the current kernel, bits 1-4 are not used in non-present format since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1. Bit 7 is used as reserved (always clear), so please don't use it for other purpose. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Zi Yan <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Nellans <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>