Age | Commit message (Collapse) | Author | Files | Lines |
|
microblaze architectures can be configured for either little or big endian
formats. Add a choice option for the user to select the correct endian
format(default to big endian).
Also update the Makefile so toolchain can compile for the format it is
configured for.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]> (powerpc)
Cc: Peter Zijlstra <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Define CPU_BIG_ENDIAN or warn for inconsistencies", v3.
While working on enabling queued rwlock on SPARC, found this following
code in include/asm-generic/qrwlock.h which uses CONFIG_CPU_BIG_ENDIAN to
clear a byte.
static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
{
return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
}
Problem is many of the fixed big endian architectures don't define
CPU_BIG_ENDIAN and clears the wrong byte.
Define CPU_BIG_ENDIAN for all the fixed big endian architecture to fix it.
Also found few more references of this config parameter in
drivers/of/base.c
drivers/of/fdt.c
drivers/tty/serial/earlycon.c
drivers/tty/serial/serial_core.c
Be aware that this may cause regressions if someone has worked-around
problems in the above code already. Remove the work-around.
Here is our original discussion
https://lkml.org/lkml/2017/5/24/620
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Babu Moger <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Stafford Horne <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Michael Ellerman <[email protected]> (powerpc)
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
First, number of CPUs can't be negative number.
Second, different signnnedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longed than the same MOV.
Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Where possible, call memset16(), memmove() or memcpy() instead of using
open-coded loops. I don't like the calling convention that uses a byte
count instead of a count of u16s, but it's a little late to change that.
Reduces code size of fbcon.o by almost 400 bytes on my laptop build.
[[email protected]: fix build]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Miller <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memset32() can be used to initialise these three arrays. Minor code
footprint reduction.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
zram was the motivation for creating memset_l(). Minchan Kim sees a 7%
performance improvement on x86 with 100MB of non-zero deduplicatable
data:
perf stat -r 10 dd if=/dev/zram0 of=/dev/null
vanilla: 0.232050465 seconds time elapsed ( +- 0.51% )
memset_l: 0.217219387 seconds time elapsed ( +- 0.07% )
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Tested-by: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Alpha already had an optimised fill-memory-with-16-bit-quantity
assembler routine called memsetw(). It has a slightly different calling
convention from memset16() in that it takes a byte count, not a count of
words. That's the same convention used by ARM's __memset routines, so
rename Alpha's routine to match and add a memset16() wrapper around it.
Then convert Alpha's scr_memsetw() to call memset16() instead of
memsetw().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reuse the existing optimised memset implementation to implement an
optimised memset32 and memset64.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Russell King <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These are single instructions on x86. There's no 64-bit instruction for
x86-32, but we don't yet have any user for memset64() on 32-bit
architectures, so don't bother to implement it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
[[email protected]: minor tweaks]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Multibyte memset variations", v4.
A relatively common idiom we're missing is a function to fill an area of
memory with a pattern which is larger than a single byte. I first
noticed this with a zram patch which wanted to fill a page with an
'unsigned long' value. There turn out to be quite a few places in the
kernel which can benefit from using an optimised function rather than a
loop; sometimes text size, sometimes speed, and sometimes both. The
optimised PowerPC version (not included here) improves performance by
about 30% on POWER8 on just the raw memset_l().
Most of the extra lines of code come from the three testcases I added.
This patch (of 8):
memset16(), memset32() and memset64() are like memset(), but allow the
caller to fill the destination with a value larger than a single byte.
memset_l() and memset_p() allow the caller to use unsigned long and
pointer values respectively.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: David Miller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This macro is useful to avoid link error on 32-bit systems.
We have the same definition in two drivers, so move it to
include/linux/kernel.h
While we are here, refactor DIV_ROUND_UP_ULL() by using
DIV_ROUND_DOWN_ULL().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Mark Brown <[email protected]>
Cc: Cyrille Pitchen <[email protected]>
Cc: Jaroslav Kysela <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Liam Girdwood <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Marek Vasut <[email protected]>
Cc: Brian Norris <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: David Woodhouse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If there are large numbers of hugepages to iterate while reading
/proc/pid/smaps, the page walk never does cond_resched(). On archs
without split pmd locks, there can be significant and observable
contention on mm->page_table_lock which cause lengthy delays without
rescheduling.
Always reschedule in smaps_pte_range() if necessary since the pagewalk
iteration can be expensive.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Save some code from ~320 invocations all clearing last argument.
add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
function old new delta
proc_create - 17 +17
__ksymtab_proc_create - 16 +16
__kstrtab_proc_create - 12 +12
yam_init_driver 301 298 -3
...
cifs_proc_init 249 228 -21
via_fb_pci_probe 2304 2280 -24
Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks")
removed the priv parameter user in is_stack so the argument is
redundant. Drop it.
[[email protected]: remove unused variable]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
VMA and its address bounds checks are too late in this function. They
must have been verified earlier in the page fault sequence. Hence just
remove them.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Free frontswap_map if an error is encountered before enable_swap_info().
Signed-off-by: David Rientjes <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]> [4.12+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If initializing a small swap file fails because the swap file has a
problem (holes, etc.) then we need to free the cluster info as part of
cleanup. Unfortunately a previous patch changed the code to use kvzalloc
but did not change all the vfree calls to use kvfree.
Found by running generic/357 from xfstests.
Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]> [4.12+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We are by error initializing alloc_flags before gfp_allowed_mask is
applied. This could cause problems after pm_restrict_gfp_mask() is called
during suspend operation. Apply gfp_allowed_mask before initializing
alloc_flags so that the first allocation attempt uses correct flags.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
KCMP's KCMP_EPOLL_TFD mode merged in commit 0791e3644e5ef2 ("kcmp: add
KCMP_EPOLL_TFD mode to compare epoll target files") we've had no selftest
for it yet (except in criu development list). Thus add it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
online_mem_sections() accidentally marks online only the first section
in the given range. This is a typo which hasn't been noticed because I
haven't tested large 2GB blocks previously. All users of
pfn_to_online_page would get confused on the the rest of the pfn range
in the block.
All we need to fix this is to use iterator (pfn) rather than start_pfn.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Seen while reading the code, in handle_mm_fault(), in the case
arch_vma_access_permitted() is failing the call to
mem_cgroup_oom_disable() is not made.
To fix that, move the call to mem_cgroup_oom_enable() after calling
arch_vma_access_permitted() as it should not have entered the memcg OOM.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: bae473a423f6 ("mm: introduce fault_env")
Signed-off-by: Laurent Dufour <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We've noticed a quite noticeable performance overhead on some hosts with
significant network traffic when socket memory accounting is enabled.
Perf top shows that socket memory uncharging path is hot:
2.13% [kernel] [k] page_counter_cancel
1.14% [kernel] [k] __sk_mem_reduce_allocated
1.14% [kernel] [k] _raw_spin_lock
0.87% [kernel] [k] _raw_spin_lock_irqsave
0.84% [kernel] [k] tcp_ack
0.84% [kernel] [k] ixgbe_poll
0.83% < workload >
0.82% [kernel] [k] enqueue_entity
0.68% [kernel] [k] __fget
0.68% [kernel] [k] tcp_delack_timer_handler
0.67% [kernel] [k] __schedule
0.60% < workload >
0.59% [kernel] [k] __inet6_lookup_established
0.55% [kernel] [k] __switch_to
0.55% [kernel] [k] menu_select
0.54% libc-2.20.so [.] __memcpy_avx_unaligned
To address this issue, the existing per-cpu stock infrastructure can be
used.
refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
charge to a per-cpu stock instead of calling atomic
page_counter_uncharge().
To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
will explicitly drain the cached charge, if the cached value exceeds
CHARGE_BATCH.
This allows significantly optimize the load:
1.21% [kernel] [k] _raw_spin_lock
1.01% [kernel] [k] ixgbe_poll
0.92% [kernel] [k] _raw_spin_lock_irqsave
0.90% [kernel] [k] enqueue_entity
0.86% [kernel] [k] tcp_ack
0.85% < workload >
0.74% perf-11120.map [.] 0x000000000061bf24
0.73% [kernel] [k] __schedule
0.67% [kernel] [k] __fget
0.63% [kernel] [k] __inet6_lookup_established
0.62% [kernel] [k] menu_select
0.59% < workload >
0.59% [kernel] [k] __switch_to
0.57% libc-2.20.so [.] __memcpy_avx_unaligned
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The fadvise() manpage is silent on fadvise()'s effect on memory-based
filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
sysfs, kernfs). The current implementaion of fadvise is mostly a noop
for such filesystems except for FADV_DONTNEED which will trigger
expensive remote LRU cache draining. This patch makes the noop of
fadvise() on such file systems very explicit.
However this change has two side effects for ramfs and one for tmpfs.
First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
pages of ramfs (allocated through read, readahead & read fault) and
tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could
create such clean zero'ed pages for ramfs. This change removes those
possibilities.
One of our generic libraries does fadvise(FADV_DONTNEED). Recently we
observed high latency in fadvise() and noticed that the users have
started using tmpfs files and the latency was due to expensive remote
LRU cache draining. For normal tmpfs files (have data written on them),
fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache
draining.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
zs_stat_inc/dec/get() uses enum zs_stat_type for the stat type, however
some callers pass an enum fullness_group value. Change the type to int to
reflect the actual use of the functions and get rid of 'enum-conversion'
warnings
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthias Kaehlcke <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Doug Anderson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
page_zone_id() is a specialized function to compare the zone for the pages
that are within the section range. If the section of the pages are
different, page_zone_id() can be different even if their zone is the same.
This wrong usage doesn't cause any actual problem since
__munlock_pagevec_fill() would be called again with failed index.
However, it's better to use more appropriate function here.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To avoid deviation, the per cpu number of NUMA stats in
vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.
Since NUMA stats does not be read by users frequently, and kernel does not
need it to make a decision, it will not be a problem to make the readers
more expensive.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Ying Huang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is significant overhead in cache bouncing caused by zone counters
(NUMA associated counters) update in parallel in multi-threaded page
allocation (suggested by Dave Hansen).
This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2,
as a small threshold greatly increases the update frequency of the global
counter from local per cpu counter(suggested by Ying Huang).
The rationality is that these statistics counters don't affect the
kernel's decision, unlike other VM counters, so it's not a problem to use
a large threshold.
With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per
single page allocation and reclaim on Jesper's page_bench03 benchmark.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
bench
Threshold CPU cycles Throughput(88 threads)
32 799 241760478
64 640 301628829
125 537 358906028 <==> system by default (base)
256 468 412397590
512 428 450550704
4096 399 482520943
20000 394 489009617
30000 395 488017817
65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset
N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Separate NUMA statistics from zone statistics", v2.
Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).
A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang). The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.
With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark. Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).
I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
Threshold CPU cycles Throughput(88 threads)
32 799 241760478
64 640 301628829
125 537 358906028 <==> system by default
256 468 412397590
512 428 450550704
4096 399 482520943
20000 394 489009617
30000 395 488017817
65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset
N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
This patch (of 3):
In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.
E.g. cat /proc/zoneinfo
***Base*** ***With this patch***
nr_free_pages 3976 nr_free_pages 3976
nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
nr_zone_active_anon 0 nr_zone_active_anon 0
nr_zone_inactive_file 0 nr_zone_inactive_file 0
nr_zone_active_file 0 nr_zone_active_file 0
nr_zone_unevictable 0 nr_zone_unevictable 0
nr_zone_write_pending 0 nr_zone_write_pending 0
nr_mlock 0 nr_mlock 0
nr_page_table_pages 0 nr_page_table_pages 0
nr_kernel_stack 0 nr_kernel_stack 0
nr_bounce 0 nr_bounce 0
nr_zspages 0 nr_zspages 0
numa_hit 0 *nr_free_cma 0*
numa_miss 0 numa_hit 0
numa_foreign 0 numa_miss 0
numa_interleave 0 numa_foreign 0
numa_local 0 numa_interleave 0
numa_other 0 numa_local 0
*nr_free_cma 0* numa_other 0
... ...
vm stats threshold: 10 vm stats threshold: 10
... ...
The next patch updates the numa stats counter size and threshold.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemi Wang <[email protected]>
Reported-by: Jesper Dangaard Brouer <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Flags argument has been copied into vmf.flags and it is not changed in
between. Hence a single write access check can be used for both PUD and
PMD.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is an enhancement to avoid a non cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity completed.
The UFFDIO_UNREGISTER would serialize against the handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect the uffd was closed and so the regular page fault paths
should takeover.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While reading the code I found that offset_il_node() has a vm_area_struct
pointer parameter which is unused.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Combinatorial Kconfig is painfull. Withi this patch all below combination
build.
1)
2)
CONFIG_HMM_MIRROR=y
3)
CONFIG_DEVICE_PRIVATE=y
4)
CONFIG_DEVICE_PUBLIC=y
5)
CONFIG_HMM_MIRROR=y
CONFIG_DEVICE_PUBLIC=y
6)
CONFIG_HMM_MIRROR=y
CONFIG_DEVICE_PRIVATE=y
7)
CONFIG_DEVICE_PRIVATE=y
CONFIG_DEVICE_PUBLIC=y
8)
CONFIG_HMM_MIRROR=y
CONFIG_DEVICE_PRIVATE=y
CONFIG_DEVICE_PUBLIC=y
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This moves all new code including new page migration helper behind kernel
Kconfig option so that there is no codee bloat for arch or user that do
not want to use HMM or any of its associated features.
arm allyesconfig (without all the patchset, then with and this patch):
text data bss dec hex filename
83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
83722364 46511131 27582964 157816459 968168b vmlinux
[[email protected]: struct hmm is only use by HMM mirror functionality]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix build (arm multi_v7_defconfig)]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Unlike unaddressable memory, coherent device memory has a real resource
associated with it on the system (as CPU can address it). Add a new
helper to hotplug such memory within the HMM framework.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Reviewed-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion. Add a new type of
ZONE_DEVICE to represent such memory. The use case are the same as for
the un-addressable device memory but without all the corners cases.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This allows callers of migrate_vma() to allocate new page for empty CPU
page table entry (pte_none or back by zero page). This is only for
anonymous memory and it won't allow new page to be instanced if the
userfaultfd is armed.
This is useful to device driver that want to migrate a range of virtual
address and would rather allocate new memory than having to fault later
on.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allow to unmap and restore special swap entry of un-addressable
ZONE_DEVICE memory.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Common case for migration of virtual address range is page are map only
once inside the vma in which migration is taking place. Because we
already walk the CPU page table for that range we can directly do the
unmap there and setup special migration swap entry.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch add a new memory migration helpers, which migrate memory
backing a range of virtual address of a process to different memory (which
can be allocated through special allocator). It differs from numa
migration by working on a range of virtual address and thus by doing
migration in chunk that can be large enough to use DMA engine or special
copy offloading engine.
Expected users are any one with heterogeneous memory where different
memory have different characteristics (latency, bandwidth, ...). As an
example IBM platform with CAPI bus can make use of this feature to migrate
between regular memory and CAPI device memory. New CPU architecture with
a pool of high performance memory not manage as cache but presented as
regular memory (while being faster and with lower latency than DDR) will
also be prime user of this patch.
Migration to private device memory will be useful for device that have
large pool of such like GPU, NVidia plans to use HMM for that.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce a new migration mode that allow to offload the copy to a device
DMA engine. This changes the workflow of migration and not all
address_space migratepage callback can support this.
This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).
No additional per-filesystem migratepage testing is needed. I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch). The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.
Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations. But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.
Usual flow is:
For each page {
1 - lock page
2 - call migratepage() callback
3 - (extra locking in some migratepage() callback)
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
5 - copy page
6 - (unlock any extra lock of migratepage() callback)
7 - return from migratepage() callback
8 - unlock page
}
The new mode MIGRATE_SYNC_NO_COPY:
1 - lock multiple pages
For each page {
2 - call migratepage() callback
3 - abort in all problematic migratepage() callback
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
} // finished all calls to migratepage() callback
5 - DMA copy multiple pages
6 - unlock all the pages
To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.
Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This introduce a dummy HMM device class so device driver can use it to
create hmm_device for the sole purpose of registering device memory. It
is useful to device driver that want to manage multiple physical device
memory under same struct device umbrella.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This introduce a simple struct and associated helpers for device driver to
use when hotpluging un-addressable device memory as ZONE_DEVICE. It will
find a unuse physical address range and trigger memory hotplug for it
which allocates and initialize struct page for the device memory.
Device driver should use this helper during device initialization to
hotplug the device memory. It should only need to remove the memory once
the device is going offline (shutdown or hotremove). There should not be
any userspace API to hotplug memory expect maybe for host device driver to
allow to add more memory to a guest device driver.
Device's memory is manage by the device driver and HMM only provides
helpers to that effect.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
HMM pages (private or public device pages) are ZONE_DEVICE page and thus
need special handling when it comes to lru or refcount. This patch make
sure that memcontrol properly handle those when it face them. Those pages
are use like regular pages in a process address space either as anonymous
page or as file back page. So from memcg point of view we want to handle
them like regular page for now at least.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
HMM pages (private or public device pages) are ZONE_DEVICE page and
thus you can not use page->lru fields of those pages. This patch
re-arrange the uncharge to allow single page to be uncharge without
modifying the lru field of the struct page.
There is no change to memcontrol logic, it is the same as it was
before this patch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A ZONE_DEVICE page that reach a refcount of 1 is free ie no longer have
any user. For device private pages this is important to catch and thus we
need to special case put_page() for this.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory. Reasons for HMM and
migration to device memory is explained with HMM core patch.
This patch deals with device memory that is un-addressable memory (ie CPU
can not access it). Hence we do not want those struct page to be manage
like regular memory. That is why we extend ZONE_DEVICE to support
different types of memory.
A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type. There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type. All specific code
path are protect with test against the memory type.
Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).
The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.
The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU). This callback is responsible to migrate the page back to system
main memory. Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.
If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.
[[email protected]: fix warning]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are new users of memory hotplug emerging. Some of them require
different subset of arch_add_memory. There are some which only require
allocation of struct pages without mapping those pages to the kernel
address space. We currently have __add_pages for that purpose. But this
is rather lowlevel and not very suitable for the code outside of the
memory hotplug. E.g. x86_64 wants to update max_pfn which should be done
by the caller. Introduce add_pages() which should care about those
details if they are needed. Each architecture should define its
implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the
currently existing __add_pages.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Jérôme Glisse <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Evgeny Baskakov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mark Hairgrove <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sherry Cheung <[email protected]>
Cc: Subhash Gutti <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This handles page fault on behalf of device driver, unlike
handle_mm_fault() it does not trigger migration back to system memory for
device memory.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This does not use existing page table walker because we want to share
same code for our page fault handler.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: Evgeny Baskakov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Mark Hairgrove <[email protected]>
Signed-off-by: Sherry Cheung <[email protected]>
Signed-off-by: Subhash Gutti <[email protected]>
Cc: Aneesh Kumar <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Nellans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Bob Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|