Age | Commit message (Collapse) | Author | Files | Lines |
|
One consequence of the reorganization of struct printf_spec to make
field_width 24 bits was that number() gained about 180 bytes. Since
spec is never passed to other functions, we can help gcc make number()
lose most of that extra weight by using local variables for the field
width and precision.
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Maurizio Lombardi reported a problem [1] with the %pb extension: It
doesn't work for sufficiently large bitmaps, since the size is stashed
in the field_width field of the struct printf_spec, which is currently
an s16. Concretely, this manifested itself in
/sys/bus/pseudo/drivers/scsi_debug/map being empty, since the bitmap
printer got a size of 0, which is the 16 bit truncation of the actual
bitmap size.
We do want to keep struct printf_spec at 8 bytes so that it can cheaply
be passed by value. The qualifier field is only used for internal
bookkeeping in format_decode, so we might as well use a local variable
for that. This gives us an additional 8 bits, which we can then use for
the field width.
To stay in 8 bytes, we need to do a little rearranging and make the type
member a bitfield as well. For consistency, change all the members to
bit fields. gcc doesn't generate much worse code with these changes (in
fact, bloat-o-meter says we save 300 bytes - which I think is a little
surprising).
I didn't find a BUILD_BUG/compiletime_assertion/... which would work
outside function context, so for now I just open-coded it.
[1] http://thread.gmane.org/gmane.linux.kernel/2034835
[[email protected]: avoid open-coded BUILD_BUG_ON]
Signed-off-by: Rasmus Villemoes <[email protected]>
Reported-by: Maurizio Lombardi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the string corresponding to a %s specifier can change under us, we
might end up copying a \0 byte to the output buffer. There might be
callers who expect the output buffer to contain a genuine C string whose
length is exactly the snprintf return value (assuming truncation hasn't
happened or has been checked for).
We can avoid this by only passing over the source string once, stopping
the first time we meet a nul byte (or when we reach the given
precision), and then letting widen_string() handle left/right space
padding. As a small bonus, this code reuse also makes the generated
code slightly smaller.
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is pure code movement, making sure the widen_string() helper is
defined before the string() function.
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull out the logic in dentry_name() which handles field width space
padding, in preparation for reusing it from string(). Rename the
widen() helper to move_right(), since it is used for handling the
!(flags & LEFT) case.
Signed-off-by: Rasmus Villemoes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Maurizio Lombardi <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
@console_may_schedule tracks whether console_sem was acquired through
lock or trylock. If the former, we're inside a sleepable context and
console_conditional_schedule() performs cond_resched(). This allows
console drivers which use console_lock for synchronization to yield
while performing time-consuming operations such as scrolling.
However, the actual console outputting is performed while holding
irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule
before starting outputting lines. Also, only a few drivers call
console_conditional_schedule() to begin with. This means that when a
lot of lines need to be output by console_unlock(), for example on a
console registration, the task doing console_unlock() may not yield for
a long time on a non-preemptible kernel.
If this happens with a slow console devices, for example a serial
console, the outputting task may occupy the cpu for a very long time.
Long enough to trigger softlockup and/or RCU stall warnings, which in
turn pile more messages, sometimes enough to trigger the next cycle of
warnings incapacitating the system.
Fix it by making console_unlock() insert cond_resched() between lines if
@console_may_schedule.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Calvin Owens <[email protected]>
Acked-by: Jan Kara <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Boot consoles are typically replaced by proper consoles during the boot
process. This can be problematic if the boot console data is part of
the init section that is reclaimed late during boot. If the proper
console does not register before this point in time, the boot console
will need to be removed (so that the freed memory is not accessed),
leaving the system without output for some time.
There are various reasons why the proper console may not register early
enough, such as deferred probe or the driver being a loadable module.
If that happens, there is some amount of time where no console messages
are visible to the user, which in turn can mean that they won't see
crashes or other potentially useful information.
To avoid this situation, only remove the boot console when it resides in
the init section. Code exists to replace the boot console by the proper
console when it is registered, keeping a seamless transition between the
boot and proper consoles.
Signed-off-by: Thierry Reding <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a helper to check if an object (given an address and a size) is part
of a section (given beginning and end addresses). For convenience, also
provide a helper that performs this check for __init data using the
__init_begin and __init_end limits.
Signed-off-by: Thierry Reding <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
stop_machine.o is only built if CONFIG_SMP=y, so this ifdef always
evaluates to true.
[[email protected]: remove now-unneeded ifdef]
Reported-by: Valentin Rothberg <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
uselib hasn't been used since libc5; glibc does not use it. Deprecate
uselib a bit more, by making the default y only if libc5 was widely used
on the plaform.
This makes arm64 kernel built with defconfig slightly smaller
bloat-o-meter:
add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
function old new delta
kernel_config_data 18164 18162 -2
uselib_flags 20 - -20
padzero 216 192 -24
sys_uselib 380 - -380
load_elf_library 964 - -964
Signed-off-by: Riku Voipio <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
IS_ERR_VALUE() already contains it and so we need to add this only to
the !ptr check. That will allow users of IS_ERR_OR_NULL(), to not add
this compiler flag.
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As illustrated by commit a3afe70b83fd ("[S390] latencytop s390
support."), HAVE_LATENCYTOP_SUPPORT is defined by an architecture to
advertise an implementation of save_stack_trace_tsk.
However, as of 9212ddb5eada ("stacktrace: provide save_stack_trace_tsk()
weak alias") a dummy implementation is provided if STACKTRACE=y. Given
that LATENCYTOP already depends on STACKTRACE_SUPPORT and selects
STACKTRACE, we can remove HAVE_LATENCYTOP_SUPPORT altogether.
Signed-off-by: Will Deacon <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Helge Deller <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As all new_valid_dev() checks have been removed it's time to drop
new_valid_dev() itself.
No functional change.
Signed-off-by: Yaowei Bai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
New_valid_dev() always returns true, so that's unnecessary to perform
new_valid_dev() checks in some filesystems. Most checks of
new_valid_dev() have been removed so let's drop this last one and then
we can remove new_valid_dev() from the source code.
No functional change.
Signed-off-by: Yaowei Bai <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rewrite abs() so that its return type does not depend on the
architecture and no unexpected type conversion happen inside of it. The
only conversion is from unsigned to signed type. char is left as a
return type but treated as a signed type regradless of it's actual
signedness.
With the old version, int arguments were promoted to long and depending
on architecture a long argument might result in s64 or long return type
(which may or may not be the same).
This came after some back and forth with Nicolas. The current macro has
different return type (for the same input type) depending on
architecture which might be midly iritating.
An alternative version would promote to int like so:
#define abs(x) __abs_choose_expr(x, long long, \
__abs_choose_expr(x, long, \
__builtin_choose_expr( \
sizeof(x) <= sizeof(int), \
({ int __x = (x); __x<0?-__x:__x; }), \
((void)0))))
I have no preference but imagine Linus might. :] Nicolas argument against
is that promoting to int causes iconsistent behaviour:
int main(void) {
unsigned short a = 0, b = 1, c = a - b;
unsigned short d = abs(a - b);
unsigned short e = abs(c);
printf("%u %u\n", d, e); // prints: 1 65535
}
Then again, no sane person expects consistent behaviour from C integer
arithmetic. ;)
Note:
__builtin_types_compatible_p(unsigned char, char) is always false, and
__builtin_types_compatible_p(signed char, char) is also always false.
Signed-off-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Wey-Yi Guy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which
should point to nowhere. Redefine them using POISON_POINTER_DELTA
arithmetics to make sure they really point to non-mappable area declared
by the target architecture.
Signed-off-by: Vasily Kulikov <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Solar Designer <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.
In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.
Fixed by only freeing after synchronize_rcu().
Signed-off-by: Martijn Coenen <[email protected]>
Cc: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently memory_failure() doesn't handle non anonymous thp case,
because we can hardly expect the error handling to be successful, and it
can just hit some corner case which results in BUG_ON or something
severe like that. This is also the case for soft offline code, so let's
make it in the same way.
Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
but it's unnecessary because get_any_page() is already called when
running on this code, which takes a refcount of the target page
regardress of the flag. So this patch also removes it.
[[email protected]: fix build]
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
soft_offline_page() has some deeply indented code, that's the sign of
demand for cleanup. So let's do this. No functionality change.
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Page faults can race with fallocate hole punch. If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole. This is not the desired behavior. The race is
difficult to detect in user level code as even in the non-race case, a
page within the hole could be faulted back in before fallocate returns.
If userfaultfd is expanded to support hugetlbfs in the future, this race
will be easier to observe.
If this race is detected and a page is mapped, the remove operation
(remove_inode_hugepages) will unmap the page before removing. The unmap
within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
so that no other faults will be processed until the page is removed.
The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.
[[email protected]: move hugetlb_vmdelete_list()]
Signed-off-by: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
argument end is of type pgoff_t. It was being converted to a vaddr
offset and passed to unmap_hugepage_range. However, end was also being
used as an argument to the vma_interval_tree_foreach controlling loop.
In addition, the conversion of end to vaddr offset was incorrect.
hugetlb_vmtruncate_list is called as part of a file truncate or
fallocate hole punch operation.
When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped. This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings. The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
pages are not properly unmapped during truncate, the following BUG is
hit:
kernel BUG at fs/hugetlbfs/inode.c:428!
In the fallocate hole punch case, this bug could prevent pages from
being unmapped as in the truncate case. However, for hole punch the
result is that unmapped pages will not be removed during the operation.
For hole punch, it is also possible that more pages than desired will be
unmapped. This unnecessary unmapping will cause page faults to
reestablish the mappings on subsequent page access.
Fixes: 1bfad99ab (" hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")Reported-by: Hillf Danton <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: <[email protected]> [4.3]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Both s390 and powerpc have hit the issue of swapoff hanging, when
CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
quite as x86_64 had them. I think it would be much clearer if
HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
determine whether the MEM_SOFT_DIRTY option should be offered, and the
actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.
But won't embark on that change myself: instead make swapoff more
robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That being a no-op,
whether the bit in question is defined as 0 or the asm-generic fallback
is used, unless soft dirty is fully turned on.
Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Cyrill Gorcunov <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Dmitry Vyukov has reported[1] possible deadlock (triggered by his
syzkaller fuzzer):
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&hugetlbfs_i_mmap_rwsem_key);
lock(&mapping->i_mmap_rwsem);
lock(&hugetlbfs_i_mmap_rwsem_key);
lock(&mapping->i_mmap_rwsem);
Both traces points to mm_take_all_locks() as a source of the problem.
It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
mapping->i_mmap_rwsem for hugetlb mapping) vs. i_mmap_rwsem.
huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
and allocator can take i_mmap_rwsem if it hit reclaim. So we need to
take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
rest of VMAs.
The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
[1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
Signed-off-by: Liang Chen <[email protected]>
Signed-off-by: Gavin Guo <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove unused struct zone *z variable which appeared in 86051ca5eaf5
("mm: fix usemap initialization").
Signed-off-by: Alexander Kuleshov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since can_do_mlock only return 1 or 0, so make it boolean.
No functional change.
[[email protected]: update declaration in mm.h]
Signed-off-by: Wang Xiaoqiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Just cleanup, no functional change.
Signed-off-by: Wang Xiaoqiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mem_cgroup_css_from_page()
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use "IS_ALIGNED" to judge the alignment, rather than directly judging.
Signed-off-by: Wang Xiaoqiang <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
allmodconfig produces following warning for me:
WARNING: vmlinux.o(.text.unlikely+0x10314): Section mismatch in reference from the function movable_node_is_enabled() to the variable .meminit.data:movable_node_enabled
The function movable_node_is_enabled() references
the variable __meminitdata movable_node_enabled.
This is often because movable_node_is_enabled lacks a __meminitdata
annotation or the annotation of movable_node_enabled is wrong.
Let's mark the function with __meminit. It fixes the warning.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
By passing a non-null flag we allow fixup_user_fault to retry, which
enables userfaultfd. As during these retries we might drop the mmap_sem
we need to check if that happened and redo the complete chain of
actions.
Signed-off-by: Dominik Dingel <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "Jason J. Herne" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
During Jason's work with postcopy migration support for s390 a problem
regarding gmap faults was discovered.
The gmap code will call fixup_user_fault which will end up always in
handle_mm_fault. Till now we never cared about retries, but as the
userfaultfd code kind of relies on it. this needs some fix.
This patchset does not take care of the futex code. I will now look
closer at this.
This patch (of 2):
With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
faulting we ever unlocked mmap_sem.
This patch brings in the logic to handle retries as well as it cleans up
the current documentation. fixup_user_fault was not having the same
semantics as filemap_fault. It never indicated if a retry happened and
so a caller wasn't able to handle that case. So we now changed the
behaviour to always retry a locked mmap_sem.
Signed-off-by: Dominik Dingel <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "Jason J. Herne" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that the get_user_pages() path knows how to handle dax-pmd mappings,
remove the protections that disabled dax-pmd support.
Tests available from github.com/pmem/ndctl:
make TESTS="lib/test-dax.sh lib/test-mmap.sh" check
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a wide gamut of conditions that can trigger the dax pmd path to
fallback to pte mappings. Ideally we'd have a syscall interface to
determine mapping characteristics after the fact. In the meantime
provide debug messages.
Signed-off-by: Dan Williams <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e. when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().
Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A dax-huge-page mapping while it uses some thp helpers is ultimately not
a transparent huge page. The distinction is especially important in the
get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
from pmd_huge() and pmd_trans_huge() which have slightly different
semantics.
Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
pmd_devmap() checks.
[[email protected]: fix regression in handling mlocked pages in __split_huge_pmd()]
Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use. Unlike get_page() it may fail if the device is, or
is in the process of being, disabled. While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.
devm_memremap_pages() now requires a reference counter to be specified
at init time. For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".
ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list. That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert. This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.
Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown. Towards that goal reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).
Signed-off-by: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Similar to the conversion of vm_insert_mixed() use pfn_t in the
vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
pfn is backed by a devm_memremap_pages() mapping.
Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
_PAGE_DEVMAP to be set in the resulting pte.
There are no functional changes to the gpu drivers as a result of this
conversion.
Signed-off-by: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Airlie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
_PAGE_DEVMAP is a hardware-unused pte bit that will later be used in the
get_user_pages() path to identify pfns backed by the dynamic allocation
established by devm_memremap_pages. Upon seeing that bit the gup path
will lookup and pin the allocation while the pages are in use.
Since the _PAGE_DEVMAP bit is > 32 it must be cast to u64 instead of a
pteval_t to allow pmd_flags() usage in the realmode boot code to build.
Signed-off-by: Dan Williams <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Take into account that the pmd_t type is a array inside a struct, so it
needs two levels of brackets to initialize. Otherwise, a usage of __pmd
generates a warning:
include/linux/mm.h:986:2: warning: missing braces around initializer [-Wmissing-braces]
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Inlude asm/pgtable.h to get the definition for pud_t to fix:
include/linux/hugetlb.h:203:29: error: unknown type name 'pud_t'
Signed-off-by: Dan Williams <[email protected]>
Cc: Liviu Dudau <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Switch avr32/include/asm/page.h to use the common defintions for
pfn_to_page(), page_to_pfn(), and ARCH_PFN_OFFSET.
Signed-off-by: Dan Williams <[email protected]>
Cc: Haavard Skinnemoen <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.
[[email protected]: mn10300: declare __pfn_to_phys() to fix build error]
Signed-off-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
Signed-off-by: Dan Williams <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:
1/ We want to augment vmemmap_populate() (called via arch_add_memory())
to allocate memmap storage from pre-allocated pages reserved by the
device driver. At vmemmap_alloc_block_buf() time it grabs device pages
rather than page allocator pages. This is in support of
devm_memremap_pages() mappings where the memmap is too large to fit in
main memory (i.e. large persistent memory devices).
2/ Taking a reference against the mapping when inserting device pages
into the address_space radix of a given inode. This facilitates
unmap_mapping_range() and truncate_inode_pages() operations when the
driver is tearing down the mapping.
3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
reference against the mapping so that the driver teardown path can
revoke and drain usage of device pages.
Signed-off-by: Dan Williams <[email protected]>
Tested-by: Logan Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Prevent userspace from trying and failing to online ZONE_DEVICE pages
which are meant to never be onlined.
For example on platforms with a udev rule like the following:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
...will generate futile attempts to online the ZONE_DEVICE sections.
Example kernel messages:
Built 1 zonelists in Node order, mobility grouping on. Total pages: 1004747
Policy zone: Normal
online_pages [mem 0x248000000-0x24fffffff] failed
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags. These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory". Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.
The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target). However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.
Signed-off-by: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|