|
Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull Writeback error handling fixes from Jeff Layton:
"The main rationale for all of these changes is to tighten up writeback
error reporting to userland. There are many ways now that writeback
errors can be lost, such that fsync/fdatasync/msync return 0 when
writeback actually failed.
This pile contains a small set of cleanups and writeback error
handling fixes that I was able to break off from the main pile (#2).
Two of the patches in this pile are trivial. The exceptions are the
patch to fix up error handling in write_one_page, and the patch to
make JFS pay attention to write_one_page errors"
* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove call_fsync helper function
mm: clean up error handling in write_one_page
JFS: do not ignore return code from write_one_page()
mm: drop "wait" parameter from write_one_page()
|
|
All callers of write_one_page() set the "wait" parameter to 1, so drop it.
Also, make it clear that this function will not set any sort of AS_*
error, and that the caller must do so if necessary. No existing caller
uses this on normal files, so none of them need it.
Also, add __must_check here since, in general, the callers need to handle
an error here in some fashion.
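A minimal sketch of the resulting interface and a hypothetical caller (illustrative; the real declaration lives in include/linux/mm.h, and the error bookkeeping shown is just one option):
/* always waits for writeback; returns a negative errno on failure */
int __must_check write_one_page(struct page *page);
/* hypothetical caller: the error must be checked and, if needed,
 * recorded on the mapping by the caller itself */
static int flush_one(struct page *page)
{
        int err = write_one_page(page);
        if (err)
                mapping_set_error(page->mapping, err);
        return err;
}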
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
which is 256kB, or stack strings sized by MAX_ARG_STRLEN.
This is especially dangerous for suid binaries run with no stack size limit,
because such applications can be tricked into consuming a large portion of the
stack, and a single glibc call can then jump over the guard page. These
attacks are, unfortunately, not merely theoretical.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack surface a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
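A condensed sketch of the new helper and the gap default (roughly as they appear in the patch; clamping details may differ slightly):
/* default gap: 256 pages, i.e. 1MB with 4k pages; tunable via stack_guard_gap= */
unsigned long stack_guard_gap = 256UL << PAGE_SHIFT;
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
        unsigned long vm_start = vma->vm_start;
        if (vma->vm_flags & VM_GROWSDOWN) {
                vm_start -= stack_guard_gap;
                if (vm_start > vma->vm_start)   /* guard against underflow */
                        vm_start = 0;
        }
        return vm_start;
}
vm_end_gap() is the mirror image for VM_GROWSUP stacks.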
Original-patch-by: Oleg Nesterov <[email protected]>
Original-patch-by: Michal Hocko <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Tested-by: Helge Deller <[email protected]> # parisc
Signed-off-by: Linus Torvalds <[email protected]>
|
|
KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
special case. (check_user_page_hwpoison())
When huge pages are involved, this doesn't work so well.
get_user_pages() calls follow_hugetlb_page(), which stops early if it
receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
-EFAULT to the caller. The step to map this to -EHWPOISON based on the
FOLL_ flags is missing. The hwpoison special case is skipped, and
-EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page() and follow_hugetlb_page().
With this, KVM works as expected.
This isn't a problem for arm64 today as we haven't enabled
MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
too, so I think this should be a fix. This doesn't apply earlier than
stable's v4.11.1 due to all sorts of cleanup.
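The shared helper looks roughly like this (sketch of the include/linux/mm.h addition):
static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
{
        if (vm_fault & VM_FAULT_OOM)
                return -ENOMEM;
        if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
        if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                return -EFAULT;
        return 0;
}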
[[email protected]: add vm_fault_to_errno() call to faultin_page(), as suggested]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: James Morse <[email protected]>
Acked-by: Punit Agrawal <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: <[email protected]> [4.11.1+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are many code paths open coding kvmalloc. Let's use the helper
instead. The main difference from kvmalloc is that those users usually do
not consider all the aspects of the memory allocator. E.g. allocation
requests <= 32kB (with 4kB pages) basically never fail and will invoke the
OOM killer to satisfy the allocation. This sounds too disruptive for
something that has a reasonable fallback - vmalloc. On the other hand,
those requests might now fall back to vmalloc even when the memory
allocator would have succeeded after several more reclaim/compaction
attempts. There is no guarantee that something like that happens, though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
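A typical conversion looks like this (illustrative before/after, not a specific call site from the series):
void *ptr;
/* before: open-coded fallback */
ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!ptr)
        ptr = vmalloc(size);
/* after */
ptr = kvmalloc(size, GFP_KERNEL);
/* ... use ptr ... */
kvfree(ptr);    /* correct for both kmalloc- and vmalloc-backed memory */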
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]> # Xen bits
Acked-by: Kees Cook <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Andreas Dilger <[email protected]> # Lustre
Acked-by: Christian Borntraeger <[email protected]> # KVM/s390
Acked-by: Dan Williams <[email protected]> # nvdim
Acked-by: David Sterba <[email protected]> # btrfs
Acked-by: Ilya Dryomov <[email protected]> # Ceph
Acked-by: Tariq Toukan <[email protected]> # mlx4
Acked-by: Leon Romanovsky <[email protected]> # mlx5
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Anton Vorontsov <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Santosh Raspatur <[email protected]>
Cc: Hariprasad S <[email protected]>
Cc: Yishai Hadas <[email protected]>
Cc: Oleg Drokin <[email protected]>
Cc: "Yan, Zheng" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about the
underlying semantics of the kmalloc/page allocator, which means that a)
some vmalloc fallbacks are basically unreachable because the kmalloc part
will keep retrying until it succeeds, and b) the page allocator can invoke
really disruptive steps like the OOM killer to move forward, which doesn't
sound appropriate when we consider that the vmalloc fallback is available.
As can be seen, implementing kvmalloc requires quite intimate knowledge of
the page allocator and the memory reclaim internals, which strongly
suggests that a helper should be implemented in the memory subsystem
proper.
Most callers I could find have been converted to use the helper instead.
This is patch 6. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well, and Eric Dumazet was not
opposed [2] to converting them.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. The implementation makes sure not to create
excessive memory pressure for > PAGE_SIZE requests (by using __GFP_NORETRY)
and also not to warn about allocation failures. This also rules out the
OOM killer, as vmalloc is a more appropriate fallback than a disruptive,
user-visible action.
This patch also changes some existing users and removes helpers which are
specific to them. In some cases this is not possible (e.g. ext4_kvmalloc,
libcfs_kvzalloc) because those seem to be broken and require a GFP_NO{FS,IO}
context which is not vmalloc compatible in general (note that page table
allocation uses GFP_KERNEL). Those need to be fixed separately.
While we are at it, document the unsupported gfp mask handling in
__vmalloc{_node}, because there seems to be a lot of confusion out there.
kvmalloc_node will warn about flags that are not compatible with GFP_KERNEL
(i.e. not a superset of it) to catch new abusers. Existing ones would have
to die slowly.
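A condensed sketch of the core logic (the real helper is kvmalloc_node() in mm/util.c, which is node-aware and handles a few more flag combinations than shown here):
void *kvmalloc(size_t size, gfp_t flags)
{
        gfp_t kmalloc_flags = flags;
        void *ret;
        /* vmalloc itself allocates with GFP_KERNEL (e.g. for page tables),
         * so only GFP_KERNEL-compatible flags are supported */
        WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
        /* do not trigger the OOM killer or failure warnings for larger
         * requests - we have a fallback */
        if (size > PAGE_SIZE)
                kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;
        return __vmalloc(size, flags, PAGE_KERNEL);
}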
[[email protected]: f2fs fixup]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]> [ext4 part]
Acked-by: Vlastimil Babka <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On SPARSEMEM systems page poisoning is enabled after buddy is up,
because of the dependency on page extension init. This causes the pages
released by free_all_bootmem not to be poisoned. This either delays or
misses the identification of some issues because the pages have to
undergo another cycle of alloc-free-alloc for any corruption to be
detected.
Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
flag. Since all the free pages will now be poisoned, the flag need not
be verified before checking the poison during an alloc.
[[email protected]: fix Kconfig]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vinayak Menon <[email protected]>
Acked-by: Laura Abbott <[email protected]>
Tested-by: Laura Abbott <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Akinobu Mita <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
reference to fix pmem crash
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_pages_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <[email protected]>
Tested-by: Kirill Shutemov <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Yang Li has reported that drain_all_pages triggers a WARN_ON which means
that this function is called earlier than the mm_percpu_wq is
initialized on arm64 with CMA configured:
WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
Hardware name: Freescale Layerscape 2088A RDB Board (DT)
task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
PC is at drain_all_pages+0x244/0x25c
LR is at start_isolate_page_range+0x14c/0x1f0
[...]
drain_all_pages+0x244/0x25c
start_isolate_page_range+0x14c/0x1f0
alloc_contig_range+0xec/0x354
cma_alloc+0x100/0x1fc
dma_alloc_from_contiguous+0x3c/0x44
atomic_pool_init+0x7c/0x208
arm64_dma_init+0x44/0x4c
do_one_initcall+0x38/0x128
kernel_init_freeable+0x1a0/0x240
kernel_init+0x10/0xfc
ret_from_fork+0x10/0x20
Fix this by moving the whole setup_vmstat which is an initcall right now
to init_mm_internals which will be called right after the WQ subsystem
is initialized.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Yang Li <[email protected]>
Tested-by: Yang Li <[email protected]>
Tested-by: Xiaolong Ye <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
get_user_pages_fast() function
This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.
Prepare generic GUP_fast() to handle dev_pagemap(). At the moment, it's
only implemented on x86. On non-x86, the new code will be compiled out.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aneesh Kumar K . V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dann Frazier <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Convert all non-architecture-specific code to 5-level paging.
It's mostly mechanical: add handling of one more page table level in
places where we deal with pud_t.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We are going to switch core MM to 5-level paging abstraction.
This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows us to quickly make all
architectures compatible with 5-level paging in core MM.
In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
linux/mm.h, since they are already provided in linux/shmem_fs.h. But
shmem_fs.h must then provide the inline stub for shmem_mapping() when
CONFIG_SHMEM is not set, and a few more C files now need to #include it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If madvise(2) advice will result in the underlying vma being split and
the number of areas mapped by the process will exceed
/proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.
EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
is temporarily unavailable. It indicates that userspace should retry
the advice in the near future. This is important for advice such as
MADV_DONTNEED which is often used by malloc implementations to free
memory back to the system: we really do want to free memory back when
madvise(2) returns EAGAIN because slab objects (for vmas, anon_vmas, or
mempolicies) could not be allocated.
Hitting the /proc/sys/vm/max_map_count limit is not a temporary failure,
however, so return ENOMEM to indicate this is a more serious issue. A
followup patch to the man page will specify this behavior.
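A hypothetical userspace caller illustrating why the distinction matters (the helper name and return convention are made up for this example):
#include <sys/mman.h>
#include <errno.h>
/* e.g. a malloc implementation returning freed memory to the kernel */
static int release_range(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_DONTNEED) == 0)
                return 0;
        if (errno == EAGAIN)
                return 1;   /* transient resource shortage: retry shortly */
        return -1;          /* e.g. ENOMEM: vm.max_map_count would be exceeded; retrying won't help */
}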
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.
Since there might be different uffd contexts for the affected VMAs, we
should first create a temporary representation of the unmap event for
each uffd context and then deliver them one by one to the appropriate
userfault file descriptors.
The event notification occurs after the mmap_sem has been released.
[[email protected]: fix nommu build]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix nommu build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flags, it has
been somewhat painful to get the flags set and removed at the correct
locations. More than one kernel oops was introduced due to difficulties
in getting the placement correct.
Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry. This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.
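Roughly, the interface becomes (sketch; exact definitions per the patch):
enum page_entry_size {
        PE_SIZE_PTE = 0,
        PE_SIZE_PMD,
        PE_SIZE_PUD,
};
struct vm_operations_struct {
        /* ... */
        int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
        /* ... */
};
A handler can then switch on pe_size and fall back to smaller sizes without juggling flag bits in vmf->flags.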
Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <[email protected]>
Tested-by: Dan Williams <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current transparent hugepage code only supports PMDs. This patch
adds support for transparent use of PUDs with DAX. It does not include
support for anonymous pages. x86 support code also added.
Most of this patch simply parallels the work that was done for huge
PMDs. The only major difference is how the new ->pud_entry method in
mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry. The pagewalk code takes care of locking the PUD before
calling ->pud_entry, so handlers do not need to worry about whether the PUD is
stable.
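A sketch of the walker additions (abridged; field layout follows include/linux/mm.h of that era):
struct mm_walk {
        int (*pud_entry)(pud_t *pud, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        int (*pte_entry)(pte_t *pte, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        /* ... other callbacks and walk state ... */
};
Per the description above, walk_page_range() takes the PUD lock before invoking ->pud_entry.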
[[email protected]: fix SMP x86 32bit build for native_pud_clear()]
Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[[email protected]: native_pud_clear missing on i386 build]
Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Tested-by: Alexander Kapshuk <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "1G transparent hugepage support for device dax", v2.
The following series implements support for 1G transparent hugepage on
x86 for device dax. The bulk of the code was written by Matthew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.
Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].
[1]: https://lkml.org/lkml/2016/1/31/52
Comments from Nilesh @ Oracle:
There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:
processes : 10,000
memory : 6TB
pte @ 4K page size: (6TB / 4KB) entries * 8 bytes * 10,000 processes = ~120,000GB
pmd @ 2M page size: 120,000GB / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB
As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.
An argument can be made to convert the applications from process model
to a thread model, but in the real world that may not always be practical.
Hopefully this helps explain the use case where this is valuable.
This patch (of 3):
In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.
[[email protected]: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.
Remove the vma parameter to simplify things.
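Illustrative shape of the change (sketch of the vm_operations_struct methods):
/* before */
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
/* after: handlers reach the vma through vmf->vma */
int (*fault)(struct vm_fault *vmf);
int (*page_mkwrite)(struct vm_fault *vmf);
int (*pfn_mkwrite)(struct vm_fault *vmf);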
[[email protected]: fix ARM build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There's no users of zap_page_range() who wants non-NULL 'details'.
Let's drop it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
detail == NULL would give the same functionality as
.check_swap_entries==true.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The only user of ignore_dirty is oom-reaper. But it doesn't really use
it.
ignore_dirty only has an effect on file pages mapped with a dirty pte. But
the oom-reaper skips shared VMAs, so there's no way we can have a dirty
file pte in them.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
show_mem() allows filtering out node-specific data which is irrelevant
to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is
done in skip_free_areas_node which skips all nodes which are not in the
mems_allowed of the current process. This works most of the time as
expected because the nodemask shouldn't be outside of the allocating
task but there are some exceptions. E.g. memory hotplug might want to
request allocations from outside of the allowed nodes (see
new_node_page).
Get rid of this hardcoded behavior and push the allocation mask down the
show_mem path and use it instead of cpuset_current_mems_allowed. NULL
nodemask is interpreted as cpuset_current_mems_allowed.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
warn_alloc is currently used to report an allocation failure or an
allocation stall. We print some details of the allocation request, like
the gfp mask and the request order, but we do not print the allocation
nodemask, which is also important when debugging the reason for the
allocation failure. We already print the nodemask in the OOM report.
Add nodemask to warn_alloc and print it as well.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On 32-bit powerpc the ELF PLT sections of binaries (built with
--bss-plt, or with a toolchain which defaults to it) look like this:
[17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
[18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
[19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4
Which results in an ELF load header:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000
This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.
Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.
gcc/ld has an option (--secure-plt) to not do this; it is said to incur a
small performance penalty.
Currently, to support 32-bit binaries with the PLT in BSS, the kernel maps
the *entire brk area* with executable rights for all binaries, even
--secure-plt ones.
Stop doing that.
Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.
Test program showing the difference in /proc/$PID/maps:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
int main() {
        char buf[16*1024];
        char *p = malloc(123); /* make "[heap]" mapping appear */
        int fd = open("/proc/self/maps", O_RDONLY);
        int len = read(fd, buf, sizeof(buf));
        write(1, buf, len);
        printf("%p\n", p);
        return 0;
}
Compiled using: gcc -mbss-plt -m32 -Os test.c -otest
Unpatched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10690000-106c0000 rwxp 00000000 00:00 0 [heap]
f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
0x10690008
Patched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10180000-101b0000 rw-p 00000000 00:00 0 [heap]
^^^^ this has changed
f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
0x10180008
The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:
https://lkml.org/lkml/2012/9/30/138
Lightly run-tested.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Denys Vlasenko <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Tested-by: Jason Gunthorpe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Florian Weimer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to
ensure compatibility of a VMA with userfault. Introducing vma_is_shmem()
allows detection of tmpfs-backed VMAs, so that they may be used with
userfaultfd. The current implementation presumes that vma_is_shmem() is
only used by slow-path routines in userfaultfd, therefore it is implemented
out of line rather than as an inline check that would need one of the few
remaining free bits in vm_flags.
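The helper itself is tiny (sketch; a stub returning false is expected when CONFIG_SHMEM is disabled):
bool vma_is_shmem(struct vm_area_struct *vma)
{
        return vma->vm_ops == &shmem_vm_ops;
}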
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The new routine copy_huge_page_from_user() uses kmap_atomic() to map
PAGE_SIZE pages. However, this prevents page faults in the subsequent
call to copy_from_user(). This is OK in the case where the copy is done
with mmap_sem held. However, in other cases we want to allow page faults.
So, add a new argument allow_pagefault to indicate whether the routine
should allow page faults.
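Condensed sketch of the resulting copy loop (the real routine's return-value convention and error handling are simplified here):
static long copy_huge_page_from_user(struct page *dst_page,
                                     const void __user *usr_src,
                                     unsigned int pages_per_huge_page,
                                     bool allow_pagefault)
{
        void *page_kaddr;
        unsigned long i, rc;
        for (i = 0; i < pages_per_huge_page; i++) {
                if (allow_pagefault)
                        page_kaddr = kmap(dst_page + i);        /* may sleep and fault */
                else
                        page_kaddr = kmap_atomic(dst_page + i); /* page faults disabled */
                rc = copy_from_user(page_kaddr,
                                    usr_src + i * PAGE_SIZE, PAGE_SIZE);
                if (allow_pagefault)
                        kunmap(dst_page + i);
                else
                        kunmap_atomic(page_kaddr);
                if (rc)
                        return -EFAULT;
        }
        return 0;
}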
[[email protected]: unmap the correct pointer]
Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda
[[email protected]: kunmap() takes a page*, per Hugh]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Dan Carpenter <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
support
userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
fault time. The data is copied from user space to a newly allocated
huge page. The new routine copy_huge_page_from_user performs this copy.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
[[email protected]: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler. This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.
I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages
to be wrapped by the parent function tracepoints so the code flow is more
easily understood. Having entry and exit tracepoints for faults also
allows us to easily see what filesystem functions were called during the
fault. These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.
For PMD faults we primarily want to understand the type of mapping, the
fault flags, the faulting address and whether it fell back to 4k faults.
If it fell back to 4k faults the tracepoints should let us understand why.
I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.
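Tracepoint headers follow the standard DECLARE_EVENT_CLASS/DEFINE_EVENT pattern. The sketch below is illustrative only: the class and event names match the description above, but the prototype and recorded fields are assumptions (the real events record considerably more detail, as the example output below shows).
#undef TRACE_SYSTEM
#define TRACE_SYSTEM fs_dax
#if !defined(_TRACE_FS_DAX_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_FS_DAX_H
#include <linux/tracepoint.h>
DECLARE_EVENT_CLASS(dax_pmd_fault_class,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result),
        TP_STRUCT__entry(
                __field(unsigned long, ino)
                __field(unsigned long, address)
                __field(int, result)
        ),
        TP_fast_assign(
                __entry->ino = inode->i_ino;
                __entry->address = vmf->address;
                __entry->result = result;
        ),
        TP_printk("ino %#lx address %#lx result %d",
                  __entry->ino, __entry->address, __entry->result)
);
DEFINE_EVENT(dax_pmd_fault_class, dax_pmd_fault,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result));
DEFINE_EVENT(dax_pmd_fault_class, dax_pmd_fault_done,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result));
#endif /* _TRACE_FS_DAX_H */
#include <trace/define_trace.h>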
Here is an example output for these events from a successful PMD fault:
big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
- Errata workarounds for Qualcomm's Falkor CPU
- Qualcomm L2 Cache PMU driver
- Qualcomm SMCCC firmware quirk
- Support for DEBUG_VIRTUAL
- CPU feature detection for userspace via MRS emulation
- Preliminary work for the Statistical Profiling Extension
- Misc cleanups and non-critical fixes
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits)
arm64/kprobes: consistently handle MRS/MSR with XZR
arm64: cpufeature: correctly handle MRS to XZR
arm64: traps: correctly handle MRS/MSR with XZR
arm64: ptrace: add XZR-safe regs accessors
arm64: include asm/assembler.h in entry-ftrace.S
arm64: fix warning about swapper_pg_dir overflow
arm64: Work around Falkor erratum 1003
arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2
arm64: arch_timer: document Hisilicon erratum 161010101
arm64: use is_vmalloc_addr
arm64: use linux/sizes.h for constants
arm64: uaccess: consistently check object sizes
perf: add qcom l2 cache perf events driver
arm64: remove wrong CONFIG_PROC_SYSCTL ifdef
ARM: smccc: Update HVC comment to describe new quirk parameter
arm64: do not trace atomic operations
ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device()
ACPI/IORT: Fix iort_node_get_id() mapping entries indexing
arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA
perf: xgene: Include module.h
...
|
|
Certain architectures may have the kernel image mapped separately to
alias the linear map. Introduce a macro lm_alias to translate a kernel
image symbol into its linear alias. This is used in part with work to
add CONFIG_DEBUG_VIRTUAL support for arm64.
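The macro itself is small (sketch of the include/linux/mm.h addition; architectures can provide their own definition):
#ifndef lm_alias
#define lm_alias(x)     __va(__pa_symbol(x))
#endif
So, e.g., lm_alias(empty_zero_page) yields the linear-map address of a symbol that was referenced via its kernel-image alias.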
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Signed-off-by: Laura Abbott <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
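Condensed sketch of the new PMD handling; the helper name and factoring are hypothetical - in the actual patch this logic sits inline in dax_mapping_entry_mkclean() under the PMD lock, with the pfn lookup omitted here:
static void dax_clean_pmd(struct vm_area_struct *vma, unsigned long address,
                          pmd_t *pmdp, unsigned long pfn)
{
        pmd_t pmd;
        if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
                return;
        flush_cache_page(vma, address, pfn);
        pmd = pmdp_huge_clear_flush(vma, address, pmdp);
        pmd = pmd_wrprotect(pmd);
        pmd = pmd_mkclean(pmd);
        set_pmd_at(vma->vm_mm, address, pmdp, pmd);
}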
Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Write protect DAX PMDs in *sync path".
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss, as detailed in patch 2.
This series is based on Dan's "libnvdimm-pending" branch, which is the
current home for Jan's "dax: Page invalidation fixes" series. You can
find a working tree here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
This patch (of 2):
Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
huge page PMD leaf to be found and returned.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.
This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clear the bit.
The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).
This improves the performance of page lock intensive microbenchmarks by
2-3%.
Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).
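The essence of the fast path (condensed sketch of mm/filemap.c; memory barriers and the actual wake helper are simplified):
static void wake_up_page(struct page *page, int bit)
{
        /* PG_waiters lives in page->flags, the word the unlock path just
         * modified, so this test avoids loading the hashed waitqueue's
         * cacheline when nobody is waiting. */
        if (!PageWaiters(page))
                return;
        wake_up_page_bit(page, bit);
}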
Signed-off-by: Nicholas Piggin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Bob Peterson <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: Andrew Lutomirski <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge more updates from Andrew Morton:
- a few misc things
- kexec updates
- DMA-mapping updates to better support networking DMA operations
- IPC updates
- various MM changes to improve DAX fault handling
- lots of radix-tree changes, mainly to the test suite. All leading up
to reimplementing the IDA/IDR code to be a wrapper layer over the
radix-tree. However the final trigger-pulling patch is held off for
4.11.
* emailed patches from Andrew Morton <[email protected]>: (114 commits)
radix tree test suite: delete unused rcupdate.c
radix tree test suite: add new tag check
radix-tree: ensure counts are initialised
radix tree test suite: cache recently freed objects
radix tree test suite: add some more functionality
idr: reduce the number of bits per level from 8 to 6
rxrpc: abstract away knowledge of IDR internals
tpm: use idr_find(), not idr_find_slowpath()
idr: add ida_is_empty
radix tree test suite: check multiorder iteration
radix-tree: fix replacement for multiorder entries
radix-tree: add radix_tree_split_preload()
radix-tree: add radix_tree_split
radix-tree: add radix_tree_join
radix-tree: delete radix_tree_range_tag_if_tagged()
radix-tree: delete radix_tree_locate_item()
radix-tree: improve multiorder iterators
btrfs: fix race in btrfs_free_dummy_fs_info()
radix-tree: improve dump output
radix-tree: make radix_tree_find_next_bit more useful
...
|
|
DAX will need to implement its own version of page_check_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Provide a helper function for finishing write faults due to PTE being
read-only. The helper will be used by DAX to avoid the need to complicate
generic MM code with DAX locking specifics.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move final handling of COW faults from generic code into DAX fault
handler. That way generic code doesn't have to be aware of
peculiarities of DAX locking so remove that knowledge and make locking
functions private to fs/dax.c.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce finish_fault() as a helper function for finishing page faults.
It is a rather thin wrapper around alloc_set_pte(), but since we'd want to
call this from DAX code or filesystems, it is still useful to avoid some
boilerplate code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "dax: Clear dirty bits after flushing caches", v5.
Patchset to clear dirty bits from radix tree of DAX inodes when caches
for corresponding pfns have been flushed. In principle, these patches
enable handlers to easily update PTEs and do other work necessary to
finish the fault without duplicating the functionality present in the
generic code. I'd like to thank Kirill and Ross for reviews of the
series!
This patch (of 20):
To allow full handling of COW faults add memcg field to struct vm_fault
and a return value of ->fault() handler meaning that COW fault is fully
handled and memcg charge must not be canceled. This will allow us to
remove knowledge about special DAX locking from the generic fault code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault.
This also allows us to save some passing of extra arguments around.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Every single user of vmf->virtual_address cast that entry to unsigned
long before doing anything with it, so the type of virtual_address does
not really provide us any additional safety. Just use the masked
vmf->address, which already has the appropriate type.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently we have two different structures for passing fault information
around - struct vm_fault and struct fault_env. DAX will need more
information in struct vm_fault to handle its faults, so the content of
that structure would become even closer to fault_env. Furthermore it
would need to generate struct fault_env to be able to call some of the
generic functions. So at this point I don't think there's much use in
keeping these two structures separate. Just embed into struct vm_fault
all that is needed to use it for both purposes.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Unexport the low-level __get_user_pages_unlocked() function and replace
invocations with calls to more appropriate higher-level functions.
In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
with get_user_pages_unlocked() since we can now pass gup_flags.
In async_pf_execute() and process_vm_rw_single_vec() we need to pass
different tsk, mm arguments so get_user_pages_remote() is the sane
replacement in these cases (having added manual acquisition and release
of mmap_sem.)
Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
flag. However, this flag was originally silently dropped by commit
1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
appears to have been unintentional and reintroducing it is therefore not
an issue.
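The replacement pattern for the foreign-mm callers looks roughly like this (illustrative, not the exact diff; the wrapper name and arguments are placeholders):
static long pin_remote_pages(struct task_struct *tsk, struct mm_struct *mm,
                             unsigned long start, unsigned long nr_pages,
                             unsigned int gup_flags, struct page **pages)
{
        long pinned;
        down_read(&mm->mmap_sem);
        pinned = get_user_pages_remote(tsk, mm, start, nr_pages,
                                       gup_flags, pages, NULL, NULL);
        up_read(&mm->mmap_sem);
        return pinned;
}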
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquired mmap_sem).
This patch (of 2):
Add an int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for its
ability to allow VM_FAULT_RETRY behaviour; this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[[email protected]: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file. This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.
As the only way to read such an mm is through access_process_vm, spin off
a variant called ptrace_access_vm that will fail if the target process is
not being ptraced by the current process, or if the current process did
not have sufficient privileges to read the target process's mm when
ptracing began.
In the ptrace implementations replace access_process_vm by
ptrace_access_vm. There remain several ptrace sites that still use
access_process_vm, as they are reading the target executable's
instructions (for kernel consumption) or register stacks. As such it
does not appear necessary to add a permission check to those calls.
This bug has always existed in Linux.
Fixes: v1.0
Cc: [email protected]
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
This patch unexports the low-level __get_user_pages() function.
Recent refactoring of the get_user_pages* functions allows flags to be
passed through get_user_pages(), which eliminates the need for access to
this function from its one user, kvm.
We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:
get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|