|
Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull Writeback error handling fixes from Jeff Layton:
"The main rationale for all of these changes is to tighten up writeback
error reporting to userland. There are many ways now that writeback
errors can be lost, such that fsync/fdatasync/msync return 0 when
writeback actually failed.
This pile contains a small set of cleanups and writeback error
handling fixes that I was able to break off from the main pile (#2).
Two of the patches in this pile are trivial. The exceptions are the
patch to fix up error handling in write_one_page, and the patch to
make JFS pay attention to write_one_page errors"
* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove call_fsync helper function
mm: clean up error handling in write_one_page
JFS: do not ignore return code from write_one_page()
mm: drop "wait" parameter from write_one_page()
|
|
All callers of write_one_page() set the "wait" parameter to 1, so drop it.
Also, make it clear that this function will not set any sort of AS_*
error, and that the caller must do so if necessary. No existing caller
uses this on normal files, so none of them need it.
Also, add __must_check here since, in general, the callers need to handle
an error here in some fashion.
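A minimal sketch of the resulting interface and a hypothetical caller (illustrative; the real declaration lives in include/linux/mm.h, and the error bookkeeping shown is just one option):
/* always waits for writeback; returns a negative errno on failure */
int __must_check write_one_page(struct page *page);
/* hypothetical caller: the error must be checked and, if needed,
 * recorded on the mapping by the caller itself */
static int flush_one(struct page *page)
{
        int err = write_one_page(page);
        if (err)
                mapping_set_error(page->mapping, err);
        return err;
}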
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Layton <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
which is 256kB, or stack strings sized by MAX_ARG_STRLEN.
This is especially dangerous for suid binaries run with no stack size limit,
because such applications can be tricked into consuming a large portion of the
stack, and a single glibc call can then jump over the guard page. These
attacks are, unfortunately, not merely theoretical.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack surface a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
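A condensed sketch of the new helper and the gap default (roughly as they appear in the patch; clamping details may differ slightly):
/* default gap: 256 pages, i.e. 1MB with 4k pages; tunable via stack_guard_gap= */
unsigned long stack_guard_gap = 256UL << PAGE_SHIFT;
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
        unsigned long vm_start = vma->vm_start;
        if (vma->vm_flags & VM_GROWSDOWN) {
                vm_start -= stack_guard_gap;
                if (vm_start > vma->vm_start)   /* guard against underflow */
                        vm_start = 0;
        }
        return vm_start;
}
vm_end_gap() is the mirror image for VM_GROWSUP stacks.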
Original-patch-by: Oleg Nesterov <[email protected]>
Original-patch-by: Michal Hocko <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Tested-by: Helge Deller <[email protected]> # parisc
Signed-off-by: Linus Torvalds <[email protected]>
|
|
KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
special case. (check_user_page_hwpoison())
When huge pages are involved, this doesn't work so well.
get_user_pages() calls follow_hugetlb_page(), which stops early if it
receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
-EFAULT to the caller. The step to map this to -EHWPOISON based on the
FOLL_ flags is missing. The hwpoison special case is skipped, and
-EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page() and follow_hugetlb_page().
With this, KVM works as expected.
This isn't a problem for arm64 today as we haven't enabled
MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
too, so I think this should be a fix. This doesn't apply earlier than
stable's v4.11.1 due to all sorts of cleanup.
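The shared helper looks roughly like this (sketch of the include/linux/mm.h addition):
static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
{
        if (vm_fault & VM_FAULT_OOM)
                return -ENOMEM;
        if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
        if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                return -EFAULT;
        return 0;
}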
[[email protected]: add vm_fault_to_errno() call to faultin_page(), as suggested]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: James Morse <[email protected]>
Acked-by: Punit Agrawal <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: <[email protected]> [4.11.1+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are many code paths open coding kvmalloc. Let's use the helper
instead. The main difference from kvmalloc is that those users usually do
not consider all the aspects of the memory allocator. E.g. allocation
requests <= 32kB (with 4kB pages) basically never fail and will invoke the
OOM killer to satisfy the allocation. This sounds too disruptive for
something that has a reasonable fallback - vmalloc. On the other hand,
those requests might now fall back to vmalloc even when the memory
allocator would have succeeded after several more reclaim/compaction
attempts. There is no guarantee that something like that happens, though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
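A typical conversion looks like this (illustrative before/after, not a specific call site from the series):
void *ptr;
/* before: open-coded fallback */
ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!ptr)
        ptr = vmalloc(size);
/* after */
ptr = kvmalloc(size, GFP_KERNEL);
/* ... use ptr ... */
kvfree(ptr);    /* correct for both kmalloc- and vmalloc-backed memory */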
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]> # Xen bits
Acked-by: Kees Cook <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Andreas Dilger <[email protected]> # Lustre
Acked-by: Christian Borntraeger <[email protected]> # KVM/s390
Acked-by: Dan Williams <[email protected]> # nvdim
Acked-by: David Sterba <[email protected]> # btrfs
Acked-by: Ilya Dryomov <[email protected]> # Ceph
Acked-by: Tariq Toukan <[email protected]> # mlx4
Acked-by: Leon Romanovsky <[email protected]> # mlx5
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Anton Vorontsov <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Santosh Raspatur <[email protected]>
Cc: Hariprasad S <[email protected]>
Cc: Yishai Hadas <[email protected]>
Cc: Oleg Drokin <[email protected]>
Cc: "Yan, Zheng" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about the
underlying semantics of the kmalloc/page allocator, which means that a)
some vmalloc fallbacks are basically unreachable because the kmalloc part
will keep retrying until it succeeds, and b) the page allocator can invoke
really disruptive steps like the OOM killer to move forward, which doesn't
sound appropriate when we consider that the vmalloc fallback is available.
As can be seen, implementing kvmalloc requires quite intimate knowledge of
the page allocator and the memory reclaim internals, which strongly
suggests that a helper should be implemented in the memory subsystem
proper.
Most callers I could find have been converted to use the helper instead.
This is patch 6. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well, and Eric Dumazet was not
opposed [2] to converting them.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. The implementation makes sure not to create
excessive memory pressure for > PAGE_SIZE requests (by using __GFP_NORETRY)
and also not to warn about allocation failures. This also rules out the
OOM killer, as vmalloc is a more appropriate fallback than a disruptive,
user-visible action.
This patch also changes some existing users and removes helpers which are
specific to them. In some cases this is not possible (e.g. ext4_kvmalloc,
libcfs_kvzalloc) because those seem to be broken and require a GFP_NO{FS,IO}
context which is not vmalloc compatible in general (note that page table
allocation uses GFP_KERNEL). Those need to be fixed separately.
While we are at it, document the unsupported gfp mask handling in
__vmalloc{_node}, because there seems to be a lot of confusion out there.
kvmalloc_node will warn about flags that are not compatible with GFP_KERNEL
(i.e. not a superset of it) to catch new abusers. Existing ones would have
to die slowly.
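A condensed sketch of the core logic (the real helper is kvmalloc_node() in mm/util.c, which is node-aware and handles a few more flag combinations than shown here):
void *kvmalloc(size_t size, gfp_t flags)
{
        gfp_t kmalloc_flags = flags;
        void *ret;
        /* vmalloc itself allocates with GFP_KERNEL (e.g. for page tables),
         * so only GFP_KERNEL-compatible flags are supported */
        WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
        /* do not trigger the OOM killer or failure warnings for larger
         * requests - we have a fallback */
        if (size > PAGE_SIZE)
                kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;
        ret = kmalloc(size, kmalloc_flags);
        if (ret || size <= PAGE_SIZE)
                return ret;
        return __vmalloc(size, flags, PAGE_KERNEL);
}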
[[email protected]: f2fs fixup]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]> [ext4 part]
Acked-by: Vlastimil Babka <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On SPARSEMEM systems page poisoning is enabled after buddy is up,
because of the dependency on page extension init. This causes the pages
released by free_all_bootmem not to be poisoned. This either delays or
misses the identification of some issues because the pages have to
undergo another cycle of alloc-free-alloc for any corruption to be
detected.
Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
flag. Since all the free pages will now be poisoned, the flag need not
be verified before checking the poison during an alloc.
[[email protected]: fix Kconfig]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vinayak Menon <[email protected]>
Acked-by: Laura Abbott <[email protected]>
Tested-by: Laura Abbott <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Akinobu Mita <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
reference to fix pmem crash
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_pages_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <[email protected]>
Tested-by: Kirill Shutemov <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Yang Li has reported that drain_all_pages triggers a WARN_ON which means
that this function is called earlier than the mm_percpu_wq is
initialized on arm64 with CMA configured:
WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
Hardware name: Freescale Layerscape 2088A RDB Board (DT)
task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
PC is at drain_all_pages+0x244/0x25c
LR is at start_isolate_page_range+0x14c/0x1f0
[...]
drain_all_pages+0x244/0x25c
start_isolate_page_range+0x14c/0x1f0
alloc_contig_range+0xec/0x354
cma_alloc+0x100/0x1fc
dma_alloc_from_contiguous+0x3c/0x44
atomic_pool_init+0x7c/0x208
arm64_dma_init+0x44/0x4c
do_one_initcall+0x38/0x128
kernel_init_freeable+0x1a0/0x240
kernel_init+0x10/0xfc
ret_from_fork+0x10/0x20
Fix this by moving the whole setup_vmstat which is an initcall right now
to init_mm_internals which will be called right after the WQ subsystem
is initialized.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Yang Li <[email protected]>
Tested-by: Yang Li <[email protected]>
Tested-by: Xiaolong Ye <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
get_user_pages_fast() function
This is a preparation patch for the transition of x86 to the generic GUP_fast()
implementation.
Prepare generic GUP_fast() to handle dev_pagemap(). At the moment, it's
only implemented on x86. On non-x86, the new code will be compiled out.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aneesh Kumar K . V <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dann Frazier <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Convert all non-architecture-specific code to 5-level paging.
It's mostly mechanical: add handling of one more page table level in
places where we deal with pud_t.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We are going to switch core MM to 5-level paging abstraction.
This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows us to quickly make all
architectures compatible with 5-level paging in core MM.
In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
linux/mm.h, since they are already provided in linux/shmem_fs.h. But
shmem_fs.h must then provide the inline stub for shmem_mapping() when
CONFIG_SHMEM is not set, and a few more C files now need to #include it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If madvise(2) advice will result in the underlying vma being split and
the number of areas mapped by the process will exceed
/proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.
EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
is temporarily unavailable. It indicates that userspace should retry
the advice in the near future. This is important for advice such as
MADV_DONTNEED which is often used by malloc implementations to free
memory back to the system: we really do want to free memory back when
madvise(2) returns EAGAIN because slab objects (for vmas, anon_vmas, or
mempolicies) could not be allocated.
Hitting the /proc/sys/vm/max_map_count limit is not a temporary failure,
however, so return ENOMEM to indicate this is a more serious issue. A
followup patch to the man page will specify this behavior.
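A hypothetical userspace caller illustrating why the distinction matters (the helper name and return convention are made up for this example):
#include <sys/mman.h>
#include <errno.h>
/* e.g. a malloc implementation returning freed memory to the kernel */
static int release_range(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_DONTNEED) == 0)
                return 0;
        if (errno == EAGAIN)
                return 1;   /* transient resource shortage: retry shortly */
        return -1;          /* e.g. ENOMEM: vm.max_map_count would be exceeded; retrying won't help */
}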
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Rientjes <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.
Since there might be different uffd contexts for the affected VMAs, we
should first create a temporary representation of the unmap event for
each uffd context and then deliver them one by one to the appropriate
userfault file descriptors.
The event notification occurs after the mmap_sem has been released.
[[email protected]: fix nommu build]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix nommu build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flags, it has
been somewhat painful to get the flags set and removed at the correct
locations. More than one kernel oops was introduced due to difficulties
in getting the placement correct.
Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry. This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.
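Roughly, the interface becomes (sketch; exact definitions per the patch):
enum page_entry_size {
        PE_SIZE_PTE = 0,
        PE_SIZE_PMD,
        PE_SIZE_PUD,
};
struct vm_operations_struct {
        /* ... */
        int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
        /* ... */
};
A handler can then switch on pe_size and fall back to smaller sizes without juggling flag bits in vmf->flags.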
Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <[email protected]>
Tested-by: Dan Williams <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current transparent hugepage code only supports PMDs. This patch
adds support for transparent use of PUDs with DAX. It does not include
support for anonymous pages. x86 support code also added.
Most of this patch simply parallels the work that was done for huge
PMDs. The only major difference is how the new ->pud_entry method in
mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry. The pagewalk code takes care of locking the PUD before
calling ->pud_entry, so handlers do not need to worry about whether the PUD is
stable.
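A sketch of the walker additions (abridged; field layout follows include/linux/mm.h of that era):
struct mm_walk {
        int (*pud_entry)(pud_t *pud, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        int (*pte_entry)(pte_t *pte, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        /* ... other callbacks and walk state ... */
};
Per the description above, walk_page_range() takes the PUD lock before invoking ->pud_entry.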
[[email protected]: fix SMP x86 32bit build for native_pud_clear()]
Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[[email protected]: native_pud_clear missing on i386 build]
Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Tested-by: Alexander Kapshuk <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "1G transparent hugepage support for device dax", v2.
The following series implements support for 1G transparent hugepage on
x86 for device dax. The bulk of the code was written by Matthew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.
Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].
[1]: https://lkml.org/lkml/2016/1/31/52
Comments from Nilesh @ Oracle:
There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:
processes : 10,000
memory : 6TB
pte @ 4K page size: (6TB / 4KB) entries * 8 bytes * 10,000 processes = ~120,000GB
pmd @ 2M page size: 120,000GB / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB
As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.
An argument can be made to convert the applications from process model
to a thread model, but in the real world that may not always be practical.
Hopefully this helps explain the use case where this is valuable.
This patch (of 3):
In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.
[[email protected]: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Nilesh Choudhury <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Jiang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.
Remove the vma parameter to simplify things.
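Illustrative shape of the change (sketch of the vm_operations_struct methods):
/* before */
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
/* after: handlers reach the vma through vmf->vma */
int (*fault)(struct vm_fault *vmf);
int (*page_mkwrite)(struct vm_fault *vmf);
int (*pfn_mkwrite)(struct vm_fault *vmf);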
[[email protected]: fix ARM build]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There's no users of zap_page_range() who wants non-NULL 'details'.
Let's drop it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
detail == NULL would give the same functionality as
.check_swap_entries==true.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The only user of ignore_dirty is oom-reaper. But it doesn't really use
it.
ignore_dirty only has an effect on file pages mapped with a dirty pte. But
the oom-reaper skips shared VMAs, so there's no way we can have a dirty
file pte in them.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
show_mem() allows filtering out node-specific data which is irrelevant
to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is
done in skip_free_areas_node which skips all nodes which are not in the
mems_allowed of the current process. This works most of the time as
expected because the nodemask shouldn't be outside of the allocating
task but there are some exceptions. E.g. memory hotplug might want to
request allocations from outside of the allowed nodes (see
new_node_page).
Get rid of this hardcoded behavior and push the allocation mask down the
show_mem path and use it instead of cpuset_current_mems_allowed. NULL
nodemask is interpreted as cpuset_current_mems_allowed.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
warn_alloc is currently used to report an allocation failure or an
allocation stall. We print some details of the allocation request, like
the gfp mask and the request order, but we do not print the allocation
nodemask, which is also important when debugging the reason for the
allocation failure. We already print the nodemask in the OOM report.
Add nodemask to warn_alloc and print it as well.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On 32-bit powerpc the ELF PLT sections of binaries (built with
--bss-plt, or with a toolchain which defaults to it) look like this:
[17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
[18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
[19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4
Which results in an ELF load header:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000
This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.
Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.
gcc/ld has an option (--secure-plt) to not do this; it is said to incur a
small performance penalty.
Currently, to support 32-bit binaries with the PLT in BSS, the kernel maps
the *entire brk area* with executable rights for all binaries, even
--secure-plt ones.
Stop doing that.
Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.
Test program showing the difference in /proc/$PID/maps:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
int main() {
        char buf[16*1024];
        char *p = malloc(123); /* make "[heap]" mapping appear */
        int fd = open("/proc/self/maps", O_RDONLY);
        int len = read(fd, buf, sizeof(buf));
        write(1, buf, len);
        printf("%p\n", p);
        return 0;
}
Compiled using: gcc -mbss-plt -m32 -Os test.c -otest
Unpatched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10690000-106c0000 rwxp 00000000 00:00 0 [heap]
f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
0x10690008
Patched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10180000-101b0000 rw-p 00000000 00:00 0 [heap]
^^^^ this has changed
f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
0x10180008
The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:
https://lkml.org/lkml/2012/9/30/138
Lightly run-tested.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Denys Vlasenko <[email protected]>
Acked-by: Kees Cook <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Tested-by: Jason Gunthorpe <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Florian Weimer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to
ensure compatibility of a VMA with userfault. Introducing vma_is_shmem()
allows detection of tmpfs-backed VMAs, so that they may be used with
userfaultfd. The current implementation presumes that vma_is_shmem() is
only used by slow-path routines in userfaultfd, therefore it is implemented
out of line rather than as an inline check that would need one of the few
remaining free bits in vm_flags.
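The helper itself is tiny (sketch; a stub returning false is expected when CONFIG_SHMEM is disabled):
bool vma_is_shmem(struct vm_area_struct *vma)
{
        return vma->vm_ops == &shmem_vm_ops;
}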
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The new routine copy_huge_page_from_user() uses kmap_atomic() to map
PAGE_SIZE pages. However, this prevents page faults in the subsequent
call to copy_from_user(). This is OK in the case where the copy is done
with mmap_sem held. However, in other cases we want to allow page faults.
So, add a new argument allow_pagefault to indicate whether the routine
should allow page faults.
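Condensed sketch of the resulting copy loop (the real routine's return-value convention and error handling are simplified here):
static long copy_huge_page_from_user(struct page *dst_page,
                                     const void __user *usr_src,
                                     unsigned int pages_per_huge_page,
                                     bool allow_pagefault)
{
        void *page_kaddr;
        unsigned long i, rc;
        for (i = 0; i < pages_per_huge_page; i++) {
                if (allow_pagefault)
                        page_kaddr = kmap(dst_page + i);        /* may sleep and fault */
                else
                        page_kaddr = kmap_atomic(dst_page + i); /* page faults disabled */
                rc = copy_from_user(page_kaddr,
                                    usr_src + i * PAGE_SIZE, PAGE_SIZE);
                if (allow_pagefault)
                        kunmap(dst_page + i);
                else
                        kunmap_atomic(page_kaddr);
                if (rc)
                        return -EFAULT;
        }
        return 0;
}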
[[email protected]: unmap the correct pointer]
Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda
[[email protected]: kunmap() takes a page*, per Hugh]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Dan Carpenter <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
support
userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
fault time. The data is copied from user space to a newly allocated
huge page. The new routine copy_huge_page_from_user performs this copy.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michael Rapoport <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
[[email protected]: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler. This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.
I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages
to be wrapped by the parent function tracepoints so the code flow is more
easily understood. Having entry and exit tracepoints for faults also
allows us to easily see what filesystem functions were called during the
fault. These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.
For PMD faults we primarily want to understand the type of mapping, the
fault flags, the faulting address and whether it fell back to 4k faults.
If it fell back to 4k faults the tracepoints should let us understand why.
I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.
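Tracepoint headers follow the standard DECLARE_EVENT_CLASS/DEFINE_EVENT pattern. The sketch below is illustrative only: the class and event names match the description above, but the prototype and recorded fields are assumptions (the real events record considerably more detail, as the example output below shows).
#undef TRACE_SYSTEM
#define TRACE_SYSTEM fs_dax
#if !defined(_TRACE_FS_DAX_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_FS_DAX_H
#include <linux/tracepoint.h>
DECLARE_EVENT_CLASS(dax_pmd_fault_class,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result),
        TP_STRUCT__entry(
                __field(unsigned long, ino)
                __field(unsigned long, address)
                __field(int, result)
        ),
        TP_fast_assign(
                __entry->ino = inode->i_ino;
                __entry->address = vmf->address;
                __entry->result = result;
        ),
        TP_printk("ino %#lx address %#lx result %d",
                  __entry->ino, __entry->address, __entry->result)
);
DEFINE_EVENT(dax_pmd_fault_class, dax_pmd_fault,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result));
DEFINE_EVENT(dax_pmd_fault_class, dax_pmd_fault_done,
        TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
        TP_ARGS(inode, vmf, result));
#endif /* _TRACE_FS_DAX_H */
#include <trace/define_trace.h>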
Here is an example output for these events from a successful PMD fault:
big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
- Errata workarounds for Qualcomm's Falkor CPU
- Qualcomm L2 Cache PMU driver
- Qualcomm SMCCC firmware quirk
- Support for DEBUG_VIRTUAL
- CPU feature detection for userspace via MRS emulation
- Preliminary work for the Statistical Profiling Extension
- Misc cleanups and non-critical fixes
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits)
arm64/kprobes: consistently handle MRS/MSR with XZR
arm64: cpufeature: correctly handle MRS to XZR
arm64: traps: correctly handle MRS/MSR with XZR
arm64: ptrace: add XZR-safe regs accessors
arm64: include asm/assembler.h in entry-ftrace.S
arm64: fix warning about swapper_pg_dir overflow
arm64: Work around Falkor erratum 1003
arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2
arm64: arch_timer: document Hisilicon erratum 161010101
arm64: use is_vmalloc_addr
arm64: use linux/sizes.h for constants
arm64: uaccess: consistently check object sizes
perf: add qcom l2 cache perf events driver
arm64: remove wrong CONFIG_PROC_SYSCTL ifdef
ARM: smccc: Update HVC comment to describe new quirk parameter
arm64: do not trace atomic operations
ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device()
ACPI/IORT: Fix iort_node_get_id() mapping entries indexing
arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA
perf: xgene: Include module.h
...
|
|
Certain architectures may have the kernel image mapped separately to
alias the linear map. Introduce a macro lm_alias to translate a kernel
image symbol into its linear alias. This is used in part with work to
add CONFIG_DEBUG_VIRTUAL support for arm64.
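The macro itself is small (sketch of the include/linux/mm.h addition; architectures can provide their own definition):
#ifndef lm_alias
#define lm_alias(x)     __va(__pa_symbol(x))
#endif
So, e.g., lm_alias(empty_zero_page) yields the linear-map address of a symbol that was referenced via its kernel-image alias.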
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Signed-off-by: Laura Abbott <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
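Condensed sketch of the new PMD handling; the helper name and factoring are hypothetical - in the actual patch this logic sits inline in dax_mapping_entry_mkclean() under the PMD lock, with the pfn lookup omitted here:
static void dax_clean_pmd(struct vm_area_struct *vma, unsigned long address,
                          pmd_t *pmdp, unsigned long pfn)
{
        pmd_t pmd;
        if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
                return;
        flush_cache_page(vma, address, pfn);
        pmd = pmdp_huge_clear_flush(vma, address, pmdp);
        pmd = pmd_wrprotect(pmd);
        pmd = pmd_mkclean(pmd);
        set_pmd_at(vma->vm_mm, address, pmdp, pmd);
}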
Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Write protect DAX PMDs in *sync path".
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss, as detailed in patch 2.
This series is based on Dan's "libnvdimm-pending" branch, which is the
current home for Jan's "dax: Page invalidation fixes" series. You can
find a working tree here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
This patch (of 2):
Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
huge page PMD leaf to be found and returned.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.
This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clear the bit.
The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).
This improves the performance of page lock intensive microbenchmarks by
2-3%.
Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).
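The essence of the fast path (condensed sketch of mm/filemap.c; memory barriers and the actual wake helper are simplified):
static void wake_up_page(struct page *page, int bit)
{
        /* PG_waiters lives in page->flags, the word the unlock path just
         * modified, so this test avoids loading the hashed waitqueue's
         * cacheline when nobody is waiting. */
        if (!PageWaiters(page))
                return;
        wake_up_page_bit(page, bit);
}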
Signed-off-by: Nicholas Piggin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Bob Peterson <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: Andrew Lutomirski <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge more updates from Andrew Morton:
- a few misc things
- kexec updates
- DMA-mapping updates to better support networking DMA operations
- IPC updates
- various MM changes to improve DAX fault handling
- lots of radix-tree changes, mainly to the test suite. All leading up
to reimplementing the IDA/IDR code to be a wrapper layer over the
radix-tree. However the final trigger-pulling patch is held off for
4.11.
* emailed patches from Andrew Morton <[email protected]>: (114 commits)
radix tree test suite: delete unused rcupdate.c
radix tree test suite: add new tag check
radix-tree: ensure counts are initialised
radix tree test suite: cache recently freed objects
radix tree test suite: add some more functionality
idr: reduce the number of bits per level from 8 to 6
rxrpc: abstract away knowledge of IDR internals
tpm: use idr_find(), not idr_find_slowpath()
idr: add ida_is_empty
radix tree test suite: check multiorder iteration
radix-tree: fix replacement for multiorder entries
radix-tree: add radix_tree_split_preload()
radix-tree: add radix_tree_split
radix-tree: add radix_tree_join
radix-tree: delete radix_tree_range_tag_if_tagged()
radix-tree: delete radix_tree_locate_item()
radix-tree: improve multiorder iterators
btrfs: fix race in btrfs_free_dummy_fs_info()
radix-tree: improve dump output
radix-tree: make radix_tree_find_next_bit more useful
...
|
|
DAX will need to implement its own version of page_check_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Provide a helper function for finishing write faults due to PTE being
read-only. The helper will be used by DAX to avoid the need to complicate
generic MM code with DAX locking specifics.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move final handling of COW faults from generic code into DAX fault
handler. That way generic code doesn't have to be aware of
peculiarities of DAX locking so remove that knowledge and make locking
functions private to fs/dax.c.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce finish_fault() as a helper function for finishing page faults.
It is a rather thin wrapper around alloc_set_pte(), but since we'd want to
call this from DAX code or filesystems, it is still useful to avoid some
boilerplate code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "dax: Clear dirty bits after flushing caches", v5.
Patchset to clear dirty bits from radix tree of DAX inodes when caches
for corresponding pfns have been flushed. In principle, these patches
enable handlers to easily update PTEs and do other work necessary to
finish the fault without duplicating the functionality present in the
generic code. I'd like to thank Kirill and Ross for reviews of the
series!
This patch (of 20):
To allow full handling of COW faults add memcg field to struct vm_fault
and a return value of ->fault() handler meaning that COW fault is fully
handled and memcg charge must not be canceled. This will allow us to
remove knowledge about special DAX locking from the generic fault code.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault.
This also allows us to save some passing of extra arguments around.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Every single user of vmf->virtual_address cast that entry to unsigned
long before doing anything with it, so the type of virtual_address does
not really provide us any additional safety. Just use the masked
vmf->address, which already has the appropriate type.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently we have two different structures for passing fault information
around - struct vm_fault and struct fault_env. DAX will need more
information in struct vm_fault to handle its faults, so the content of
that structure would become even closer to fault_env. Furthermore it
would need to generate struct fault_env to be able to call some of the
generic functions. So at this point I don't think there's much use in
keeping these two structures separate. Just embed into struct vm_fault
all that is needed to use it for both purposes.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Unexport the low-level __get_user_pages_unlocked() function and replace
invocations with calls to more appropriate higher-level functions.
In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
with get_user_pages_unlocked() since we can now pass gup_flags.
In async_pf_execute() and process_vm_rw_single_vec() we need to pass
different tsk, mm arguments so get_user_pages_remote() is the sane
replacement in these cases (having added manual acquisition and release
of mmap_sem.)
Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
flag. However, this flag was originally silently dropped by commit
1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
appears to have been unintentional and reintroducing it is therefore not
an issue.
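The replacement pattern for the foreign-mm callers looks roughly like this (illustrative, not the exact diff; the wrapper name and arguments are placeholders):
static long pin_remote_pages(struct task_struct *tsk, struct mm_struct *mm,
                             unsigned long start, unsigned long nr_pages,
                             unsigned int gup_flags, struct page **pages)
{
        long pinned;
        down_read(&mm->mmap_sem);
        pinned = get_user_pages_remote(tsk, mm, start, nr_pages,
                                       gup_flags, pages, NULL, NULL);
        up_read(&mm->mmap_sem);
        return pinned;
}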
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquired mmap_sem).
This patch (of 2):
Add an int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for its
ability to allow VM_FAULT_RETRY behaviour; this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[[email protected]: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krcmar <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file. This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.
As the only way to read such an mm is through access_process_vm, spin off
a variant called ptrace_access_vm that will fail if the target process is
not being ptraced by the current process, or if the current process did
not have sufficient privileges to read the target process's mm when
ptracing began.
In the ptrace implementations replace access_process_vm by
ptrace_access_vm. There remain several ptrace sites that still use
access_process_vm, as they are reading the target executable's
instructions (for kernel consumption) or register stacks. As such it
does not appear necessary to add a permission check to those calls.
This bug has always existed in Linux.
Fixes: v1.0
Cc: [email protected]
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
This patch unexports the low-level __get_user_pages() function.
Recent refactoring of the get_user_pages* functions allows flags to be
passed through get_user_pages(), which eliminates the need for access to
this function from its one user, kvm.
We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:
get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|