path: root/mm/memory.c
Age | Commit message | Author | Files | Lines
2016-12-14 | mm: use vmf->page during WP faults | Jan Kara | 1 | -29/+29
So far we set vmf->page during WP faults only when we needed to pass it to the ->page_mkwrite handler. Set it in all the cases now and use that instead of passing page pointer explicitly around. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: pass vm_fault structure into do_page_mkwrite() | Jan Kara | 1 | -12/+10
We will need more information in the ->page_mkwrite() helper for DAX to be able to fully finish faults there. Pass vm_fault structure to do_page_mkwrite() and use it there so that information propagates properly from upper layers. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
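For illustration only, a hedged sketch of roughly what the helper can look like once it takes the vm_fault structure; the body is reconstructed from the changelog rather than quoted from the tree, and ->page_mkwrite() itself still receives the vma separately at this point:

    static int do_page_mkwrite(struct vm_fault *vmf)
    {
            int ret;
            struct page *page = vmf->page;
            unsigned int old_flags = vmf->flags;

            vmf->flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
            ret = vmf->vma->vm_ops->page_mkwrite(vmf->vma, vmf);
            /* restore the original flags so later fault code is not confused */
            vmf->flags = old_flags;
            if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
                    return ret;
            if (unlikely(!(ret & VM_FAULT_LOCKED))) {
                    lock_page(page);
                    if (!page->mapping) {
                            /* page was truncated under us: let the caller retry */
                            unlock_page(page);
                            return 0;
                    }
                    ret |= VM_FAULT_LOCKED;
            }
            return ret;
    }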
2016-12-14 | mm: factor out common parts of write fault handling | Jan Kara | 1 | -41/+37
Currently we duplicate handling of shared write faults in wp_page_reuse() and do_shared_fault(). Factor them out into a common function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: move handling of COW faults into DAX code | Jan Kara | 1 | -9/+4
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: factor out functionality to finish page faults | Jan Kara | 1 | -9/+35
Introduce finish_fault() as a helper function for finishing page faults. It is a rather thin wrapper around alloc_set_pte(), but since we'd want to call this from DAX code or filesystems, it is still useful to avoid some boilerplate code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
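A rough sketch of the shape of such a helper, assuming the field names used elsewhere in this series (cow_page, memcg, page); the in-tree finish_fault() may differ in detail:

    int finish_fault(struct vm_fault *vmf)
    {
            struct page *page;
            int ret;

            /* a private COW fault maps the private copy, everything else maps ->page */
            if ((vmf->flags & FAULT_FLAG_WRITE) &&
                !(vmf->vma->vm_flags & VM_SHARED))
                    page = vmf->cow_page;
            else
                    page = vmf->page;

            ret = alloc_set_pte(vmf, vmf->memcg, page);
            if (vmf->pte)
                    pte_unmap_unlock(vmf->pte, vmf->ptl);
            return ret;
    }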
2016-12-14 | mm: allow full handling of COW faults in ->fault handlers | Jan Kara | 1 | -7/+7
Patch series "dax: Clear dirty bits after flushing caches", v5. Patchset to clear dirty bits from radix tree of DAX inodes when caches for corresponding pfns have been flushed. In principle, these patches enable handlers to easily update PTEs and do other work necessary to finish the fault without duplicating the functionality present in the generic code. I'd like to thank Kirill and Ross for reviews of the series! This patch (of 20): To allow full handling of COW faults add memcg field to struct vm_fault and a return value of ->fault() handler meaning that COW fault is fully handled and memcg charge must not be canceled. This will allow us to remove knowledge about special DAX locking from the generic fault code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: add orig_pte field into vm_fault | Jan Kara | 1 | -41/+41
Add orig_pte field to vm_fault structure to allow ->page_mkwrite handlers to fully handle the fault. This also allows us to save some passing of extra arguments around. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: use passed vm_fault structure in wp_pfn_shared() | Jan Kara | 1 | -7/+2
Instead of creating another vm_fault structure, use the one passed to wp_pfn_shared() for passing arguments into pfn_mkwrite handler. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: trim __do_fault() arguments | Jan Kara | 1 | -38/+29
Use vm_fault structure to pass cow_page, page, and entry in and out of the function. That reduces number of __do_fault() arguments from 4 to 1. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: use passed vm_fault structure in __do_fault() | Jan Kara | 1 | -15/+10
Instead of creating another vm_fault structure, use the one passed to __do_fault() for passing arguments into fault handler. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: use pgoff in struct vm_fault instead of passing it separately | Jan Kara | 1 | -17/+18
struct vm_fault already has a pgoff entry. Use it instead of passing pgoff as a separate argument and then assigning it later. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: use vmf->address instead of vmf->virtual_address | Jan Kara | 1 | -5/+4
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | mm: join struct fault_env and vm_fault | Jan Kara | 1 | -282/+286
Currently we have two different structures for passing fault information around - struct vm_fault and struct fault_env. DAX will need more information in struct vm_fault to handle its faults so the content of that structure would become even closer to fault_env. Furthermore it would need to generate struct fault_env to be able to call some of the generic functions. So at this point I don't think there's much use in keeping these two structures separate. Just embed into struct vm_fault all that is needed to use it for both purposes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
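Purely as an illustration, the combined context object the series describes could look like this; the field list is assembled from the changelogs above, and the authoritative definition in include/linux/mm.h differs in layout and comments:

    struct vm_fault {
            struct vm_area_struct *vma;     /* target VMA */
            unsigned int flags;             /* FAULT_FLAG_xxx flags */
            gfp_t gfp_mask;                 /* gfp mask for allocations */
            pgoff_t pgoff;                  /* logical page offset in the file */
            unsigned long address;          /* faulting virtual address */
            pmd_t *pmd;                     /* pmd covering 'address' */
            pte_t orig_pte;                 /* pte value at fault time */
            struct page *cow_page;          /* private copy for COW faults */
            struct mem_cgroup *memcg;       /* charge taken for cow_page */
            struct page *page;              /* page returned by ->fault() */
            pte_t *pte;                     /* mapped pte, if any */
            spinlock_t *ptl;                /* page table lock for 'pte' */
            pgtable_t prealloc_pte;         /* pre-allocated pte page table */
    };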
2016-12-14 | mm: add locked parameter to get_user_pages_remote() | Lorenzo Stoakes | 1 | -1/+1
Patch series "mm: unexport __get_user_pages_unlocked()". This patch series continues the cleanup of get_user_pages*() functions taking advantage of the fact we can now pass gup_flags as we please. It firstly adds an additional 'locked' parameter to get_user_pages_remote() to allow for its callers to utilise VM_FAULT_RETRY functionality. This is necessary as the invocation of __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of this and no other existing higher level function would allow it to do so. Secondly existing callers of __get_user_pages_unlocked() are replaced with the appropriate higher-level replacement - get_user_pages_unlocked() if the current task and memory descriptor are referenced, or get_user_pages_remote() if other task/memory descriptors are referenced (having acquiring mmap_sem.) This patch (of 2): Add a int *locked parameter to get_user_pages_remote() to allow VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked(). Taking into account the previous adjustments to get_user_pages*() functions allowing for the passing of gup_flags, we are now in a position where __get_user_pages_unlocked() need only be exported for his ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to subsequently unexport __get_user_pages_unlocked() as well as allowing for future flexibility in the use of get_user_pages_remote(). [[email protected]: merge fix for get_user_pages_remote API change] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Jan Kara <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-14 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace | Linus Torvalds | 1 | -1/+1
Pull namespace updates from Eric Biederman: "After a lot of discussion and work we have finally reached a basic understanding of what is necessary to make unprivileged mounts safe in the presence of EVM and IMA xattrs which the last commit in this series reflects. While technically it is a revert, the comments it adds are important for people not getting confused in the future. Clearing up that confusion allows us to seriously work on unprivileged mounts of fuse in the next development cycle. The rest of the fixes in this set are in the intersection of user namespaces, ptrace, and exec. I started with the first fix which started a feedback cycle of finding additional issues during review and fixing them. Culminating in a fix for a bug that has been present since at least Linux v1.0. Potentially these fixes were candidates for being merged during the rc cycle, and are certainly backport candidates, but enough little things turned up during review and testing that I decided they should be handled as part of the normal development process just to be certain there were not any great surprises when it came time to backport some of these fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC" exec: Ensure mm->user_ns contains the execed files ptrace: Don't allow accessing an undumpable mm ptrace: Capture the ptracer's creds not PT_PTRACE_CAP mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
2016-12-13 | Merge tag 'char-misc-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc | Linus Torvalds | 1 | -0/+1
Pull char/misc driver updates from Greg KH: "Here's the big char/misc driver patches for 4.10-rc1. Lots of tiny changes over lots of "minor" driver subsystems, the largest being some new FPGA drivers. Other than that, a few other new drivers, but no new driver subsystems added for this kernel cycle, a nice change. All of these have been in linux-next with no reported issues" * tag 'char-misc-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (107 commits) uio-hv-generic: store physical addresses instead of virtual Tools: hv: kvp: configurable external scripts path uio-hv-generic: new userspace i/o driver for VMBus vmbus: add support for dynamic device id's hv: change clockevents unbind tactics hv: acquire vmbus_connection.channel_mutex in vmbus_free_channels() hyperv: Fix spelling of HV_UNKOWN mei: bus: enable non-blocking RX mei: fix the back to back interrupt handling mei: synchronize irq before initiating a reset. VME: Remove shutdown entry from vme_driver auxdisplay: ht16k33: select framebuffer helper modules MAINTAINERS: add git url for fpga fpga: Clarify how write_init works streaming modes fpga zynq: Fix incorrect ISR state on bootup fpga zynq: Remove priv->dev fpga zynq: Add missing \n to messages fpga: Add COMPILE_TEST to all drivers uio: pruss: add clk_disable() char/pcmcia: add some error checking in scr24x_read() ...
2016-12-12 | mm: THP page cache support for ppc64 | Aneesh Kumar K.V | 1 | -10/+50
Add an arch-specific callback in the generic THP page cache code that will deposit and withdraw a preallocated page table. Archs like ppc64 use this preallocated table to store the hash pte slot information. Testing: kernel build of the patch series on tmpfs mounted with option huge=always. The related thp stats: thp_fault_alloc 72939 thp_fault_fallback 60547 thp_collapse_alloc 603 thp_collapse_alloc_failed 0 thp_file_alloc 253763 thp_file_mapped 4251 thp_split_page 51518 thp_split_page_failed 1 thp_deferred_split_page 73566 thp_split_pmd 665 thp_zero_page_alloc 3 thp_zero_page_alloc_failed 0 [[email protected]: remove unneeded parentheses, per Kirill] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
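As a sketch of what such an arch callback can look like, assuming a boolean predicate along the lines of arch_needs_pgtable_deposit() (the helper name and wiring are inferred from the description above, not quoted from the tree):

    /* generic fallback: most architectures do not need a deposited page
     * table for file-backed THP; ppc64 would override this so the hash pte
     * slot information mentioned above has somewhere to live */
    #ifndef arch_needs_pgtable_deposit
    static inline bool arch_needs_pgtable_deposit(void)
    {
            return false;
    }
    #endif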
2016-12-12 | mm: remove the page size change check in tlb_remove_page | Aneesh Kumar K.V | 1 | -15/+6
Now that we check for the page size change early in the loop, we can partially revert e9d55e157034a ("mm: change the interface for __tlb_remove_page"). This simplifies the code a lot by removing the need to track the last address with which we adjusted the range. We also go back to the older way of filling the mmu_gather array, i.e., we add an entry and then check whether the gather batch is full. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12 | mm: add tlb_remove_check_page_size_change to track page size change | Aneesh Kumar K.V | 1 | -1/+6
With commit e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force flush if page size change") we added the ability to force a tlb flush when the page size changes in a mmu_gather loop. We did that by checking for a page size change every time we added a page to mmu_gather for lazy flush/remove. We can improve that by moving the page size change check early and not doing it every time we add a page. This also helps us to do a tlb flush when invalidating a range covering a dax mapping. With dax mappings we don't have a backing struct page and hence we don't call tlb_remove_page, which earlier forced the tlb flush on page size change. Moving the page size change check earlier means we will do the same even for dax mappings. We also avoid doing this check on architectures other than powerpc. In a later patch we will remove the page size check from tlb_remove_page(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
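A hedged sketch of the intended calling pattern; the helper name is taken from the subject line and its argument order is an assumption:

    static void start_pte_batch(struct mmu_gather *tlb)
    {
            /* declare the page size once for the whole batch of 4k entries,
             * rather than re-checking it for every page that gets removed */
            tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
    }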
2016-12-12 | mm, thp: avoid unlikely branches for split_huge_pmd | David Rientjes | 1 | -2/+2
While doing MADV_DONTNEED on a large area of thp memory, I noticed we encountered many unlikely() branches in profiles for each backing hugepage. This is because zap_pmd_range() would call split_huge_pmd(), which rechecked the conditions that were already validated, but as part of an unlikely() branch. Avoid the unlikely() branch when in a context where pmd is known to be good for __split_huge_pmd() directly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-11-22 | ptrace: Don't allow accessing an undumpable mm | Eric W. Biederman | 1 | -1/+1
It is the reasonable expectation that if an executable file is not readable there will be no way for a user without special privileges to read the file. This is enforced in ptrace_attach but if ptrace is already attached before exec there is no enforcement for read-only executables. As the only way to read such an mm is through access_process_vm spin a variant called ptrace_access_vm that will fail if the target process is not being ptraced by the current process, or the current process did not have sufficient privileges when ptracing began to read the target processes mm. In the ptrace implementations replace access_process_vm by ptrace_access_vm. There remain several ptrace sites that still use access_process_vm as they are reading the target executables instructions (for kernel consumption) or register stacks. As such it does not appear necessary to add a permission check to those calls. This bug has always existed in Linux. Fixes: v1.0 Cc: [email protected] Reported-by: Andy Lutomirski <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2016-11-10 | lkdtm: Do not use flush_icache_range() on user addresses | Catalin Marinas | 1 | -0/+1
The flush_icache_range() API is meant to be used on kernel addresses only as it may not have the infrastructure (exception entries) to handle user memory faults. The lkdtm execute_user_location() function tests the kernel execution of user space addresses by mmap'ing an anonymous page, copying some code together with cache maintenance and attempting to run it. However, the cache maintenance step may fail because of the incorrect API usage described above. The patch changes lkdtm to use access_process_vm() for copying the code into user space which would take care of the necessary cache maintenance. Signed-off-by: Catalin Marinas <[email protected]> [kees: export access_process_vm() for module use] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2016-11-09 | x86/pat, mm: Make track_pfn_insert() return void | Borislav Petkov | 1 | -4/+4
It only returns 0 so we can save us the testing of its retval everywhere. Signed-off-by: Borislav Petkov <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Brian Gerst <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Andy Lutomirski <[email protected]> Cc: Dave Airlie <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-10-19 | mm: replace access_process_vm() write parameter with gup_flags | Lorenzo Stoakes | 1 | -6/+2
This removes the 'write' argument from access_process_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Jesper Nilsson <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Michael Ellerman <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
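For example, a caller that previously passed write = 1 (and silently got FOLL_FORCE) now spells the behaviour out; a hedged sketch of the converted call, with the wrapper name made up for illustration:

    static int poke_remote(struct task_struct *tsk, unsigned long addr,
                           void *buf, int len)
    {
            /* old form: access_process_vm(tsk, addr, buf, len, 1); */
            return access_process_vm(tsk, addr, buf, len,
                                     FOLL_FORCE | FOLL_WRITE);
    }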
2016-10-19 | mm: replace access_remote_vm() write parameter with gup_flags | Lorenzo Stoakes | 1 | -8/+3
This removes the 'write' argument from access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-19 | mm: replace __access_remote_vm() write parameter with gup_flags | Lorenzo Stoakes | 1 | -8/+15
This removes the 'write' argument from __access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-19 | mm: replace get_user_pages_remote() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+5
This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07 | mm: fix cache mode tracking in vm_insert_mixed() | Dan Williams | 1 | -2/+6
vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(), fails to check the pgprot_t it uses for the mapping against the one recorded in the memtype tracking tree. Add the missing call to track_pfn_insert() to preclude cases where incompatible aliased mappings are established for a given physical address range. Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Cc: David Airlie <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ross Zwisler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-10-07 | mm: make sure that kthreads will not refault oom reaped memory | Michal Hocko | 1 | -0/+13
There are only a few use_mm() users in the kernel right now. Most of them write to the target memory but the vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tied to the mm owner's life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get an error code. Regular tasks do not need this protection because all tasks which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke the oom reaper for tasks with an mm shared with kthreads yet. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: "Michael S. Tsirkin" <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
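A sketch of the check described above as it might sit at the end of the fault path; the condition is assembled from the changelog and the exact in-tree hunk may differ:

    static int handle_mm_fault_sketch(struct vm_area_struct *vma,
                                      unsigned long address, unsigned int flags)
    {
            int ret = __handle_mm_fault(vma, address, flags);

            /* a kthread borrowing another task's mm must never see pages the
             * oom reaper already unmapped: fail with SIGBUS instead */
            if (unlikely((current->flags & PF_KTHREAD) &&
                         !(ret & VM_FAULT_ERROR) &&
                         test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
                    ret = VM_FAULT_SIGBUS;
            return ret;
    }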
2016-09-30 | Merge branch 'linus' into sched/core, to pick up fixes | Ingo Molnar | 1 | -5/+7
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-25 | mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing | Lorenzo Stoakes | 1 | -5/+7
The NUMA balancing logic uses an arch-specific PROT_NONE page table flag defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page PMDs respectively as requiring balancing upon a subsequent page fault. User-defined PROT_NONE memory regions which also have this flag set will not normally invoke the NUMA balancing code as do_page_fault() will send a segfault to the process before handle_mm_fault() is even called. However if access_remote_vm() is invoked to access a PROT_NONE region of memory, handle_mm_fault() is called via faultin_page() and __get_user_pages() without any access checks being performed, meaning the NUMA balancing logic is incorrectly invoked on a non-NUMA memory region. A simple means of triggering this problem is to access PROT_NONE mmap'd memory using /proc/self/mem which reliably results in the NUMA handling functions being invoked when CONFIG_NUMA_BALANCING is set. This issue was reported in bugzilla (issue 99101) which includes some simple repro code. There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page() added at commit c0e7cad to avoid accidentally provoking strange behaviour by attempting to apply NUMA balancing to pages that are in fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro. This patch moves the PROT_NONE check into mm/memory.c rather than invoking BUG_ON() as faulting in these pages via faultin_page() is a valid reason for reaching the NUMA check with the PROT_NONE page table flag set and is therefore not always a bug. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101 Reported-by: Trevor Saunders <[email protected]> Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
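The guard can be sketched as a small predicate in front of the NUMA path; names here are illustrative, not the exact in-tree code:

    static bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t entry)
    {
            /* a genuine PROT_NONE mapping reached via faultin_page()/GUP must
             * take the normal fault path, not the balancing code */
            if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
                    return false;
            return pte_protnone(entry);
    }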
2016-09-13 | sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags | Rik van Riel | 1 | -1/+1
Commit: 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations") changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly found to introduce a regression with NUMA grouping. It was followed up by these commits: 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic") bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags") b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault") The first two of those commits try alternate approaches to NUMA grouping, which apparently do not work as well as looking at the PTE write permissions. The latter patch preserves the PTE write permissions across a NUMA protection fault. However, it forgets to revert the condition for whether or not to group tasks together back to what it was before v3.19, even though the information is now preserved in the page tables once again. This patch brings the NUMA grouping heuristic back to what it was before commit 4d9424669946, which the changelogs of subsequent commits suggest worked best. We have all the information again. We should probably use it. Signed-off-by: Rik van Riel <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-08-02 | mm: move swap-in anonymous page into active list | Minchan Kim | 1 | -0/+1
Every swap-in anonymous page starts from the inactive lru list's head. It should be activated unconditionally when the VM decides to reclaim it, because the page table entry for the page usually has the accessed bit set. Thus, their window size for getting a new reference is 2 * NR_inactive + NR_active while for others it is NR_inactive + NR_active. It's not fair that it has more chance to be referenced compared to other newly allocated pages which start from the active lru list's head. Johannes: : The page can still have a valid copy on the swap device, so preferring to : reclaim that page over a fresh one could make sense. But as you point : out, having it start inactive instead of active actually ends up giving it : *more* LRU time, and that seems to be without justification. Rik: : The reason newly read in swap cache pages start on the inactive list is : that we do some amount of read-around, and do not know which pages will : get used. : : However, immediately activating the ones that DO get used, like your patch : does, is the right thing to do. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-08-02 | mm: fail prefaulting if page table allocation fails | Vegard Nossum | 1 | -0/+2
I ran into this: BUG: sleeping function called from invalid context at mm/page_alloc.c:3784 in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1 2 locks held by trinity-c1/1434: #0: (&mm->mmap_sem){......}, at: [<ffffffff810ce31e>] __do_page_fault+0x1ce/0x8f0 #1: (rcu_read_lock){......}, at: [<ffffffff81378f86>] filemap_map_pages+0xd6/0xdd0 CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x65/0x84 panic+0x185/0x2dd ___might_sleep+0x51c/0x600 __might_sleep+0x90/0x1a0 __alloc_pages_nodemask+0x5b1/0x2160 alloc_pages_current+0xcc/0x370 pte_alloc_one+0x12/0x90 __pte_alloc+0x1d/0x200 alloc_set_pte+0xe3e/0x14a0 filemap_map_pages+0x42b/0xdd0 handle_mm_fault+0x17d5/0x28b0 __do_page_fault+0x310/0x8f0 trace_do_page_fault+0x18d/0x310 do_async_page_fault+0x27/0xa0 async_page_fault+0x28/0x30 The important bits from the above is that filemap_map_pages() is calling into the page allocator while holding rcu_read_lock (sleeping is not allowed inside RCU read-side critical sections). According to Kirill Shutemov, the prefaulting code in do_fault_around() is supposed to take care of this, but missing error handling means that the allocation failure can go unnoticed. We don't need to return VM_FAULT_OOM (or any other error) here, since we can just let the normal fault path try again. Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vegard Nossum <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Hillf Danton" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE | Kirill A. Shutemov | 1 | -2/+3
For file mappings, we don't deposit page tables on THP allocation because it's not strictly required to implement split_huge_pmd(): we can just clear the pmd and let subsequent page faults reconstruct the page table. But Power makes use of the deposited page table to address an MMU quirk. Let's hide THP page cache, including huge tmpfs, under a separate config option, so it can be forbidden on Power. We can revert the patch later once a solution for Power is found. Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | shmem: add huge pages support | Kirill A. Shutemov | 1 | -1/+1
Here's a basic implementation of huge pages support for shmem/tmpfs. It's all pretty straightforward: - shmem_getpage() allocates a huge page if it can and tries to insert it into the radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto the radix tree if there's space for it; - shmem_undo_range() removes a huge page if it lies fully within the range. Partial truncation of a huge page zeroes out that part of the THP. This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create a hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending on what pages happened to be allocated. - no need to change shmem_fault: core-mm will map a compound page as huge if the VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | thp: handle file COW faults | Kirill A. Shutemov | 1 | -0/+5
File COW for THP is handled on the pte level: just split the pmd. It's not clear how beneficial allocating huge pages on COW faults would be, and it would require some code to make them work. I think at some point we can consider teaching khugepaged to collapse pages in COW mappings, but allocating huge pages on fault is probably overkill. Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | thp, vmstats: add counters for huge file pages | Kirill A. Shutemov | 1 | -0/+1
THP_FILE_ALLOC: how many times a huge page was allocated and put into the page cache. THP_FILE_MAPPED: how many times a file huge page was mapped. Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | mm: introduce do_set_pmd() | Kirill A. Shutemov | 1 | -1/+71
With postponed page table allocation we have a chance to set up huge pages. do_set_pte() calls do_set_pmd() if the following criteria are met: - the page is compound; - the pmd entry is pmd_none(); - the vma has suitable size and alignment; Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
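A hedged sketch of that decision, using the fault_env naming introduced earlier in this series; the real do_set_pmd() also builds and installs the pmd entry itself:

    static bool fault_can_use_pmd(struct fault_env *fe, struct page *page)
    {
            unsigned long haddr = fe->address & HPAGE_PMD_MASK;

            if (!PageTransCompound(page))
                    return false;           /* not a huge page */
            if (!pmd_none(*fe->pmd))
                    return false;           /* something is already mapped here */
            /* the VMA must cover the whole aligned huge-page range */
            if (haddr < fe->vma->vm_start ||
                haddr + HPAGE_PMD_SIZE > fe->vma->vm_end)
                    return false;
            return true;
    }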
2016-07-26 | rmap: support file thp | Kirill A. Shutemov | 1 | -2/+2
Naive approach: on mapping/unmapping the page as compound we update ->_mapcount on each 4k page. That's not efficient, but it's not obvious how we can optimize this. We can look into optimization later. The PG_double_map optimization doesn't work for file pages since the lifecycle of file pages differs from that of anon pages: a file page can be mapped again at any time. Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | mm: postpone page table allocation until we have page to map | Kirill A. Shutemov | 1 | -120/+178
The idea (and most of the code) is borrowed again from Hugh's patchset on huge tmpfs[1]. Instead of allocating the pte page table upfront, we postpone this until we have a page to map in hand. This approach opens the possibility to map the page as huge if the filesystem supports this. Compared to Hugh's patch, I've pushed page table allocation a bit further: into do_set_pte(). This way we can postpone allocation even in the faultaround case without moving do_fault_around() after __do_fault(). do_set_pte() got renamed to alloc_set_pte() as it can allocate a page table if required. [1] http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
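A simplified sketch of the idea, again using the fault_env naming from this series; the real alloc_set_pte() additionally handles huge pages, faultaround and memcg accounting:

    static int map_new_page(struct fault_env *fe, struct page *page, pte_t entry)
    {
            struct mm_struct *mm = fe->vma->vm_mm;

            if (!fe->pte) {
                    /* first mapping under this pmd: only now, when there is
                     * actually a page to map, allocate the pte page table */
                    if (pte_alloc(mm, fe->pmd, fe->address))
                            return VM_FAULT_OOM;
                    fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address,
                                                  &fe->ptl);
            }
            set_pte_at(mm, fe->address, fe->pte, entry);
            update_mmu_cache(fe->vma, fe->address, fe->pte);
            pte_unmap_unlock(fe->pte, fe->ptl);
            return 0;
    }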
2016-07-26 | mm: introduce fault_env | Kirill A. Shutemov | 1 | -304/+278
The idea is borrowed from Peter's patch from the patchset on speculative page faults[1]: Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes. The changes are mostly mechanical with the exception of the faultaround code: filemap_map_pages() got reworked a bit. This patch is preparation for the next one. [1] http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | mm: do not pass mm_struct into handle_mm_fault | Kirill A. Shutemov | 1 | -6/+7
We always have vma->vm_mm around. Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
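Caller-side, the change looks roughly like this (the wrapper and the flags value are illustrative):

    static int fault_in_one_page(struct vm_area_struct *vma, unsigned long address)
    {
            /* previously: handle_mm_fault(vma->vm_mm, vma, address, flags) */
            return handle_mm_fault(vma, address, FAULT_FLAG_WRITE);
    }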
2016-07-26 | mm: make swapin readahead to improve thp collapse rate | Ebru Akagunduz | 1 | -1/+1
This patch adds swapin readahead to improve the thp collapse rate. When khugepaged scans pages, a few of them may be in the swap area. With the patch, THP can collapse 4kB pages into a THP when there are up to max_ptes_swap swap ptes in a 2MB range. The patch was tested with a test program that allocates 400B of memory, writes to it, and then sleeps. I force the system to swap it all out. Afterwards, the test program touches the area by writing to it, skipping one page in every 20 pages of the area. Without the patch, the system did not do swapin readahead. The THP rate was 65% of the program's memory, and it did not change over time. With this patch, after 10 minutes of waiting khugepaged had collapsed 99% of the program's memory. [[email protected]: trivial cleanup of exit path of the function] [[email protected]: __collapse_huge_page_swapin(): drop unused 'pte' parameter] [[email protected]: do not hold anon_vma lock during swap in] Signed-off-by: Ebru Akagunduz <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Xie XiuQi <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | mm/mmu_gather: track page size with mmu gather and force flush if page size change | Aneesh Kumar K.V | 1 | -1/+9
This allows an arch which needs to do special handling with respect to different page sizes when flushing the tlb to implement the same in mmu gather. Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26 | mm: change the interface for __tlb_remove_page() | Aneesh Kumar K.V | 1 | -6/+13
This updates the generic and arch specific implementation to return true if we need to do a tlb flush. That means if __tlb_remove_page() indicates a flush is needed, the page we are trying to remove needs to be tracked and added again after the flush. We need to track it because we have already updated the pte to none and we can't just loop back. This change is done to enable us to do a tlb_flush when we try to flush a range that consists of different page sizes. For architectures like ppc64, we can do a range based tlb flush and we need to track the page size for that. When we try to remove a huge page, we will force a tlb flush and start a new mmu gather. [[email protected]: mm-change-the-interface-for-__tlb_remove_page-v3] Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
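A hedged sketch of the calling pattern implied by the new contract; the helper name is made up, and the page's pte is assumed to have been cleared already by the caller:

    static void queue_page_for_flush(struct mmu_gather *tlb, struct page *page)
    {
            if (__tlb_remove_page(tlb, page)) {
                    /* batch full or page size changed: flush, then retry the
                     * same page so it is not lost */
                    tlb_flush_mmu(tlb);
                    __tlb_remove_page(tlb, page);
            }
    }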
2016-07-15 | mm: thp: refix false positive BUG in page_move_anon_rmap() | Hugh Dickins | 1 | -2/+1
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's worth: the syzkaller fuzzer hit it again. It's still wrong for some THP cases, because linear_page_index() was never intended to apply to addresses before the start of a vma. That's easily fixed with a signed long cast inside linear_page_index(); and Dmitry has tested such a patch, to verify the false positive. But why extend linear_page_index() just for this case? when the avoidance in page_move_anon_rmap() has already grown ugly, and there's no reason for the check at all (nothing else there is using address or index). Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE, remove CONFIG_DEBUG_VM PageTransHuge adjustment. And one more thing: should the compound_head(page) be done inside or outside page_move_anon_rmap()? It's usually pushed down to the lowest level nowadays (and mm/memory.c shows no other explicit use of it), so I think it's better done in page_move_anon_rmap() than by caller. Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Mika Westerberg <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: <[email protected]> [4.5+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-06-24Revert "mm: disable fault around on emulated access bit architecture"Kirill A. Shutemov1-8/+0
This reverts commit d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb. After revert of 5c0a85fad949 ("mm: make faultaround produce old ptes") faultaround doesn't have dependencies on hardware accessed bit, so let's revert this one too. Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: "Huang, Ying" <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-06-24Revert "mm: make faultaround produce old ptes"Kirill A. Shutemov1-18/+5
This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7. The commit causes ~6% regression in unixbench. Let's revert it for now and consider other solution for reclaim problem later. Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: "Huang, Ying" <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-26 | Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds | 1 | -22/+18
Pull DAX locking updates from Ross Zwisler: "Filesystem DAX locking for 4.7 - We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented" * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Remove i_mmap_lock protection dax: Use radix tree entry lock to protect cow faults dax: New fault locking dax: Allow DAX code to replace exceptional entries dax: Define DAX lock bit for radix tree exceptional entry dax: Make huge page handling depend of CONFIG_BROKEN dax: Fix condition for filling of PMD holes