path: root/mm/mempolicy.c
2016-07-26  shmem: add huge pages support  [Kirill A. Shutemov]  (1 file, -1/+1)
Here's a basic implementation of huge pages support for shmem/tmpfs. It's all pretty straightforward:
 - shmem_getpage() allocates a huge page if it can, and tries to insert it into the radix tree with shmem_add_to_page_cache();
 - shmem_add_to_page_cache() puts the page onto the radix tree if there's space for it;
 - shmem_undo_range() removes huge pages if they fall fully within the range. A partial truncate of a huge page zeroes out that part of the THP. This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour: as we don't really create a hole in this case, lseek(SEEK_HOLE) may return inconsistent results depending on which pages happened to be allocated;
 - no need to change shmem_fault: core-mm will map a compound page as huge if the VMA is suitable.
Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-07-26  mm: thp: check pmd_trans_unstable() after split_huge_pmd()  [Naoya Horiguchi]  (1 file, -0/+2)
split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing to pte entries, which can be checked with pmd_trans_unstable(). Some callers make this assertion and some do it differently and some not, so let's do it in a unified manner. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
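A minimal sketch of the unified pattern inside a page-table walk (illustrative fragment, not the verbatim hunk from this patch; variable names are assumed):

    split_huge_pmd(vma, pmd, addr);
    if (pmd_trans_unstable(pmd))
        return 0;   /* pmd did not end up as a stable pte table; skip this range */
    /* only now is it safe to walk the pte level */
    pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);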
2016-05-19  mm, page_alloc: avoid looking up the first zone in a zonelist twice  [Mel Gorman]  (1 file, -9/+10)
The allocator fast path looks up the first usable zone in a zonelist and then get_page_from_freelist does the same job in the zonelist iterator. This patch preserves the necessary information. 4.6.0-rc2 4.6.0-rc2 fastmark-v1r20 initonce-v1r20 Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%) Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%) Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%) Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%) Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%) Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%) Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%) Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%) Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%) Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%) Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%) Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%) Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%) Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%) Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%) The benefit is negligible and the results are within the noise but each cycle counts. Signed-off-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-19  mm/mempolicy.c:offset_il_node() document and clarify  [Andrew Morton]  (1 file, -9/+11)
This code was pretty obscure and was relying upon obscure side-effects of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1. Clean that all up and document the function's intent. Acked-by: Vlastimil Babka <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-19  include/linux/nodemask.h: create next_node_in() helper  [Andrew Morton]  (1 file, -22/+2)
Lots of code does node = next_node(node, XXX); if (node == MAX_NUMNODES) node = first_node(XXX); so create next_node_in() to do this and use it in various places. [[email protected]: use next_node_in() helper] Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Wang Xiaoqiang <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
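A rough sketch of what the helper replaces, using a generic nodemask (names here are illustrative; the real definition lives in include/linux/nodemask.h):

    nodemask_t allowed;     /* some policy's nodemask */
    int node;

    /* open-coded wrap-around, repeated all over the tree: */
    node = next_node(node, allowed);
    if (node == MAX_NUMNODES)
        node = first_node(allowed);

    /* with the helper: */
    node = next_node_in(node, allowed);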
2016-03-20  Merge branch 'mm-pkeys-for-linus' of ...  [Linus Torvalds]  (1 file, -3/+3)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. 
If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
2016-03-17  mm: coalesce split strings  [Joe Perches]  (1 file, -3/+1)
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <[email protected]> Acked-by: Tejun Heo <[email protected]> [percpu] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-15  mm/mempolicy.c: skip VM_HUGETLB and VM_MIXEDMAP VMA for lazy mbind  [Liang Chen]  (1 file, -1/+3)
VM_HUGETLB and VM_MIXEDMAP VMAs need to be excluded to avoid compound pages being marked for migration and unexpected COWs when handling a hugetlb fault. Thanks to Naoya Horiguchi for reminding me about these checks. Signed-off-by: Liang Chen <[email protected]> Signed-off-by: Gavin Guo <[email protected]> Suggested-by: Naoya Horiguchi <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-09  mm, thp: fix migration of PTE-mapped transparent huge pages  [Kirill A. Shutemov]  (1 file, -1/+1)
We don't have native support for THP migration, so we have to split a huge page into small pages in order to migrate it to a different node. This includes PTE-mapped huge pages. I made a mistake in the refcounting patchset: we don't actually split a PTE-mapped huge page in queue_pages_pte_range() if we step on the head page. The result is that the head page is queued for migration, but none of the tail pages: putting the head page on the queue takes a pin on the page, so any subsequent attempt to split the page would fail and we skip queuing the tail pages. unmap_and_move_huge_page() will eventually split the huge page, but only one of the 512 pages would get migrated. Let's fix the situation. Fixes: 248db92da13f2507 ("migrate_pages: try to split pages on queuing") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-02-16  mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm  [Dave Hansen]  (1 file, -3/+3)
We will soon modify the vanilla get_user_pages() so it can no longer be used on mm/tasks other than 'current/current->mm', which is by far the most common way it is called. For now, we allow the old-style calls, but warn when they are used. (implemented in previous patch) This patch switches all callers of: get_user_pages() get_user_pages_unlocked() get_user_pages_locked() to stop passing tsk/mm so they will no longer see the warnings. Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
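Roughly, a conversion in this sweep looked like the following (a sketch with placeholder argument values; the gup API of that era took separate write/force flags):

    /* before: caller passes the task and mm explicitly */
    ret = get_user_pages(current, current->mm, addr, 1, 0, 0, &page, NULL);

    /* after: current and current->mm are implied */
    ret = get_user_pages(addr, 1, 0, 0, &page, NULL);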
2016-02-05  mempolicy: do not try to queue pages from !vma_migratable()  [Kirill A. Shutemov]  (1 file, -9/+5)
Maybe I'm missing some point, but I don't see a reason why we try to queue pages from non-migratable VMAs. This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <numaif.h>

    #define SIZE 0x2000

    int foo;

    int main()
    {
        int fd;
        char *p;
        unsigned long mask = 2;

        fd = open("/dev/sg0", O_RDWR);
        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        /* Faultin pages */
        foo = p[0] + p[0x1000];
        mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
        return 0;
    }

The only case when we can queue pages from such a VMA is MPOL_MF_STRICT plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, for a VMA which has pages on the LRU but whose gfp mask is not suitable for migration (see the mapping_gfp_mask() check in vma_migratable()). That looks like a bug to me. Let's filter out non-migratable VMAs at the start of queue_pages_test_walk() and go to queue_pages_pte_range() only if the MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dmitry Vyukov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-15  mm: mempolicy: skip non-migratable VMAs when setting MPOL_MF_LAZY  [Liang Chen]  (1 file, -1/+2)
MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and VM_HUGETLB VMAs, and avoid useless overhead of minor faults. Signed-off-by: Liang Chen <[email protected]> Signed-off-by: Gavin Guo <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-15  migrate_pages: try to split pages on queuing  [Kirill A. Shutemov]  (1 file, -4/+38)
We are not able to migrate THPs. It means it's not enough to split only PMD on migration -- we need to split compound page under it too. Signed-off-by: Kirill A. Shutemov <[email protected]> Tested-by: Aneesh Kumar K.V <[email protected]> Acked-by: Jerome Marchand <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Steve Capper <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-15  thp: rename split_huge_page_pmd() to split_huge_pmd()  [Kirill A. Shutemov]  (1 file, -1/+1)
We are going to decouple splitting THP PMD from splitting underlying compound page. This patch renames split_huge_page_pmd*() functions to split_huge_pmd*() to reflect the fact that it doesn't imply page splitting, only PMD. Signed-off-by: Kirill A. Shutemov <[email protected]> Tested-by: Sasha Levin <[email protected]> Tested-by: Aneesh Kumar K.V <[email protected]> Acked-by: Jerome Marchand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Steve Capper <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/mempolicy.c: convert the shared_policy lock to a rwlock  [Nathan Zimmer]  (1 file, -13/+17)
When running the SPECint_rate gcc on some very large boxes it was noticed that the system was spending lots of time in mpol_shared_policy_lookup(). The gamess benchmark can also show it, and is what I mostly used to chase down the issue, since I found its setup to be easier. To be clear, the binaries were on tmpfs because of disk I/O requirements. We then used text replication to avoid icache misses and having all the copies banging on the memory where the instruction code resides. This results in us hitting a bottleneck in mpol_shared_policy_lookup() since the lookup is serialised by the shared_policy lock. I have only reproduced this on very large (3k+ cores) boxes. The problem starts showing up at just a few hundred ranks, getting worse until it threatens to livelock once it gets large enough. For example, on the gamess benchmark at 128 ranks this area consumes only ~1% of time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over 90%. To alleviate the contention in this area I converted the spinlock to an rwlock. This allows a large number of lookups to happen simultaneously. The results were quite good, reducing this consumption at max ranks to around 2%. [[email protected]: tidy up code comments] Signed-off-by: Nathan Zimmer <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Nadia Yvette Chambers <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
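The shape of the conversion described above, sketched (not the exact diff): lookups take the lock for reading so they can proceed in parallel, while updates still take it exclusively.

    rwlock_t lock;                      /* was: spinlock_t */

    /* mpol_shared_policy_lookup() - hot path, many concurrent readers */
    read_lock(&sp->lock);
    /* ... rb-tree lookup of the shared policy covering the range ... */
    read_unlock(&sp->lock);

    /* insert/replace path - still exclusive */
    write_lock(&sp->lock);
    /* ... modify the tree ... */
    write_unlock(&sp->lock);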
2015-09-08  mm: rename alloc_pages_exact_node() to __alloc_pages_node()  [Vlastimil Babka]  (1 file, -2/+2)
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Robin Holt <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Michael Ellerman <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Cliff Whickman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
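A sketch of the relationship after the rename (the authoritative definitions live in include/linux/gfp.h; the wrapper name below is hypothetical):

    /* convenience wrapper: tolerates NUMA_NO_NODE by falling back to the local node */
    static inline struct page *alloc_pages_node_sketch(int nid, gfp_t gfp_mask,
                                                       unsigned int order)
    {
        if (nid == NUMA_NO_NODE)
            nid = numa_mem_id();

        /* __alloc_pages_node() expects a valid nid; note the node is still only
         * preferred unless __GFP_THISNODE is in gfp_mask */
        return __alloc_pages_node(nid, gfp_mask, order);
    }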
2015-09-08  mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in ...  [Aristeu Rozanski]  (1 file, -3/+0)
queue_pages_range() This check was introduced as part of 6f4576e3687 ("mempolicy: apply page table walker on queue_pages_range()") which got duplicated by 48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)") by reintroducing it earlier on queue_page_test_walk() Signed-off-by: Aristeu Rozanski <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Acked-by: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-04  userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx  [Andrea Arcangeli]  (1 file, -2/+2)
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware about so that we can merge vmas back like they were originally before arming the userfaultfd on some memory range. Signed-off-by: Andrea Arcangeli <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Sanidhya Kashyap <[email protected]> Cc: [email protected] Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andres Lagar-Cavilla <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Peter Feiner <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Huangpeng (Peter)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-06-24  mm, thp: respect MPOL_PREFERRED policy with non-local node  [Vlastimil Babka]  (1 file, -16/+22)
Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node"), we handle THP allocations on page fault in a special way - for non-interleave memory policies, the allocation is only attempted on the node local to the current CPU, if the policy's nodemask allows the node. This is motivated by the assumption that THP benefits cannot offset the cost of remote accesses, so it's better to fallback to base pages on the local node (which might still be available, while huge pages are not due to fragmentation) than to allocate huge pages on a remote node. The nodemask check prevents us from violating e.g. MPOL_BIND policies where the local node is not among the allowed nodes. However, the current implementation can still give surprising results for the MPOL_PREFERRED policy when the preferred node is different than the current CPU's local node. In such case we should honor the preferred node and not use the local node, which is what this patch does. If hugepage allocation on the preferred node fails, we fall back to base pages and don't try other nodes, with the same motivation as is done for the local node hugepage allocations. The patch also moves the MPOL_INTERLEAVE check around to simplify the hugepage specific test. The difference can be demonstrated using in-tree transhuge-stress test on the following 2-node machine where half memory on one node was occupied to show the difference. > numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35 node 0 size: 7878 MB node 0 free: 3623 MB node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47 node 1 size: 8045 MB node 1 free: 7818 MB node distances: node 0 1 0: 10 21 1: 21 10 Before the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages Number of successful THP allocations corresponds to free memory on node 0 in the first case and node 1 in the second case, i.e. -p parameter is ignored and cpu binding "wins". After the patch: > numactl -p0 -C0 ./transhuge-stress transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages > numactl -p0 -C12 ./transhuge-stress transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -p1 -C0 ./transhuge-stress transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages The -p parameter is respected regardless of cpu binding. > numactl -C0 ./transhuge-stress transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages > numactl -C12 ./transhuge-stress transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages Without -p parameter, hugepage restriction to CPU-local node works as before. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Signed-off-by: Vlastimil Babka <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Kirill A. 
Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> [4.0+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-05-14  mm, numa: really disable NUMA balancing by default on single node machines  [Mel Gorman]  (1 file, -1/+1)
NUMA balancing is meant to be disabled by default on UMA machines but the check is using nr_node_ids (highest node) instead of num_online_nodes (online nodes). The consequences are that a UMA machine with a node ID of 1 or higher will enable NUMA balancing. This will incur useless overhead due to minor faults with the impact depending on the workload. These are the impact on the stats when running a kernel build on a single node machine whose node ID happened to be 1: vanilla patched NUMA base PTE updates 5113158 0 NUMA huge PMD updates 643 0 NUMA page range updates 5442374 0 NUMA hint faults 2109622 0 NUMA hint local faults 2109622 0 NUMA hint local percent 100 100 NUMA pages migrated 0 0 Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: <[email protected]> [3.8+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-04-14  mm, thp: really limit transparent hugepage allocation to local node  [David Rientjes]  (1 file, -1/+2)
Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") restructured alloc_hugepage_vma() with the intent of only allocating transparent hugepages locally when there was not an effective interleave mempolicy. alloc_pages_exact_node() does not limit the allocation to the single node, however, but rather prefers it. This is because __GFP_THISNODE is not set which would cause the node-local nodemask to be passed. Without it, only a nodemask that prefers the local node is passed. Fix this by passing __GFP_THISNODE and falling back to small pages when the allocation fails. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") suffers from a similar problem for khugepaged, which is also fixed. Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node") Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Jarno Rajahalme <[email protected]> Cc: Li Zefan <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
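The gist of the fix as described, sketched with placeholder variables (per the changelog, not the verbatim hunk): without __GFP_THISNODE the node is merely preferred, so the hugepage allocation must add it and let the caller fall back to base pages itself.

    /* before: nid is only preferred; hugepages may silently come from remote nodes */
    page = alloc_pages_exact_node(nid, GFP_TRANSHUGE, HPAGE_PMD_ORDER);

    /* after: hard-restrict to nid; on failure, fall back to small pages instead */
    page = alloc_pages_exact_node(nid, GFP_TRANSHUGE | __GFP_THISNODE, HPAGE_PMD_ORDER);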
2015-04-14  mm, mempolicy: migrate_to_node should only migrate to node  [David Rientjes]  (1 file, -1/+2)
migrate_to_node() is intended to migrate a page from one source node to a target node. Today, migrate_to_node() could end up migrating to any node, not only the target node. This is because the page migration allocator, new_node_page() does not pass __GFP_THISNODE to alloc_pages_exact_node(). This causes the target node to be preferred but allows fallback to any other node in order of affinity. Prevent this by allocating with __GFP_THISNODE. If memory is not available, -ENOMEM will be returned as appropriate. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-13  mm: use %*pb[l] to print bitmaps including cpumasks and nodemasks  [Tejun Heo]  (1 file, -4/+3)
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
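For example, a nodemask can now be printed without an intermediate buffer (a minimal illustration; the width and field arguments are supplied by the helper macro):

    nodemask_t nodes = NODE_MASK_NONE;

    node_set(0, nodes);
    node_set(2, nodes);
    pr_info("mems_allowed=%*pbl\n", nodemask_pr_args(&nodes));   /* prints "mems_allowed=0,2" */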
2015-02-12  mm: convert p[te|md]_mknonnuma and remaining page table manipulations  [Mel Gorman]  (1 file, -1/+1)
With PROT_NONE, the traditional page table manipulation functions are sufficient. [[email protected]: fix compiler warning in pmdp_invalidate()] [[email protected]: fix build with STRICT_MM_TYPECHECKS] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Linus Torvalds <[email protected]> Acked-by: Aneesh Kumar <[email protected]> Tested-by: Sasha Levin <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Dave Jones <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill Shutemov <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11  mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)  [Naoya Horiguchi]  (1 file, -0/+3)
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). For example for pagemap_read(), when no callbacks are called against VM_PFNMAP vma, pagemap_read() may prepare pagemap data for next virtual address range at wrong index. That could confuse and/or break userspace applications. This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows: - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole() over vma(VM_PFNMAP), - for clear_refs and queue_pages which have their own ->tests_walk, just return 1 and skip vma(VM_PFNMAP). This is no problem because these are not interested in hole regions, - for other callers, just skip the vma(VM_PFNMAP) as a default behavior. Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Shiraz Hashim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11  mempolicy: apply page table walker on queue_pages_range()  [Naoya Horiguchi]  (1 file, -136/+92)
queue_pages_range() does page table walking in its own way now, but there is some code duplicate. This patch applies page table walker to reduce lines of code. queue_pages_range() has to do some precheck to determine whether we really walk over the vma or just skip it. Now we have test_walk() callback in mm_walk for this purpose, so we can do this replacement cleanly. queue_pages_test_walk() depends on not only the current vma but also the previous one, so queue_pages->prev is introduced to remember it. Signed-off-by: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
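The walker-side contract, sketched (illustrative shape only; the skip condition below is just an example, and the real queue_pages_test_walk() also handles the mbind flags and the ->prev bookkeeping mentioned above):

    static int queue_pages_test_walk(unsigned long start, unsigned long end,
                                     struct mm_walk *walk)
    {
        struct vm_area_struct *vma = walk->vma;

        if (vma->vm_flags & VM_PFNMAP)
            return 1;   /* positive: skip this vma but continue the walk */

        return 0;       /* zero: walk this vma's page tables */
    }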
2015-02-11  mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma  [Vlastimil Babka]  (1 file, -85/+33)
The previous commit ("mm/thp: Allocate transparent hugepages on local node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a special policy for THP allocations. The function has the same interface as alloc_pages_vma(), and shares a lot of boilerplate code and a long comment. This patch merges the hugepage special case into alloc_pages_vma(). The extra if condition should be a cheap enough price to pay. We also prevent a (however unlikely) race with a parallel mems_allowed update, which could make the hugepage allocation restart only within the fallback call to alloc_hugepage_vma() and not reconsider the special rule in alloc_hugepage_vma(). Also, by making sure mpol_cond_put(pol) is always called before the actual allocation attempt, we can use a single exit path within the function. Also update the comment for the missing node parameter and the obsolete reference to mm_sem. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11  mm/thp: allocate transparent hugepages on local node  [Aneesh Kumar K.V]  (1 file, -0/+72)
This makes sure that we try to allocate hugepages from the local node if allowed by the mempolicy. If we can't, we fall back to small page allocation based on the mempolicy. This is based on the observation that allocating pages on the local node is more beneficial than allocating hugepages on a remote node. With this patch applied we may see transparent huge page allocation failures if the current node doesn't have enough free hugepages. Before this patch such failures resulted in us retrying the allocation on other nodes in the NUMA node mask. [[email protected]: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-19  Merge branch 'for-linus' of ...  [Linus Torvalds]  (1 file, -5/+0)
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #3 from Al Viro: "Assorted fixes and patches from the last cycle" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [regression] chunk lost from bd9b51 vfs: make mounts and mountstats honor root dir like mountinfo does vfs: cleanup show_mountinfo init: fix read-write root mount unfuck binfmt_misc.c (broken by commit e6084d4) vm_area_operations: kill ->migrate() new helper: iter_is_iovec() move_extent_per_page(): get rid of unused w_flags lustre: get rid of playing with ->fs btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
2014-12-18  mm/mempolicy.c: remove unnecessary is_valid_nodemask()  [Zhihui Zhang]  (1 file, -8/+2)
When nodes is true, nsc->mask2 has already been filtered by nsc->mask1, which has already factored in node_states[N_MEMORY]. Signed-off-by: Zhihui Zhang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-17  vm_area_operations: kill ->migrate()  [Al Viro]  (1 file, -5/+0)
the only instance this method has ever grown was one in kernfs - one that calls ->migrate() of another vm_ops if it exists. Signed-off-by: Al Viro <[email protected]>
2014-10-09  mm: mempolicy: skip inaccessible VMAs when setting MPOL_MF_LAZY  [Mel Gorman]  (1 file, -1/+3)
PROT_NUMA VMAs are skipped to avoid problems distinguishing between present, prot_none and special entries. MPOL_MF_LAZY is not visible from userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the same way task_numa_work does. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: unexport get_vma_policy() and remove its "task" arg  [Oleg Nesterov]  (1 file, -11/+8)
- get_vma_policy(task) is not safe if task != current, remove this argument. - get_vma_policy() no longer has callers outside of mempolicy.c, make it static. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem)  [Oleg Nesterov]  (1 file, -14/+1)
Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic was never correct and it is no longer needed, see the previous patch. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: introduce __get_vma_policy(), export get_task_policy()  [Oleg Nesterov]  (1 file, -18/+26)
Extract the code which looks for vma's policy from get_vma_policy() into the new helper, __get_vma_policy(). Export get_task_policy(). Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: remove the "task" arg of vma_policy_mof() and simplify it  [Oleg Nesterov]  (1 file, -14/+11)
1. vma_policy_mof(task) is simply not safe unless task == current, it can race with do_exit()->mpol_put(). Remove this arg and update its single caller. 2. vma can not be NULL, remove this check and simplify the code. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: sanitize the usage of get_task_policy()  [Oleg Nesterov]  (1 file, -11/+14)
Cleanup + preparation. Every user of get_task_policy() calls it unconditionally, even if it is not going to use the result. get_task_policy() is cheap but still this does not look clean, plus the code looks simpler if get_task_policy() is called only when this is really needed. Note: I hope this is correct, but it is not clear why vma_policy_mof() doesn't fall back to get_task_policy() if ->get_policy() returns NULL. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-09  mempolicy: change get_task_policy() to return default_policy rather than NULL  [Oleg Nesterov]  (1 file, -18/+13)
Every caller of get_task_policy() falls back to default_policy if it returns NULL. Change get_task_policy() to do this. Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
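The caller-side simplification this enables, sketched:

    /* before: every caller open-coded the fallback */
    pol = get_task_policy(current);
    if (!pol)
        pol = &default_policy;

    /* after: the helper itself returns &default_policy when the task has no policy */
    pol = get_task_policy(current);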
2014-10-09  mempolicy: change alloc_pages_vma() to use mpol_cond_put()  [Oleg Nesterov]  (1 file, -2/+1)
Trivial cleanup. alloc_pages_vma() can use mpol_cond_put(). Signed-off-by: Oleg Nesterov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-07-10  Merge branch 'for-3.16-fixes' of ...  [Linus Torvalds]  (1 file, -2/+0)
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children()
2014-06-25  cpuset,mempolicy: fix sleeping function called from invalid context  [Gu Zheng]  (1 file, -2/+0)
When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <[email protected]> Acked-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: stable <[email protected]>
2014-06-23  mm: fix crashes from mbind() merging vmas  [Hugh Dickins]  (1 file, -26/+20)
In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced vma merging to mbind(), but it should have also changed the convention of passing start vma from queue_pages_range() (formerly check_range()) to new_vma_page(): vma merging may have already freed that structure, resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes. Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") Reported-by: Naoya Horiguchi <[email protected]> Tested-by: Naoya Horiguchi <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: <[email protected]> [2.6.34+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-08  Merge branch 'next' (accumulated 3.16 merge window patches) into master  [Linus Torvalds]  (1 file, -16/+19)
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-06  mm: convert some level-less printks to pr_*  [Mitchel Humpherys]  (1 file, -1/+4)
printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
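The kind of change involved, sketched with a made-up message string and prefix for illustration:

    /* at the top of the file, so every pr_* call gets a consistent prefix */
    #define pr_fmt(fmt) "mempolicy: " fmt

    /* before: no log level */
    printk("something went wrong\n");

    /* after: explicit level via the pr_* macro */
    pr_err("something went wrong\n");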
2014-06-06  mm: add !pte_present() check on existing hugetlb_entry callbacks  [Naoya Horiguchi]  (1 file, -1/+5)
The page table walker doesn't check for non-present hugetlb entries in the common path, so hugetlb_entry() callbacks must check for them. The reason for this behavior is that some callers want to handle them in their own way. [ I think that reason is bogus, btw - it should just do what the regular code does, which is to call the "pte_hole()" function for such hugetlb entries - Linus] However, some callers don't check it now, which causes unpredictable results, for example when we have a race between migrating a hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present checks on the buggy callbacks. This bug has existed for years and became visible with the introduction of hugepage migration. ChangeLog v2: - fix if condition (check !pte_present() instead of pte_present()) Reported-by: Sasha Levin <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: <[email protected]> [3.12+] Signed-off-by: Andrew Morton <[email protected]> [ Backported to 3.15. Signed-off-by: Josh Boyer <[email protected]> ] Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  mm, migration: add destination page freeing callback  [David Rientjes]  (1 file, -2/+2)
Memory migration uses a callback defined by the caller to determine how to allocate destination pages. When migration fails for a source page, however, it frees the destination page back to the system. This patch adds a memory migration callback defined by the caller to determine how to free destination pages. If a caller, such as memory compaction, builds its own freelist for migration targets, this can reuse already freed memory instead of scanning additional memory. If the caller provides a function to handle freeing of destination pages, it is called when page migration fails. If the caller passes NULL then freeing back to the system will be handled as usual. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
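Sketch of the resulting interface (typedef and argument names are from memory and may not match the tree exactly):

    typedef struct page *new_page_t(struct page *page, unsigned long private, int **reason);
    typedef void free_page_t(struct page *page, unsigned long private);

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);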
2014-06-04  mm/mempolicy.c: parameter doc uniformization  [Fabian Frederick]  (1 file, -11/+11)
Also fixes kernel-doc warning Signed-off-by: Fabian Frederick <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  mm: constify nmask argument to set_mempolicy()  [Rasmus Villemoes]  (1 file, -1/+1)
The nmask argument to set_mempolicy() is const according to the user-space header numaif.h, and since the kernel does indeed not modify it, it might as well be declared const in the kernel. Signed-off-by: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  mm: constify nmask argument to mbind()  [Rasmus Villemoes]  (1 file, -1/+1)
The nmask argument to mbind() is const according to the userspace header numaif.h, and since the kernel does indeed not modify it, it might as well be declared const in the kernel. Signed-off-by: Rasmus Villemoes <[email protected]> Acked-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
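After both changes, the in-kernel syscall prototypes carry the const qualifier (a sketch; see include/linux/syscalls.h for the authoritative declarations):

    asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
                                      unsigned long maxnode);
    asmlinkage long sys_mbind(unsigned long start, unsigned long len,
                              unsigned long mode,
                              const unsigned long __user *nmask,
                              unsigned long maxnode, unsigned flags);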
2014-04-07  mm, mempolicy: remove per-process flag  [David Rientjes]  (1 file, -31/+0)
PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users. There's no significant performance degradation to checking current->mempolicy rather than current->flags & PF_MEMPOLICY in the allocation path, especially since this is considered unlikely(). Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with 64GB of memory and without a mempolicy:

    threads    before     after
    16         1249409    1244487
    32         1281786    1246783
    48         1239175    1239138
    64         1244642    1241841
    80         1244346    1248918
    96         1266436    1254316
    112        1307398    1312135
    128        1327607    1326502

Per-process flags are a scarce resource so we should free them up whenever possible and make them available. We'll be using it shortly for memcg oom reserves. Signed-off-by: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Jianguo Wu <[email protected]> Cc: Tim Hockin <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users. There's no significant performance degradation to checking current->mempolicy rather than current->flags & PF_MEMPOLICY in the allocation path, especially since this is considered unlikely(). Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with 64GB of memory and without a mempolicy: threads before after 16 1249409 1244487 32 1281786 1246783 48 1239175 1239138 64 1244642 1241841 80 1244346 1248918 96 1266436 1254316 112 1307398 1312135 128 1327607 1326502 Per-process flags are a scarce resource so we should free them up whenever possible and make them available. We'll be using it shortly for memcg oom reserves. Signed-off-by: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Jianguo Wu <[email protected]> Cc: Tim Hockin <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>