Age | Commit message | Author | Files, lines changed
2015-09-08 | memcg: move memcg_proto_active from sock.h | Michal Hocko | 2 files, -6/+1
The only user is sock_update_memcg, which lives in memcontrol.c, so it doesn't make much sense to pollute sock.h with this inline helper. Move it to memcontrol.c and open code it into its only caller. Signed-off-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memcg, tcp_kmem: check for cg_proto in sock_update_memcg | Michal Hocko | 1 file, -2/+1
sk_prot->proto_cgroup is allowed to return NULL, but sock_update_memcg doesn't check for NULL. The function relies on the mem_cgroup_is_root check: we shouldn't get NULL otherwise, since mem_cgroup_from_task always returns a non-NULL memcg. All other callers check for NULL, and we can safely replace the mem_cgroup_is_root() check with cg_proto != NULL, which is more straightforward (proto_cgroup already returns NULL for the root memcg). Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memcg: restructure mem_cgroup_can_attach() | Tejun Heo | 1 file, -29/+32
Restructure it to lower the nesting level and to help the planned threadgroup leader iteration changes. This is pure reorganization. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memcg: get rid of extern for functions in memcontrol.h | Michal Hocko | 1 file, -8/+8
Most of the exported functions in this header are not marked extern, so change the rest to follow the same style. Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG | Michal Hocko | 1 file, -2/+0
The only user is cgwb_bdi_init, and that one depends on CONFIG_CGROUP_WRITEBACK, which in turn depends on CONFIG_MEMCG, so it doesn't make much sense to define an empty stub for !CONFIG_MEMCG. Moreover, ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if used in unguarded code paths. Better to fail during compilation. Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memcg: export struct mem_cgroup | Michal Hocko | 7 files, -378/+351
The mem_cgroup structure is currently defined in mm/memcontrol.c, which means that code outside of this file has to use the external API even for trivial access. This patch exports struct mem_cgroup with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y):

        text     data      bss       dec     hex  filename
    12355346  1823792  1089536  15268674  e8fb42  vmlinux.before
    12354970  1823792  1089536  15268298  e8f9ca  vmlinux.after

This is not much (376B) but better than nothing. We also save a function call in some hot paths, like callers of mem_cgroup_count_vm_event, which is used for accounting. The patch doesn't introduce any functional changes. [[email protected]: inline memcg_kmem_is_active] [[email protected]: do not expose type outside of CONFIG_MEMCG] [[email protected]: memcontrol.h needs eventfd.h for eventfd_ctx] [[email protected]: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | sparc32: do not include swap.h from pgtable_32.h | Michal Hocko | 1 file, -1/+1
"memcg: export struct mem_cgroup" will add includes into linux/memcontrol.h which lead to further header dependency issues, as reported by Guenter Roeck:

    In file included from include/linux/highmem.h:7:0,
                     from include/linux/bio.h:23,
                     from include/linux/writeback.h:192,
                     from include/linux/memcontrol.h:30,
                     from include/linux/swap.h:8,
                     from ./arch/sparc/include/asm/pgtable_32.h:17,
                     from ./arch/sparc/include/asm/pgtable.h:6,
                     from arch/sparc/kernel/traps_32.c:23:
    include/linux/mm.h: In function 'is_vmalloc_addr':
    include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function)
    include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in
    include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function)
    include/linux/mm.h: In function 'maybe_mkwrite':
    include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite'

The issue is that pgtable_32.h depends on swap.h to get swp_entry_t, but that goes all the way down to linux/mm.h, which wants to have VMALLOC_*, which is only defined later in pgtable_32.h. swp_entry_t is defined in linux/mm_types.h, so it should be sufficient to include this header without more dependencies. Signed-off-by: Michal Hocko <[email protected]> Reported-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm/dmapool: allow NULL `pool' pointer in dma_pool_destroy() | Sergey Senozhatsky | 1 file, -0/+3
dma_pool_destroy() does not tolerate a NULL dma_pool pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all dma_pool_destroy() callers to do a NULL check

    if (pool)
        dma_pool_destroy(pool);

or otherwise be invalid dma_pool_destroy() users. Tweak dma_pool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm/mempool: allow NULL `pool' pointer in mempool_destroy() | Sergey Senozhatsky | 1 file, -0/+3
mempool_destroy() does not tolerate a NULL mempool_t pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all mempool_destroy() callers to do a NULL check

    if (pool)
        mempool_destroy(pool);

or otherwise be invalid mempool_destroy() users. Tweak mempool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm/slab_common: allow NULL cache pointer in kmem_cache_destroy() | Sergey Senozhatsky | 1 file, -0/+3
kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check

    if (cache)
        kmem_cache_destroy(cache);

or otherwise be invalid kmem_cache_destroy() users. Tweak kmem_cache_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
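All three patches add the same kind of early-return guard to the destroy path. A minimal sketch of the pattern, using kmem_cache_destroy() (the exact guard style in the tree may differ):

    void kmem_cache_destroy(struct kmem_cache *s)
    {
        if (unlikely(!s))
            return;    /* silently tolerate NULL, like kfree() */

        /* ... existing destruction path, unchanged ... */
    }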
2015-09-08 | mm, oom: remove unnecessary variable | David Rientjes | 1 file, -13/+8
The "killed" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm, oom: add description of struct oom_control | David Rientjes | 1 file, -3/+17
Describe the purpose of struct oom_control and what each member does. Also make gfp_mask and order const since they are never manipulated or passed to functions that discard the qualifier. Signed-off-by: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
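For reference, a sketch of the struct as it stands after this series (field comments paraphrased from the series' descriptions, not the exact kernel-doc):

    struct oom_control {
        struct zonelist *zonelist;  /* zonelist of the triggering allocation, if any */
        nodemask_t      *nodemask;  /* allocation constraint, NULL for all nodes */
        const gfp_t     gfp_mask;   /* gfp mask of the allocation that oom'd */
        const int       order;      /* order of the allocation; -1 means sysrq */
    };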
2015-09-08 | mm, oom: do not panic for oom kills triggered from sysrq | David Rientjes | 2 files, -3/+7
Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <[email protected]> Suggested-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm, oom: pass an oom order of -1 when triggered by sysrq | David Rientjes | 5 files, -8/+3
The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm, oom: organize oom context into struct | David Rientjes | 5 files, -80/+98
There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm: make set_recommended_min_free_kbytes() return void | Nicholas Krause | 1 file, -2/+1
This makes set_recommended_min_free_kbytes() have a return type of void as it cannot fail. Signed-off-by: Nicholas Krause <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm: improve __GFP_NORETRY comment based on implementation | David Rientjes | 1 file, -1/+4
Explicitly state that __GFP_NORETRY will attempt direct reclaim and memory compaction before returning NULL and that the oom killer is not called in the current implementation of the page allocator. [[email protected]: s/has/have/] Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
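For callers that prefer fast failure over oom-killing, a usage sketch (the fallback helper is hypothetical):

    /* single direct-reclaim/compaction attempt; no retries, no OOM killer */
    page = alloc_pages(GFP_KERNEL | __GFP_NORETRY, order);
    if (!page)
        return do_fallback();    /* hypothetical smaller-order fallback */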
2015-09-08 | fs: do not prefault sys_write() user buffer pages | Dave Hansen | 1 file, -16/+18
=== Short summary ===

iov_iter_fault_in_readable() works around a really rare case and we can avoid the deadlock it addresses in another way: disable page faults and work around copy failures by faulting after the copy in a slow path instead of before in a hot one. I have a little microbenchmark that does repeated, small writes to tmpfs. This patch speeds that micro up by 6.2%.

=== Long version ===

When doing a sys_write() we have a source buffer in userspace and then a target file page. If both of those are the same physical page, there is a potential deadlock that we avoid. It would happen something like this:

1. We start the write to the file
2. Allocate page cache page and set it !Uptodate
3. Touch the userspace buffer to copy in the user data
4. Page fault (since the source of the write is not yet mapped)
5. Page fault code tries to lock the page and deadlocks (more details on this below)

To avoid this, we prefault the page to guarantee that this fault does not occur. But this prefault comes at a cost. It is one of the most expensive things that we do in a hot write() path (especially if we compare it to the read path). It is working around a pretty rare case.

To fix this, it's pretty simple: we move the "prefault" code to run after we attempt the copy. We explicitly disable page faults _during_ the copy, detect the copy failure, then execute the "prefault" outside of where the page lock needs to be held.

iov_iter_copy_from_user_atomic() actually already has an implicit pagefault_disable() inside of it (at least on x86), but we add an explicit one. I don't think we can depend on every kmap_atomic() implementation to pagefault_disable() for eternity.

===================================================

The stack trace when this happens looks like this:

    wait_on_page_bit_killable+0xc0/0xd0
    __lock_page_or_retry+0x84/0xa0
    filemap_fault+0x1ed/0x3d0
    __do_fault+0x41/0xc0
    handle_mm_fault+0x9bb/0x1210
    __do_page_fault+0x17f/0x3d0
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30
    generic_perform_write+0xca/0x1a0
    __generic_file_write_iter+0x190/0x1f0
    ext4_file_write_iter+0xe9/0x460
    __vfs_write+0xaa/0xe0
    vfs_write+0xa6/0x1a0
    SyS_write+0x46/0xa0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    0xffffffffffffffff

(Note, this does *NOT* happen in practice today because kmap_atomic() does a pagefault_disable(). The trace above was obtained by taking out the pagefault_disable().)

You can trigger the deadlock with this little code snippet:

    fd = open("foo", O_RDWR);
    fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
    write(fd, &fdmap[0], 1);

Signed-off-by: Dave Hansen <[email protected]> Cc: Al Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Tejun Heo <[email protected]> Cc: NeilBrown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Paul Cassella <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
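The reshaped hot path, as a conceptual sketch in C (heavily simplified from generic_perform_write(); variable names follow the kernel's, but this is not the exact upstream diff):

    again:
        /* ->write_begin() allocates and locks the page cache page */
        pagefault_disable();    /* explicit; no longer relying on kmap_atomic() */
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        pagefault_enable();
        /* ->write_end() unlocks the page */

        if (unlikely(copied == 0)) {
            /*
             * The atomic copy faulted while the page was locked. Do the
             * old "prefault" now, outside the page lock, then retry.
             */
            if (iov_iter_fault_in_readable(i, bytes))
                return -EFAULT;
            goto again;
        }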
2015-09-08 | mm: /proc/pid/smaps: show proportional swap share of the mapping | Minchan Kim | 4 files, -7/+77
We want to know the per-process workingset size for smart memory management on userland, and we use swap (e.g., zram) heavily to maximize memory efficiency, so the workingset includes swap as well as RSS. On such a system, if there is lots of shared anonymous memory (e.g., Android), it's really hard to figure out exactly how much memory (i.e., rss + swap) each process consumes. This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get a more exact workingset size per process. Like Pss, SwapPss charges each process a proportional share of each swapped-out page: its size divided by the number of processes referencing it. Bongkyu tested it; the results are below.

1. 50M used swap

    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236

    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

2. 240M used swap

    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315

    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744

[[email protected]: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <[email protected]> Reported-by: Bongkyu Kim <[email protected]> Tested-by: Bongkyu Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jerome Marchand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memtest: remove unused header files | Vladimir Murzin | 1 file, -5/+0
memtest does not require these headers to be included. Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | memtest: cleanup log messages | Vladimir Murzin | 1 file, -9/+5
- prefer pr_info(...) to printk(KERN_INFO ...)
- use %pa for phys_addr_t
- use cpu_to_be64 while printing the pattern in reserve_bad_mem()

Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
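For illustration, a hedged sketch of what such a cleaned-up message might look like (the format string here is an assumption, not quoted from the patch):

    /* %pa dereferences a pointer to phys_addr_t; pattern printed big-endian */
    pr_info("  %016llx bad mem addr %pa - %pa reserved\n",
            (unsigned long long)cpu_to_be64(pattern), &start, &end);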
2015-09-08 | memtest: use kstrtouint instead of simple_strtoul | Vladimir Murzin | 1 file, -3/+5
Since simple_strtoul is obsolete and memtest_pattern is of type int, use kstrtouint instead. Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
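A minimal sketch of the call shape (the variable handling is illustrative, not the exact patch):

    unsigned int pattern;

    if (kstrtouint(arg, 0, &pattern))   /* base 0 auto-detects 0x/0 prefixes */
        return -EINVAL;                 /* rejects trailing garbage, unlike simple_strtoul */
    memtest_pattern = pattern;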
2015-09-08 | pagemap: update documentation | Konstantin Khlebnikov | 1 file, -2/+12
Notes about recent changes. [[email protected]: various tweaks] Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Mark Williamson <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | pagemap: add mmap-exclusive bit for marking pages mapped only here | Konstantin Khlebnikov | 3 files, -2/+25
This patch sets bit 56 in pagemap if this page is mapped only once. It allows detecting exclusively used pages without exposing the PFN:

    present  file  exclusive  state
    0        0     0          non-present
    1        1     0          file page mapped somewhere else
    1        1     1          file page mapped only here
    1        0     0          anon non-CoWed page (shared with parent/child)
    1        0     1          anon CoWed page (or never forked)

CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context. The mmap-exclusive bit doesn't reflect potential page-sharing via the swapcache: a page could be mapped once but have several swap-ptes which point to it. An application could detect that via the swap bit in the pagemap entry and touch that pte via /proc/pid/mem to get real information. See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com Requested by Mark Williamson. [[email protected]: fix spello] Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
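A small userspace sketch of how a consumer might read the new bit for one of its own pages (bit positions per this series: 63 = present, 56 = exclusive; the program itself is illustrative):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        char *buf = malloc(psz);
        uint64_t ent;
        int fd;

        if (!buf)
            return 1;
        buf[0] = 1;    /* fault in one private anonymous page */
        fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0 || pread(fd, &ent, sizeof(ent),
                            (uintptr_t)buf / psz * sizeof(ent)) != sizeof(ent))
            return 1;
        /* a freshly touched, never-forked anon page should be present + exclusive */
        printf("present=%d exclusive=%d\n",
               (int)((ent >> 63) & 1), (int)((ent >> 56) & 1));
        close(fd);
        return 0;
    }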
2015-09-08 | pagemap: hide physical addresses from non-privileged users | Konstantin Khlebnikov | 1 file, -11/+14
This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all. See http://lkml.kernel.org/r/[email protected] Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | pagemap: rework hugetlb and thp report | Konstantin Khlebnikov | 1 file, -56/+44
This patch moves pmd dissection out of the reporting loop: huge pages are reported as a bunch of normal pages with contiguous PFNs. It also adds the missing "FILE" bit in hugetlb vmas. Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | pagemap: switch to the new format and do some cleanup | Konstantin Khlebnikov | 2 files, -114/+61
This patch removes the page-shift bits (scheduled for removal since 3.11) and completes the migration to the new bit layout. It also cleans up a messy macro. Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | pagemap: check permissions and capabilities at open time | Konstantin Khlebnikov | 1 file, -20/+28
This patchset makes pagemap usable again in a safe way (after the rowhammer bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged users but hides PFNs from them. It also adds a 'map-exclusive' bit which is set if the page is mapped only here: this helps in estimating the working set without exposing PFNs and allows distinguishing CoWed from non-CoWed private anonymous pages. The second patch removes the page-shift bits and completes the migration to the new pagemap format: the soft-dirty and mmap-exclusive flags are available only in the new format.

This patch (of 5): This patch moves the permission checks from pagemap_read() into pagemap_open(). A pointer to the mm is saved in file->private_data. This reference pins only the mm_struct itself. /proc/*/mem, maps and smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
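A sketch of the open-time check, assuming it mirrors what /proc/*/mem already does (the use of proc_mem_open() with PTRACE_MODE_READ is my assumption of the exact helper):

    static int pagemap_open(struct inode *inode, struct file *file)
    {
        struct mm_struct *mm;

        /* the same ptrace-based access check that /proc/pid/mem performs */
        mm = proc_mem_open(inode, PTRACE_MODE_READ);
        if (IS_ERR(mm))
            return PTR_ERR(mm);
        file->private_data = mm;    /* pins the mm_struct, not the address space */
        return 0;
    }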
2015-09-08 | mm: remove put_page_unless_one() | Vineet Gupta | 1 file, -12/+0
It has no callers. Signed-off-by: Vineet Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm/memblock.c: WARN_ON when flags differs from overlap region | Wei Yang | 1 file, -0/+1
Each memblock_region has flags to indicate the type of its range. For the overlap case, memblock_add_range() inserts the lower part and leaves the upper part as indicated in the overlapped region. If the flags of the new range differ from those of the overlapped region, the recorded information is not correct. This patch adds a WARN_ON when the flags of the new range differ from the overlapped region. Signed-off-by: Wei Yang <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
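A sketch of where the check lands in memblock_add_range()'s partial-overlap handling (surrounding context abbreviated):

    /* inside the walk over existing regions, when the new range overlaps @rgn */
    if (rbase > base) {
        /*
         * The new range's lower part is recorded with @flags; warn
         * if they disagree with the overlapped region's flags.
         */
        WARN_ON(flags != rgn->flags);    /* the check this patch adds */
        memblock_insert_region(type, i++, base, rbase - base, nid, flags);
    }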
2015-09-08 | mm/page_alloc.c: remove unused variable in free_area_init_core() | Wei Yang | 1 file, -3/+2
Commit febd5949e134 ("mm/memory hotplug: init the zone's size when calculating node totalpages") refined the function free_area_init_core(). Since that change, two of its parameters are not used anymore; this patch removes them. Signed-off-by: Wei Yang <[email protected]> Cc: Gu Zheng <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm/page_alloc.c: refine the calculation of highest possible node id | Wei Yang | 1 file, -4/+2
nr_node_ids records the highest possible node id, which is calculated by scanning the bitmap node_states[N_POSSIBLE]. The current implementation scans the bitmap from the beginning, which means walking the whole bitmap. This patch reverses the order by scanning from the end with find_last_bit(). Signed-off-by: Wei Yang <[email protected]> Cc: Tejun Heo <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
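A sketch of the reworked helper, reconstructed from the description (not a verbatim copy of the patch):

    static void __init setup_nr_node_ids(void)
    {
        unsigned int highest;

        /* one lookup from the top instead of walking every set bit */
        highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
        nr_node_ids = highest + 1;
    }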
2015-09-08 | mm, dax: use i_mmap_unlock_write() in do_cow_fault() | Kirill A. Shutemov | 1 file, -4/+4
__dax_fault() takes i_mmap_lock for write. Let's pair it with write unlock on do_cow_fault() side. Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | mm: take i_mmap_lock in unmap_mapping_range() for DAX | Kirill A. Shutemov | 2 files, -25/+21
DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap. __dax_pmd_fault() uses unmap_mapping_range() to shoot the zero page out of all mappings. We need to drop i_mmap_lock there to avoid a lock deadlock. Re-acquiring the lock should be fine since we check i_size after that point. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
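A sketch of the drop/re-acquire dance described above (simplified; the surrounding __dax_pmd_fault() context and exact range arithmetic are elided):

    i_mmap_unlock_write(mapping);    /* unmap_mapping_range() takes it itself */
    /* shoot the zero page out of all mappings of this range */
    unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT, PMD_SIZE, 0);
    i_mmap_lock_write(mapping);
    /* i_size is re-checked after this point, so re-acquiring is safe */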
2015-09-08 | dax: use linear_page_index() | Matthew Wilcox | 1 file, -1/+1
I was basically open-coding it (thanks to copying code from do_fault() which probably also needs to be fixed). Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | dax: ensure that zero pages are removed from other processes | Matthew Wilcox | 1 file, -1/+5
If the first access to a huge page was a store, there would be no existing zero pmd in this process's page tables. There could be a zero pmd in another process's page tables, if it had done a load. We can detect this case by noticing that the buffer_head returned from the filesystem is New, and ensure that other processes mapping this huge page have their page tables flushed. Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | dax: don't use set_huge_zero_page() | Kirill A. Shutemov | 3 files, -10/+13
This is another place where DAX assumed that pgtable_t was a pointer. Open code the important parts of set_huge_zero_page() in DAX and make set_huge_zero_page() static again. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | thp: fix zap_huge_pmd() for DAX | Kirill A. Shutemov | 1 file, -40/+31
The original DAX code assumed that pgtable_t was a pointer, which isn't true on all architectures. Restructure the code to not rely on that assumption. [[email protected]: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | thp: decrement refcount on huge zero page if it is split | Kirill A. Shutemov | 1 file, -1/+3
The DAX code neglected to put the refcount on the huge zero page. Also we must notify on splits. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | dax: fix race between simultaneous faults | Matthew Wilcox | 2 files, -19/+25
If two threads write-fault on the same hole at the same time, the winner of the race will return to userspace and complete their store, only to have the loser overwrite their store with zeroes. Fix this for now by taking the i_mmap_sem for write instead of read, and do so outside the call to get_block(). Now the loser of the race will see the block has already been zeroed, and will not zero it again. This severely limits our scalability. I have ideas for improving it, but those can wait for a later patch. Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | ext4: start transaction before calling into DAX | Matthew Wilcox | 1 file, -3/+52
Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jan Kara <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | ext4: add ext4_get_block_dax() | Matthew Wilcox | 3 files, -3/+16
DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
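One plausible shape for the new callback (the flag names are my assumption based on ext4's existing get_blocks flags, not quoted from the patch):

    int ext4_get_block_dax(struct inode *inode, sector_t iblock,
                           struct buffer_head *bh_result, int create)
    {
        /* return unwritten extents, and honour 'create', unlike
         * ext4_get_block_write() / ext4_get_block() respectively */
        int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;

        if (create)
            flags |= EXT4_GET_BLOCKS_CREATE;
        return _ext4_get_block(inode, iblock, bh_result, flags);
    }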
2015-09-08 | dax: improve comment about truncate race | Matthew Wilcox | 1 file, -1/+6
Jan Kara pointed out I should be more explicit here about the perils of racing against truncate. The comment is mostly the same as for the PTE case. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | thp: change insert_pfn's return type to void | Matthew Wilcox | 1 file, -3/+3
It would make more sense to have all the return values from vmf_insert_pfn_pmd() encoded in one place instead of having to follow the convention into insert_pfn(). Suggested by Jeff Moyer. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jeff Moyer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | ext4: use ext4_get_block_write() for DAX | Matthew Wilcox | 1 file, -4/+4
DAX relies on the get_block function either zeroing newly allocated blocks before they're findable by subsequent calls to get_block, or marking newly allocated blocks as unwritten. ext4_get_block() cannot create unwritten extents, but ext4_get_block_write() can. Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: Andy Rudoff <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | fs/dax.c: fix typo in #endif comment | Valentin Rothberg | 1 file, -1/+1
Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in #endif comment introduced by commit 2b26a9206d6a ("dax: add huge page fault support"). Signed-off-by: Valentin Rothberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | xfs: huge page fault support | Matthew Wilcox | 2 files, -1/+30
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | ext4: huge page fault support | Matthew Wilcox | 1 file, -1/+9
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | ext2: huge page fault support | Matthew Wilcox | 1 file, -1/+8
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08 | dax: add huge page fault support | Matthew Wilcox | 3 files, -3/+170
This is the support code for DAX-enabled filesystems to allow them to provide huge pages in response to faults. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>