From 371a096edf43a8c71844cf71c20765c8b21d07d9 Mon Sep 17 00:00:00 2001
From: Huang Ying
Date: Fri, 7 Oct 2016 16:59:30 -0700
Subject: mm: don't use radix tree writeback tags for pages in swap cache
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK, etc.)
to accelerate finding the pages with a specific tag in the radix tree
during inode writeback.  But for anonymous pages in the swap cache, there
is no inode writeback.  So there is no need to find the pages with some
writeback tags in the radix tree.  It is not necessary to touch radix tree
writeback tags for pages in the swap cache.

Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags.  The flag is set for swap caches.  It may be used for DAX file
systems, etc.

With this patch, the swap-out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system.  The swap device used is a
RAM-simulated PMEM (persistent memory) device.  The improvement comes
from the reduced contention on the swap cache radix tree lock.  To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and part
of the swap device are used up.

Details of the comparison are as follows:

                base              base+patch
       ----------------  --------------------------
         %stddev     %change         %stddev
             \          |                \
   2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
   1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
     10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
     10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
     10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
     10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Acked-by: Rik van Riel
Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Mel Gorman
Cc: Tejun Heo
Cc: Wu Fengguang
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/swap_state.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'mm/swap_state.c')

diff --git a/mm/swap_state.c b/mm/swap_state.c
index c8310a37be3a..268b8191982b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -37,6 +37,8 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = {
 		.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
 		.i_mmap_writable = ATOMIC_INIT(0),
 		.a_ops		= &swap_aops,
+		/* swap cache doesn't use writeback related tags */
+		.flags		= 1 << AS_NO_WRITEBACK_TAGS,
 	}
 };

-- cgit
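Because the cgit view above is limited to mm/swap_state.c, the companion hunks that actually consult AS_NO_WRITEBACK_TAGS on the writeback path are not shown here. The standalone C toy below is only a sketch of the idea under that assumption, not kernel code: a per-address_space flag bit lets the writeback accounting skip the lock-protected radix tree tag updates entirely for swap-cache mappings. Every identifier except AS_NO_WRITEBACK_TAGS is made up for illustration.

#include <stdio.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel structures; only AS_NO_WRITEBACK_TAGS comes
 * from the patch above, everything else is illustrative. */
enum { AS_NO_WRITEBACK_TAGS = 5 };	/* bit position chosen arbitrarily here */

struct toy_address_space {
	unsigned long flags;		/* models address_space->flags */
	unsigned long tag_updates;	/* counts lock-protected radix tree tag updates */
};

static bool toy_mapping_use_writeback_tags(const struct toy_address_space *mapping)
{
	return !(mapping->flags & (1UL << AS_NO_WRITEBACK_TAGS));
}

/* Models the writeback-accounting path: the expensive, lock-protected
 * radix tree tag update is only done for mappings that use the tags. */
static void toy_set_page_writeback(struct toy_address_space *mapping)
{
	if (toy_mapping_use_writeback_tags(mapping))
		mapping->tag_updates++;	/* stands in for tree_lock + radix_tree_tag_set() */
}

int main(void)
{
	struct toy_address_space file_mapping = { .flags = 0 };
	struct toy_address_space swap_mapping = { .flags = 1UL << AS_NO_WRITEBACK_TAGS };
	int i;

	for (i = 0; i < 1000; i++) {
		toy_set_page_writeback(&file_mapping);
		toy_set_page_writeback(&swap_mapping);
	}
	printf("file mapping tag updates: %lu\n", file_mapping.tag_updates);	/* 1000 */
	printf("swap mapping tag updates: %lu\n", swap_mapping.tag_updates);	/* 0 */
	return 0;
}

The -100.0% drops of the two _raw_spin_lock_irqsave entries in the profile above correspond to exactly this skipped work.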
From 6fcb52a56ff60d240f06296b12827e7f20d45f63 Mon Sep 17 00:00:00 2001
From: Aaron Lu
Date: Fri, 7 Oct 2016 17:00:08 -0700
Subject: thp: reduce usage of huge zero page's atomic counter

The global zero page is used to satisfy an anonymous read fault.  If
THP (Transparent HugePage) is enabled, then the global huge zero page is
used instead.  The global huge zero page uses an atomic counter for
reference counting and is allocated/freed dynamically according to its
counter value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce accesses to the global counter so that the CPU load can be reduced
accordingly.

To do this, a new mm_struct flag is introduced: MMF_HUGE_ZERO_PAGE.  With
this flag, a process only needs to touch the global counter in two cases:

1. the first time it uses the global huge zero page;
2. when the mm_users count of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the last process that ever used it exits.  And
because mm_users is used, kernel threads are not eligible to use the huge
zero page either.  Since no kthread uses the huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to release the reference when mm_count reaches zero
instead.

Test case used on a Haswell-EP machine:

  usemem -n 72 --readonly -j 0x200000 100G

This spawns 72 processes; each mmaps 100G of anonymous space and then
reads that space sequentially with a step of 2MB.

CPU cycles from perf report for base commit:
    54.03%  usemem  [kernel.kallsyms]  [k] get_huge_zero_page
CPU cycles from perf report for this commit:
     0.11%  usemem  [kernel.kallsyms]  [k] mm_get_huge_zero_page

Performance (throughput) of the workload for base commit: 1784430792
Performance (throughput) of the workload for this commit: 4726928591
(a 164% increase)

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
(a 57% drop)

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu
Cc: Sergey Senozhatsky
Cc: "Kirill A. Shutemov"
Cc: Dave Hansen
Cc: Tim Chen
Cc: Huang Ying
Cc: Vlastimil Babka
Cc: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Ebru Akagunduz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/dax.c                |  2 +-
 include/linux/huge_mm.h |  8 ++++----
 include/linux/sched.h   |  1 +
 kernel/fork.c           |  1 +
 mm/huge_memory.c        | 36 +++++++++++++++++++++++++-----------
 mm/swap.c               |  4 +---
 mm/swap_state.c         |  4 +---
 7 files changed, 34 insertions(+), 22 deletions(-)

(limited to 'mm/swap_state.c')

diff --git a/fs/dax.c b/fs/dax.c
index cc025f82ef07..014defd2e744 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1036,7 +1036,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!write && !buffer_mapped(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
+		struct page *zero_page = mm_get_huge_zero_page(vma->vm_mm);
 
 		if (unlikely(!zero_page)) {
 			dax_pmd_dbg(&bh, address, "no zero page");
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4fca5263fd42..9b9f65d99873 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -156,8 +156,8 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return is_huge_zero_page(pmd_page(pmd));
 }
 
-struct page *get_huge_zero_page(void);
-void put_huge_zero_page(void);
+struct page *mm_get_huge_zero_page(struct mm_struct *mm);
+void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot)	pmd_mkhuge(mk_pmd(page, prot))
 
@@ -220,9 +220,9 @@ static inline bool is_huge_zero_page(struct page *page)
 	return false;
 }
 
-static inline void put_huge_zero_page(void)
+static inline void mm_put_huge_zero_page(struct mm_struct *mm)
 {
-	BUILD_BUG();
+	return;
 }
 
 static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6bee6f988912..348f51b0ec92 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -526,6 +526,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
+#define MMF_HUGE_ZERO_PAGE	23	/* mm has ever used the global huge zero page */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9a8ec66cd4df..6d42242485cb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -854,6 +854,7 @@ static inline void __mmput(struct mm_struct *mm)
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
 	exit_mmap(mm);
+	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
 	if (!list_empty(&mm->mmlist)) {
 		spin_lock(&mmlist_lock);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a0b0e562407d..12b9f1a39b63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -59,7 +59,7 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 
-struct page *get_huge_zero_page(void)
+static struct page *get_huge_zero_page(void)
 {
 	struct page *zero_page;
 retry:
@@ -86,7 +86,7 @@ retry:
 	return READ_ONCE(huge_zero_page);
 }
 
-void put_huge_zero_page(void)
+static void put_huge_zero_page(void)
 {
 	/*
 	 * Counter should never go to zero here. Only shrinker can put
@@ -95,6 +95,26 @@ void put_huge_zero_page(void)
 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }
 
+struct page *mm_get_huge_zero_page(struct mm_struct *mm)
+{
+	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+		return READ_ONCE(huge_zero_page);
+
+	if (!get_huge_zero_page())
+		return NULL;
+
+	if (test_and_set_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+		put_huge_zero_page();
+
+	return READ_ONCE(huge_zero_page);
+}
+
+void mm_put_huge_zero_page(struct mm_struct *mm)
+{
+	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+		put_huge_zero_page();
+}
+
 static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
 					struct shrink_control *sc)
 {
@@ -644,7 +664,7 @@ int do_huge_pmd_anonymous_page(struct fault_env *fe)
 		pgtable = pte_alloc_one(vma->vm_mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
-		zero_page = get_huge_zero_page();
+		zero_page = mm_get_huge_zero_page(vma->vm_mm);
 		if (unlikely(!zero_page)) {
 			pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
@@ -666,10 +686,8 @@ int do_huge_pmd_anonymous_page(struct fault_env *fe)
 			}
 		} else
 			spin_unlock(fe->ptl);
-		if (!set) {
+		if (!set)
 			pte_free(vma->vm_mm, pgtable);
-			put_huge_zero_page();
-		}
 		return ret;
 	}
 	gfp = alloc_hugepage_direct_gfpmask(vma);
@@ -823,7 +841,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * since we already have a zero page to copy. It just takes a
 	 * reference.
 	 */
-	zero_page = get_huge_zero_page();
+	zero_page = mm_get_huge_zero_page(dst_mm);
 	set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 			zero_page);
 	ret = 0;
@@ -1081,7 +1099,6 @@ alloc:
 	update_mmu_cache_pmd(vma, fe->address, fe->pmd);
 	if (!page) {
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-		put_huge_zero_page();
 	} else {
 		VM_BUG_ON_PAGE(!PageHead(page), page);
 		page_remove_rmap(page, true);
@@ -1542,7 +1559,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	put_huge_zero_page();
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
@@ -1565,8 +1581,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (!vma_is_anonymous(vma)) {
 		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-		if (is_huge_zero_pmd(_pmd))
-			put_huge_zero_page();
 		if (vma_is_dax(vma))
 			return;
 		page = pmd_page(_pmd);
diff --git a/mm/swap.c b/mm/swap.c
index 75c63bb2a1da..4dcf852e1e6d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -748,10 +748,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 			locked_pgdat = NULL;
 		}
 
-		if (is_huge_zero_page(page)) {
-			put_huge_zero_page();
+		if (is_huge_zero_page(page))
 			continue;
-		}
 
 		page = compound_head(page);
 		if (!put_page_testzero(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 268b8191982b..8679c997eab6 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -254,9 +254,7 @@ static inline void free_swap_cache(struct page *page)
 void free_page_and_swap_cache(struct page *page)
 {
 	free_swap_cache(page);
-	if (is_huge_zero_page(page))
-		put_huge_zero_page();
-	else
+	if (!is_huge_zero_page(page))
 		put_page(page);
 }
 

-- cgit
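The essence of the change above is replacing "one global atomic operation per huge-zero-page fault" with "at most one per mm lifetime". The standalone C toy below sketches that amortization pattern; it is illustrative only, not the kernel code, and is single-threaded — the real mm_get_huge_zero_page() uses test_and_set_bit() so that two threads of the same mm racing on the first fault drop the extra reference.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_long huge_zero_refcount;	/* models the global huge_zero_refcount */

struct toy_mm {
	bool used_huge_zero_page;	/* models the MMF_HUGE_ZERO_PAGE mm flag */
	long faults;
};

/* Old behaviour: every read fault on the huge zero page bumps the counter. */
static void toy_fault_old(struct toy_mm *mm)
{
	atomic_fetch_add(&huge_zero_refcount, 1);
	mm->faults++;
}

/* New behaviour: only the first fault of an mm touches the global counter. */
static void toy_fault_new(struct toy_mm *mm)
{
	if (!mm->used_huge_zero_page) {
		mm->used_huge_zero_page = true;
		atomic_fetch_add(&huge_zero_refcount, 1);
	}
	mm->faults++;
}

/* Models mm_put_huge_zero_page() called from __mmput(): drop the single
 * reference when the mm goes away. */
static void toy_mmput(struct toy_mm *mm)
{
	if (mm->used_huge_zero_page)
		atomic_fetch_sub(&huge_zero_refcount, 1);
}

int main(void)
{
	struct toy_mm mm = { 0 };
	int i;

	for (i = 0; i < 1000000; i++)
		toy_fault_new(&mm);	/* hits the global counter exactly once */
	toy_mmput(&mm);

	(void)toy_fault_old;		/* old path kept only for contrast */
	printf("faults handled: %ld, refcount after exit: %ld\n",
	       mm.faults, atomic_load(&huge_zero_refcount));
	return 0;
}

This also explains why the patch frees the huge zero page later than before: the reference now lives as long as the mm, not as long as each individual use of the page.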
From f6ab1f7f6b2d8e48c5fc47746a67363b20d79a1d Mon Sep 17 00:00:00 2001
From: Huang Ying
Date: Fri, 7 Oct 2016 17:00:21 -0700
Subject: mm, swap: use offset of swap entry as key of swap cache

This patch improves the performance of swap cache operations when the
type of the swap device is not 0.  Originally, the whole swap entry value
is used as the key of the swap cache, even though there is one radix tree
for each swap device.  If the type of the swap device is not 0, the
height of the radix tree of the swap cache will be increased
unnecessarily, especially on 64-bit architectures.  For example, for a
1GB swap device on the x86_64 architecture, the height of the radix tree
of the swap cache is 11, but if the offset of the swap entry is used as
the key instead, the height is only 4.  The increased height causes
unnecessary radix tree descents and an increased cache footprint.

This patch reduces the height of the radix tree of the swap cache by
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache.

In a 32-process sequential swap-out test case on a Xeon E5 v3 system with
a RAM disk as swap, lock contention on the swap cache spinlock is reduced
from 20.15% to 12.19% when the type of the swap device is 1.

Using the whole swap entry as the key:

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78

Using the swap offset as the key:

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94

Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: "Kirill A. Shutemov"
Cc: Dave Hansen
Cc: Dan Williams
Cc: Joonsoo Kim
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Aaron Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mm.h | 8 ++++----
 mm/memcontrol.c    | 5 +++--
 mm/mincore.c       | 5 +++--
 mm/swap_state.c    | 8 ++++----
 mm/swapfile.c      | 4 ++--
 5 files changed, 16 insertions(+), 14 deletions(-)

(limited to 'mm/swap_state.c')

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 046077b4209d..028e84e2ab42 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1048,19 +1048,19 @@ struct address_space *page_file_mapping(struct page *page)
 	return page->mapping;
 }
 
+extern pgoff_t __page_file_index(struct page *page);
+
 /*
  * Return the pagecache index of the passed page.  Regular pagecache pages
- * use ->index whereas swapcache pages use ->private
+ * use ->index whereas swapcache pages use swp_offset(->private)
  */
 static inline pgoff_t page_index(struct page *page)
 {
 	if (unlikely(PageSwapCache(page)))
-		return page_private(page);
+		return __page_file_index(page);
 	return page->index;
 }
 
-extern pgoff_t __page_file_index(struct page *page);
-
 /*
  * Return the file index of the page. Regular pagecache pages use ->index
  * whereas swapcache pages use swp_offset(->private)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0739d4129a93..60bb830abc34 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4408,7 +4408,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
 	 */
-	page = find_get_page(swap_address_space(ent), ent.val);
+	page = find_get_page(swap_address_space(ent), swp_offset(ent));
 	if (do_memsw_account())
 		entry->val = ent.val;
 
@@ -4446,7 +4446,8 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 			swp_entry_t swp = radix_to_swp_entry(page);
 			if (do_memsw_account())
 				*entry = swp;
-			page = find_get_page(swap_address_space(swp), swp.val);
+			page = find_get_page(swap_address_space(swp),
+					     swp_offset(swp));
 		}
 	} else
 		page = find_get_page(mapping, pgoff);
diff --git a/mm/mincore.c b/mm/mincore.c
index c0b5ba965200..bfb866435478 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -66,7 +66,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (radix_tree_exceptional_entry(page)) {
 			swp_entry_t swp = radix_to_swp_entry(page);
-			page = find_get_page(swap_address_space(swp), swp.val);
+			page = find_get_page(swap_address_space(swp),
+					     swp_offset(swp));
 		}
 	} else
 		page = find_get_page(mapping, pgoff);
@@ -150,7 +151,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		} else {
 #ifdef CONFIG_SWAP
 			*vec = mincore_page(swap_address_space(entry),
-					    entry.val);
+					    swp_offset(entry));
 #else
 			WARN_ON(1);
 			*vec = 1;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8679c997eab6..35d7e0ee1c77 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -94,7 +94,7 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
 	error = radix_tree_insert(&address_space->page_tree,
-				  entry.val, page);
+				  swp_offset(entry), page);
 	if (likely(!error)) {
 		address_space->nrpages++;
 		__inc_node_page_state(page, NR_FILE_PAGES);
@@ -145,7 +145,7 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, page_private(page));
+	radix_tree_delete(&address_space->page_tree, swp_offset(entry));
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	address_space->nrpages--;
@@ -283,7 +283,7 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 {
 	struct page *page;
 
-	page = find_get_page(swap_address_space(entry), entry.val);
+	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 
 	if (page) {
 		INC_CACHE_INFO(find_success);
@@ -310,7 +310,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		found_page = find_get_page(swapper_space, entry.val);
+		found_page = find_get_page(swapper_space, swp_offset(entry));
 		if (found_page)
 			break;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 134c085d0d7b..2210de290b54 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -105,7 +105,7 @@ __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset)
 	struct page *page;
 	int ret = 0;
 
-	page = find_get_page(swap_address_space(entry), entry.val);
+	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 	if (!page)
 		return 0;
 	/*
@@ -1005,7 +1005,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	if (p) {
 		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
-					     entry.val);
+					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
 				put_page(page);
 				page = NULL;

-- cgit
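The gain in this last patch is purely about key magnitude: with the radix tree's default fanout of 64 slots per node (RADIX_TREE_MAP_SHIFT = 6), the tree only needs to be tall enough to cover its largest key, and packing a nonzero swap type into the high bits of a 64-bit swp_entry_t.val forces a near-maximal key even for a tiny device. The standalone C toy below sketches the effect; the type shift and the simplified height formula are assumptions for illustration (they do not reproduce the exact 11-vs-4 figures quoted above), but they show the same gap.

#include <stdio.h>

/* Illustrative only; assumes 64-bit unsigned long.  The type shift mirrors
 * the idea that swp_entry_t packs the swap type into the top bits, but the
 * exact value is an assumption, not copied from swapops.h. */
#define TOY_SWP_TYPE_SHIFT	58
#define TOY_RADIX_MAP_SHIFT	6	/* default radix tree fanout: 64 slots per node */

static unsigned long toy_swp_entry(unsigned long type, unsigned long offset)
{
	return (type << TOY_SWP_TYPE_SHIFT) | offset;
}

/* Simplified: number of 6-bit levels needed to index keys up to max_key. */
static int toy_radix_height(unsigned long max_key)
{
	int height = 1;

	while (max_key >> TOY_RADIX_MAP_SHIFT) {
		max_key >>= TOY_RADIX_MAP_SHIFT;
		height++;
	}
	return height;
}

int main(void)
{
	/* A 1GB swap device with 4KB pages has 262144 slots. */
	unsigned long max_offset = (1UL << 30) / 4096 - 1;

	printf("key = whole entry, type 1: max key %#lx, height %d\n",
	       toy_swp_entry(1, max_offset),
	       toy_radix_height(toy_swp_entry(1, max_offset)));
	printf("key = offset only:         max key %#lx, height %d\n",
	       max_offset, toy_radix_height(max_offset));
	return 0;
}

A shorter tree means fewer node descents per insert, lookup, and delete performed under the per-device tree_lock, which is what shrinks the contention figures quoted in the commit message.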