From 1c30844d2dfe272d58c8fc000960b835d13aa2ac Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:52 -0800 Subject: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 124 insertions(+), 9 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 24ab1f7394ab..bd8971a29204 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -88,6 +88,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* e.g. boosted watermark reclaim leaves slabs alone */ + unsigned int may_shrinkslab:1; + /* * Cgroups are not reclaimed below their configured memory.low, * unless we threaten to OOM. If any cgroups are skipped due to @@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - shrink_slab(sc->gfp_mask, pgdat->node_id, + if (sc->may_shrinkslab) { + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .may_shrinkslab = 1, }; /* @@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, .may_unmap = 1, .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, + .may_shrinkslab = 1, }; unsigned long lru_pages; @@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = may_swap, + .may_shrinkslab = 1, }; /* @@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the time time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced. + */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; @@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. + */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. + */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + sc.may_shrinkslab = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. Wake up kcompactd -- cgit From 9a1ea439b16b92002e0a6fceebc5d1794906e297 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Fri, 28 Dec 2018 00:36:14 -0800 Subject: mm: put_and_wait_on_page_locked() while page is migrated Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit") Baoquan He reported "Memory hotplug softlock issue" 14 November 2018: https://marc.info/?l=linux-mm&m=154217936431300&w=2 We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking. But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function(). Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic. __migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this? shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices. It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration. [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins Reported-by: Baoquan He Tested-by: Baoquan He Reviewed-by: Andrea Arcangeli Acked-by: Michal Hocko Acked-by: Linus Torvalds Acked-by: Vlastimil Babka Cc: Matthew Wilcox Cc: Baoquan He Cc: David Hildenbrand Cc: Mel Gorman Cc: David Herrmann Cc: Tim Chen Cc: Kan Liang Cc: Andi Kleen Cc: Davidlohr Bueso Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 2 ++ mm/filemap.c | 87 +++++++++++++++++++++++++++++++++++++++++-------- mm/huge_memory.c | 6 ++-- mm/migrate.c | 12 +++---- mm/vmscan.c | 10 ++---- 5 files changed, 84 insertions(+), 33 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 226f96f0dee0..e2d7039af6a3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page) return wait_on_page_bit_killable(compound_head(page), PG_locked); } +extern void put_and_wait_on_page_locked(struct page *page); + /* * Wait for a page to complete writeback */ diff --git a/mm/filemap.c b/mm/filemap.c index 81adec8ee02c..d2df272152f5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, if (wait_page->bit_nr != key->bit_nr) return 0; - /* Stop walking if it's locked */ + /* + * Stop walking if it's locked. + * Is this safe if put_and_wait_on_page_locked() is in use? + * Yes: the waker must hold a reference to this page, and if PG_locked + * has now already been set by another task, that task must also hold + * a reference to the *same usage* of this page; so there is no need + * to walk on to wake even the put_and_wait_on_page_locked() callers. + */ if (test_bit(key->bit_nr, &key->page->flags)) return -1; @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit) wake_up_page_bit(page, bit); } +/* + * A choice of three behaviors for wait_on_page_bit_common(): + */ +enum behavior { + EXCLUSIVE, /* Hold ref to page and take the bit when woken, like + * __lock_page() waiting on then setting PG_locked. + */ + SHARED, /* Hold ref to page and check the bit when woken, like + * wait_on_page_writeback() waiting on PG_writeback. + */ + DROP, /* Drop ref to page before wait, no check when woken, + * like put_and_wait_on_page_locked() on PG_locked. + */ +}; + static inline int wait_on_page_bit_common(wait_queue_head_t *q, - struct page *page, int bit_nr, int state, bool lock) + struct page *page, int bit_nr, int state, enum behavior behavior) { struct wait_page_queue wait_page; wait_queue_entry_t *wait = &wait_page.wait; + bool bit_is_set; bool thrashing = false; + bool delayacct = false; unsigned long pflags; int ret = 0; if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) { - if (!PageSwapBacked(page)) + if (!PageSwapBacked(page)) { delayacct_thrashing_start(); + delayacct = true; + } psi_memstall_enter(&pflags); thrashing = true; } init_wait(wait); - wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0; + wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0; wait->func = wake_page_function; wait_page.page = page; wait_page.bit_nr = bit_nr; @@ -1084,14 +1110,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, spin_unlock_irq(&q->lock); - if (likely(test_bit(bit_nr, &page->flags))) { + bit_is_set = test_bit(bit_nr, &page->flags); + if (behavior == DROP) + put_page(page); + + if (likely(bit_is_set)) io_schedule(); - } - if (lock) { + if (behavior == EXCLUSIVE) { if (!test_and_set_bit_lock(bit_nr, &page->flags)) break; - } else { + } else if (behavior == SHARED) { if (!test_bit(bit_nr, &page->flags)) break; } @@ -1100,12 +1129,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, ret = -EINTR; break; } + + if (behavior == DROP) { + /* + * We can no longer safely access page->flags: + * even if CONFIG_MEMORY_HOTREMOVE is not enabled, + * there is a risk of waiting forever on a page reused + * for something that keeps it locked indefinitely. + * But best check for -EINTR above before breaking. + */ + break; + } } finish_wait(q, wait); if (thrashing) { - if (!PageSwapBacked(page)) + if (delayacct) delayacct_thrashing_end(); psi_memstall_leave(&pflags); } @@ -1124,17 +1164,36 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, void wait_on_page_bit(struct page *page, int bit_nr) { wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false); + wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED); } EXPORT_SYMBOL(wait_on_page_bit); int wait_on_page_bit_killable(struct page *page, int bit_nr) { wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false); + return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED); } EXPORT_SYMBOL(wait_on_page_bit_killable); +/** + * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked + * @page: The page to wait for. + * + * The caller should hold a reference on @page. They expect the page to + * become unlocked relatively soon, but do not wish to hold up migration + * (for example) by holding the reference while waiting for the page to + * come unlocked. After this function returns, the caller should not + * dereference @page. + */ +void put_and_wait_on_page_locked(struct page *page) +{ + wait_queue_head_t *q; + + page = compound_head(page); + q = page_waitqueue(page); + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP); +} + /** * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue * @page: Page defining the wait queue of interest @@ -1264,7 +1323,8 @@ void __lock_page(struct page *__page) { struct page *page = compound_head(__page); wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true); + wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, + EXCLUSIVE); } EXPORT_SYMBOL(__lock_page); @@ -1272,7 +1332,8 @@ int __lock_page_killable(struct page *__page) { struct page *page = compound_head(__page); wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true); + return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, + EXCLUSIVE); } EXPORT_SYMBOL_GPL(__lock_page_killable); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index da6682bb69aa..0c0e18409fde 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1490,8 +1490,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (!get_page_unless_zero(page)) goto out_unlock; spin_unlock(vmf->ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); goto out; } @@ -1527,8 +1526,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (!get_page_unless_zero(page)) goto out_unlock; spin_unlock(vmf->ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index f7e4bfdc13b7..acda06f99754 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, /* * Once page cache replacement of page migration started, page_count - * *must* be zero. And, we don't want to call wait_on_page_locked() - * against a page without get_page(). - * So, we use get_page_unless_zero(), here. Even failed, page fault - * will occur again. + * is zero; but we must not call put_and_wait_on_page_locked() without + * a ref. Use get_page_unless_zero(), and just fault again if it fails. */ if (!get_page_unless_zero(page)) goto out; pte_unmap_unlock(ptep, ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); return; out: pte_unmap_unlock(ptep, ptl); @@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd) if (!get_page_unless_zero(page)) goto unlock; spin_unlock(ptl); - wait_on_page_locked(page); - put_page(page); + put_and_wait_on_page_locked(page); return; unlock: spin_unlock(ptl); diff --git a/mm/vmscan.c b/mm/vmscan.c index bd8971a29204..a714c4f800e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,14 +1460,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, count_memcg_page_event(page, PGLAZYFREED); } else if (!mapping || !__remove_mapping(mapping, page, true)) goto keep_locked; - /* - * At this point, we have no other references and there is - * no way to pick any more up (removed from LRU, removed - * from pagecache). Can use non-atomic bitops now (and - * we obviously don't have to worry about waking up a process - * waiting on the page lock, because there are no references. - */ - __ClearPageLocked(page); + + unlock_page(page); free_it: nr_reclaimed++; -- cgit