path: root/mm
Age | Commit message | Author | Files | Lines
2014-12-02 | mm/vmpressure.c: fix race in vmpressure_work_fn() | Andrew Morton | 1 | -3/+5
In some android devices, there will be a "divide by zero" exception. vmpr->scanned could be zero before spin_lock(&vmpr->sr_lock). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051 [[email protected]: neaten] Reported-by: ji_ang <[email protected]> Cc: Anton Vorontsov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
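The fix pattern described above (read and test `scanned` only while holding the lock) can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct vmpr_sketch`, its helpers, the pthread mutex standing in for `sr_lock`, and the simple percentage formula are all assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct vmpressure. */
struct vmpr_sketch {
    pthread_mutex_t sr_lock;    /* plays the role of vmpr->sr_lock */
    unsigned long scanned;
    unsigned long reclaimed;
};

static void vmpr_sketch_init(struct vmpr_sketch *v)
{
    pthread_mutex_init(&v->sr_lock, NULL);
    v->scanned = 0;
    v->reclaimed = 0;
}

/* Account one reclaim pass. */
static void vmpr_sketch_add(struct vmpr_sketch *v,
                            unsigned long scanned, unsigned long reclaimed)
{
    assert(reclaimed <= scanned);
    pthread_mutex_lock(&v->sr_lock);
    v->scanned += scanned;
    v->reclaimed += reclaimed;
    pthread_mutex_unlock(&v->sr_lock);
}

/*
 * Returns an illustrative pressure value in 0..100, or -1 when there is
 * nothing to report. The zero check happens under the lock, so another
 * worker cannot reset 'scanned' between the check and the division.
 */
static int vmpr_sketch_level(struct vmpr_sketch *v)
{
    unsigned long scanned, reclaimed;

    pthread_mutex_lock(&v->sr_lock);
    scanned = v->scanned;
    if (!scanned) {
        pthread_mutex_unlock(&v->sr_lock);
        return -1;
    }
    reclaimed = v->reclaimed;
    v->scanned = 0;
    v->reclaimed = 0;
    pthread_mutex_unlock(&v->sr_lock);

    return (int)(100 - 100 * reclaimed / scanned);
}
```

Checking `scanned` before taking the lock would leave a window in which a concurrent worker zeroes the counter, which is the divide-by-zero the commit describes.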
2014-12-02 | mm: frontswap: invalidate expired data on a dup-store failure | Weijie Yang | 1 | -1/+3
If a frontswap dup-store fails, the expired page in the backend must be invalidated; otherwise data corruption can occur. For example:
1. use zswap as the frontswap backend with the writeback feature
2. store a swap page (version_1) to entry A: success
3. dup-store a newer page (version_2) to the same entry A: fail
4. use __swap_writepage() to write the version_2 page to the swapfile: success
5. zswap shrinks and writes the version_1 page back to the swapfile
6. the version_2 page is overwritten by version_1: data corruption
This patch fixes the issue by invalidating the expired data immediately when a dup-store fails.
Signed-off-by: Weijie Yang <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Bob Liu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
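The rule this commit enforces can be sketched with a toy single-slot backend. Everything here is hypothetical (`struct backend_sketch`, the `reject` flag used to simulate a failing store, the helper names); it only illustrates the "invalidate the old copy when a dup-store fails" behaviour, not the real frontswap or zswap interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry backend: 'slot' holds the stored page, if any. */
struct backend_sketch {
    const char *slot;   /* currently stored page for entry A, or NULL */
    bool reject;        /* when true, every store attempt fails */
};

static bool backend_store(struct backend_sketch *b, const char *page)
{
    if (b->reject)
        return false;
    b->slot = page;
    return true;
}

static void backend_invalidate(struct backend_sketch *b)
{
    b->slot = NULL;
}

/*
 * Store a page. If the entry already held an older page (a dup-store)
 * and the new store fails, drop the old page so that a later writeback
 * cannot overwrite newer data with it.
 */
static bool frontswap_store_sketch(struct backend_sketch *b, const char *page)
{
    bool dup = b->slot != NULL;

    assert(page != NULL);
    if (backend_store(b, page))
        return true;
    if (dup)
        backend_invalidate(b);
    return false;
}
```

After a failed dup-store the entry is empty, so the caller's fallback write to the swapfile is the only copy and nothing stale remains in the backend.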
2014-12-02 | Merge tag 'v3.18-rc7' into drm-next | Dave Airlie | 9 | -52/+153
This fixes a bunch of conflicts prior to merging i915 tree. Linux 3.18-rc7 Conflicts: drivers/gpu/drm/exynos/exynos_drm_drv.c drivers/gpu/drm/i915/i915_drv.c drivers/gpu/drm/i915/intel_pm.c drivers/gpu/drm/tegra/dc.c
2014-11-27 | iov_iter.c: convert copy_to_iter() to iterate_and_advance | Al Viro | 1 | -82/+9
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: convert copy_from_iter() to iterate_and_advance | Al Viro | 1 | -91/+15
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() | Al Viro | 1 | -36/+24
Just have copy_page_{to,from}_iter() fall back to kmap_atomic() + copy_{to,from}_iter() + kunmap_atomic() in the ITER_BVEC case. As a matter of fact, that's what we want to do for any iov_iter kind that isn't blocking, e.g. ITER_KVEC will also go that way once we recognize it at the level of the iov_iter.c primitives. Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: convert iov_iter_zero() to iterate_and_advance | Al Viro | 1 | -86/+12
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds | Al Viro | 1 | -62/+45
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds | Al Viro | 1 | -50/+28
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: convert iov_iter_npages() to iterate_all_kinds | Al Viro | 1 | -54/+19
Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: iterate_and_advance | Al Viro | 1 | -76/+28
same as iterate_all_kinds, but iterator is moved to the position past the last byte we'd handled. iov_iter_advance() converted to it Signed-off-by: Al Viro <[email protected]>
2014-11-27 | iov_iter.c: macros for iterating over iov_iter | Al Viro | 1 | -126/+86
iterate_all_kinds(iter, size, ident, step_iovec, step_bvec) iterates through the ranges covered by iter (up to size bytes total), repeating step_iovec or step_bvec for each of those. ident is declared in expansion of that thing, either as struct iovec or struct bvec, and it contains the range we are currently looking at. step_bvec should be a void expression, step_iovec - a size_t one, with non-zero meaning "stop here, that many bytes from this range left". In the end, the amount actually handled is stored in size. iov_iter_copy_from_user_atomic() and iov_iter_alignment() converted to it. Signed-off-by: Al Viro <[email protected]>
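The macro contract described above can be illustrated with a much-reduced user-space sketch that handles only the iovec kind. The names `iter_sketch`, `iterate_iovec_sketch` and `copy_out_sketch` are hypothetical, and the real kernel macro differs in detail; the sketch only mirrors the described behaviour: the identifier is declared inside the expansion, the step expression yields the number of bytes left unhandled, and the size argument ends up holding the amount actually processed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical, iovec-only stand-in for struct iov_iter. */
struct iter_sketch {
    const struct iovec *iov;
    size_t nr_segs;
    size_t iov_offset;  /* bytes already consumed in the first segment */
    size_t count;       /* bytes remaining in the whole iterator */
};

/*
 * Walk up to n bytes of i, declaring 'v' as the current range and
 * evaluating STEP (a size_t expression: bytes left unhandled in v).
 * On exit, n holds the number of bytes actually handled.
 */
#define iterate_iovec_sketch(i, n, v, STEP) do {                      \
    size_t todo_ = (n) < (i)->count ? (n) : (i)->count;               \
    size_t skip_ = (i)->iov_offset, done_ = 0;                        \
    const struct iovec *p_ = (i)->iov, *end_ = p_ + (i)->nr_segs;     \
    while (done_ < todo_ && p_ < end_) {                              \
        struct iovec v;                                               \
        size_t left_;                                                 \
        v.iov_base = (char *)p_->iov_base + skip_;                    \
        v.iov_len = p_->iov_len - skip_;                              \
        if (v.iov_len > todo_ - done_)                                \
            v.iov_len = todo_ - done_;                                \
        left_ = v.iov_len ? (STEP) : 0;                               \
        done_ += v.iov_len - left_;                                   \
        if (left_)                                                    \
            break;                                                    \
        skip_ = 0;                                                    \
        p_++;                                                         \
    }                                                                 \
    (n) = done_;                                                      \
} while (0)

/* Copy up to 'bytes' bytes out of the iterator without advancing it. */
static size_t copy_out_sketch(const struct iter_sketch *it, char *dst,
                              size_t bytes)
{
    char *to = dst;

    assert(it->iov_offset <= (it->nr_segs ? it->iov[0].iov_len : 0));
    iterate_iovec_sketch(it, bytes, seg,
        (memcpy(to, seg.iov_base, seg.iov_len), to += seg.iov_len, (size_t)0));
    return bytes;
}
```

The step here never reports leftover bytes because a plain `memcpy()` cannot fail; a step that copies to or from user space would return the uncopied remainder instead, which stops the walk early.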
2014-11-22 | Merge branch 'master' into for-3.19 | Tejun Heo | 19 | -218/+342
Pull in to receive 54ef6df3f3f1 ("rcu: Provide counterpart to rcu_dereference() for non-RCU situations"). Signed-off-by: Tejun Heo <[email protected]>
2014-11-20 | Merge Linus' tree to be able to apply submitted patches to newer code than current trivial.git base | Jiri Kosina | 66 | -4214/+6629
2014-11-19 | kill f_dentry uses | Al Viro | 1 | -2/+2
Signed-off-by: Al Viro <[email protected]>
2014-11-19 | merge nfs bugfixes into nfsd for-3.19 branch | J. Bruce Fields | 21 | -218/+458
In addition to nfsd bugfixes, there are some fixes in -rc5 for client bugs that can interfere with my testing.
2014-11-18 | x86, mpx: Cleanup unused bound tables | Dave Hansen | 1 | -0/+2
The previous patch allocates bounds tables on-demand. As noted in an earlier description, these can add up to *HUGE* amounts of memory. This has caused OOMs in practice when running tests. This patch adds support for freeing bounds tables when they are no longer in use.
There are two types of mappings in play when unmapping tables:
1. The mapping with the actual data, which userspace is munmap()ing or brk()ing away, etc...
2. The mapping for the bounds table *backing* the data (is tagged with VM_MPX, see the patch "add MPX specific mmap interface").
If userspace uses the prctl() introduced earlier in this patchset to enable the management of bounds tables in kernel, when it unmaps the first type of mapping with the actual data, the kernel needs to free the mapping for the bounds table backing the data. This patch hooks in at the very end of do_unmap() to do so. We look at the addresses being unmapped and find the bounds directory entries and tables which cover those addresses. If an entire table is unused, we clear the associated directory entry and free the table.
Once we unmap the bounds table, we would have a bounds directory entry pointing at empty address space. That address space might now be allocated for some other (random) use, and the MPX hardware might now try to walk it as if it were a bounds table. That would be bad. So any unmapping of an entire bounds table has to be accompanied by a corresponding write to the bounds directory entry to invalidate it. That write to the bounds directory can fault, which causes the following problem: Since we are doing the freeing from munmap() (and other paths like it), we hold mmap_sem for write. If we fault, the page fault handler will attempt to acquire mmap_sem for read and we will deadlock. To avoid the deadlock, we pagefault_disable() when touching the bounds directory entry and use a get_user_pages() to resolve the fault.
The unmapping of bounds tables happens under vm_munmap(). We also (indirectly) call vm_munmap() to _do_ the unmapping of the bounds tables. We avoid unbounded recursion by disallowing freeing of bounds tables *for* bounds tables. This would not occur normally, so should not have any practical impact. Being strict about it here helps ensure that we do not have an exploitable stack overflow.
Based-on-patch-by: Qiaowei Ren <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Dave Hansen <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2014-11-17 | mmu_gather: move minimal range calculations into generic code | Will Deacon | 1 | -22/+8
On architectures with hardware broadcasting of TLB invalidation messages , it makes sense to reduce the range of the mmu_gather structure when unmapping page ranges based on the dirty address information passed to tlb_remove_tlb_entry. arm64 already does this by directly manipulating the start/end fields of the gather structure, but this confuses the generic code which does not expect these fields to change and can end up calculating invalid, negative ranges when forcing a flush in zap_pte_range. This patch moves the minimal range calculation out of the arm64 code and into the generic implementation, simplifying zap_pte_range in the process (which no longer needs to care about start/end, since they will point to the appropriate ranges already). With the range being tracked by core code, the need_flush flag is dropped in favour of checking that the end of the range has actually been set. Cc: Benjamin Herrenschmidt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russell King - ARM Linux <[email protected]> Cc: Michal Simek <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Will Deacon <[email protected]>
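The range tracking this commit moves into generic code can be sketched as below. This is a minimal user-space illustration under stated assumptions: `struct gather_sketch`, the helper names, the 4 KiB page size and the use of `ULONG_MAX` as the "empty" start value are all hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define PAGE_SIZE_SKETCH 4096UL     /* assumed page size for the example */

/* Hypothetical stand-in for the start/end fields of struct mmu_gather. */
struct gather_sketch {
    unsigned long start;
    unsigned long end;
};

/* An empty range: start above any address, end at zero. */
static void gather_reset(struct gather_sketch *g)
{
    g->start = ULONG_MAX;
    g->end = 0;
}

/* Widen the tracked range so that it covers the page at addr. */
static void gather_note_page(struct gather_sketch *g, unsigned long addr)
{
    assert(addr <= ULONG_MAX - PAGE_SIZE_SKETCH);
    if (addr < g->start)
        g->start = addr;
    if (addr + PAGE_SIZE_SKETCH > g->end)
        g->end = addr + PAGE_SIZE_SKETCH;
}

/* Replaces a separate need_flush flag: a set 'end' means work is pending. */
static bool gather_needs_flush(const struct gather_sketch *g)
{
    return g->end != 0;
}
```

Because the range only ever grows between resets, the flush path can use `start` and `end` directly and cannot observe a negative range.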
2014-11-14 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -2/+2
Pull vfs fix from Al Viro: "Fix for a really embarrassing braino in iov_iter. Kudos to paulus..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix thinko in iov_iter_single_seg_count
2014-11-14 | mm: Update generic gup implementation to handle hugepage directory | Aneesh Kumar K.V | 1 | -8/+73
Update generic gup implementation with powerpc specific details. On powerpc at pmd level we can have hugepte, normal pmd pointer or a pointer to the hugepage directory. Tested-by: Steve Capper <[email protected]> Acked-by: Steve Capper <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2014-11-13 | mem-hotplug: reset node present pages when hot-adding a new pgdat | Tang Chen | 1 | -0/+17
When memory is hot-added, all of it starts in the offline state. So clear present_pages for all zones, because the counts will be updated in online_pages() and offline_pages(). Otherwise, /proc/zoneinfo reports corrupted values.
When the memory of node2 is offline:
    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
        spanned 8388608
        present 8388608
        managed 0
When we online memory on node2:
    # cat /proc/zoneinfo
    ......
    Node 2, zone Movable
    ......
        spanned 8388608
        present 16777216
        managed 8388608
Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Cc: <[email protected]> [3.16+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mem-hotplug: reset node managed pages when hot-adding a new pgdat | Tang Chen | 3 | -7/+19
In free_area_init_core(), zone->managed_pages is set to an approximate value for lowmem, and will be adjusted when the bootmem allocator frees pages into the buddy system. But free_area_init_core() is also called by hotadd_new_pgdat() when hot-adding memory. As a result, zone->managed_pages of the newly added node's pgdat is set to an approximate value at the very beginning. Even if the memory on that node has not been onlined, /sys/devices/system/node/nodeXXX/meminfo shows wrong values:
hot-add node2 (memory not onlined)
    cat /sys/devices/system/node/node2/meminfo
    Node 2 MemTotal: 33554432 kB
    Node 2 MemFree: 0 kB
    Node 2 MemUsed: 33554432 kB
    Node 2 Active: 0 kB
This patch fixes the problem by resetting the node's managed pages to 0 after hot-adding a new node.
1. Move reset_managed_pages_done from reset_node_managed_pages() to reset_all_zones_managed_pages()
2. Make reset_node_managed_pages() non-static
3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat is initialized
Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: <[email protected]> [3.16+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm/debug-pagealloc: correct freepage accounting and order resetting | Joonsoo Kim | 1 | -3/+5
One thing I did in this patch is fixing freepage accounting. If we clear guard page and link it onto isolate buddy list, we should not increase freepage count. This patch adds conditional branch to skip counting in this case. Without this patch, this overcounting happens frequently if guard order is set and CMA is used. Another thing fixed in this patch is the target to reset order. In __free_one_page(), we check the buddy page whether it is a guard page or not. And, if so, we should clear guard attribute on the buddy page and reset order of it to 0. But, current code resets original page's order rather than buddy one's. Maybe, this doesn't have any problem, because whole merged page's order will be re-assigned soon. But, it is better to correct code. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Gioh Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm, compaction: prevent infinite loop in compact_zone | Vlastimil Babka | 1 | -2/+6
Several people have reported occasionally seeing processes stuck in compact_zone(), even triggering soft lockups, in 3.18-rc2+. Testing a revert of commit e14c720efdd7 ("mm, compaction: remember position within pageblock in free pages scanner") fixed the issue, although the stuck processes do not appear to involve the free scanner. Finally, by code inspection, the bug was found in isolate_migratepages() which uses a slightly different condition to detect if the migration and free scanners have met, than compact_finished(). That has not been a problem until commit e14c720efdd7 allowed the free scanner position between individual invocations to be in the middle of a pageblock. In a relatively rare case, the migration scanner position can end up at the beginning of a pageblock, with the free scanner position in the middle of the same pageblock. If it's the migration scanner's turn, isolate_migratepages() exits immediately (without updating the position), while compact_finished() decides to continue compaction, resulting in a potentially infinite loop. The system can recover only if another process creates enough high-order pages to make the watermark checks in compact_finished() pass. This patch fixes the immediate problem by bumping the migration scanner's position to meet the free scanner in isolate_migratepages(), when both are within the same pageblock. This causes compact_finished() to terminate properly. A more robust check in compact_finished() is planned as a cleanup for better future maintainability. Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner) Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: P. Christeas <[email protected]> Tested-by: P. 
Christeas <[email protected]> Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2 Reported-by: Norbert Preining <[email protected]> Tested-by: Norbert Preining <[email protected]> Link: https://lkml.org/lkml/2014/11/4/904 Reported-by: Pavel Machek <[email protected]> Link: https://lkml.org/lkml/2014/11/7/164 Cc: Joonsoo Kim <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm: alloc_contig_range: demote pages busy message from warn to info | Michal Nazarewicz | 1 | -3/+2
Having test_pages_isolated failure message as a warning confuses users into thinking that it is more serious than it really is. In reality, if called via CMA, allocation will be retried so a single test_pages_isolated failure does not prevent allocation from succeeding. Demote the warning message to an info message and reformat it such that the text "failed" does not appear and instead a less worrying "PFNS busy" is used. This message is trivially reproducible on a 10GB x86 machine on 3.16.y kernels configured with CONFIG_DMA_CMA. Signed-off-by: Michal Nazarewicz <[email protected]> Cc: Laurent Pinchart <[email protected]> Cc: Peter Hurley <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm/slab: fix unalignment problem on Malta with EVA due to slab merge | Joonsoo Kim | 1 | -0/+4
Unlike in SLUB, in SLAB an object sometimes does not start at the beginning of the slab. This causes an alignment problem now that slab merging is supported by commit 12220dea07f1 ("mm/slab: support slab merge"). Following is the report from Markos of a boot failure on Malta with EVA.
    Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
    pid_max: default: 32768 minimum: 301
    Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
    Kernel bug detected[#1]:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631
    task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
    epc : 80141190 alloc_unbound_pwq+0x234/0x304
    Not tainted
    ra : 80141184 alloc_unbound_pwq+0x228/0x304
    Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
    Call Trace:
    alloc_unbound_pwq+0x234/0x304
    apply_workqueue_attrs+0x11c/0x294
    __alloc_workqueue_key+0x23c/0x470
    init_workqueues+0x320/0x400
    do_one_initcall+0xe8/0x23c
    kernel_init_freeable+0x9c/0x224
    kernel_init+0x10/0x100
    ret_from_kernel_thread+0x14/0x1c
    [ end trace cb88537fdc8fa200 ]
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
alloc_unbound_pwq() allocates a slab object from pool_workqueue. This kmem_cache requires 256-byte alignment, but the current merging code doesn't honor that and merges it with kmalloc-256. kmalloc-256 requires only cacheline-size alignment, so the above failure occurs. On x86, however, kmalloc-256 happens to be aligned to 256 bytes, so the problem didn't show up there. To fix this, the patch introduces an alignment mismatch check in find_mergeable().
Signed-off-by: Joonsoo Kim <[email protected]> Reported-by: Markos Chandras <[email protected]> Tested-by: Markos Chandras <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
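The kind of alignment mismatch check mentioned in this commit can be sketched as a small predicate. The commit text only says that a check was added to find_mergeable(); the exact condition below (requested alignment must not exceed, and must evenly divide, the existing cache's alignment) and the name `align_mergeable_sketch` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: may a cache that wants 'wanted' byte alignment
 * be merged into an existing cache whose objects are aligned to
 * 'existing' bytes? A wanted alignment of 0 means "no requirement".
 */
static bool align_mergeable_sketch(size_t wanted, size_t existing)
{
    if (wanted == 0)
        return true;
    /* Every 'existing'-aligned address must also be 'wanted'-aligned. */
    return wanted <= existing && existing % wanted == 0;
}
```

With these assumptions, a cache that needs 256-byte alignment is not merged into one that only guarantees a 64-byte cacheline alignment, which is the situation the report above ran into.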
2014-11-13 | mm/page_alloc: restrict max order of merging on isolated pageblock | Joonsoo Kim | 3 | -29/+78
The current pageblock isolation logic can isolate each pageblock individually. This causes a freepage accounting problem if a freepage of pageblock order on an isolated pageblock is merged with another freepage on a normal pageblock. We can prevent the merge by restricting the maximum merge order to pageblock order when the freepage is on an isolated pageblock. A side effect of this change is that non-merged buddy freepages can remain even after pageblock isolation finishes, because undoing pageblock isolation only moves freepages from the isolate buddy list to the normal buddy list and does not consider merging. So the patch also makes undoing pageblock isolation consider freepage merging. On un-isolation, a freepage of more than pageblock order and its buddy are checked. If they are on a normal pageblock, instead of just moving the freepage, we isolate it and free it so that it gets merged. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Tang Chen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Heesub Shin <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Ritesh Harjani <[email protected]> Cc: Gioh Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm/page_alloc: move freepage counting logic to __free_one_page() | Joonsoo Kim | 1 | -11/+3
All callers of __free_one_page() have similar freepage counting logic, so we can move it into __free_one_page(). This reduces the lines of code and helps future maintenance. It is also a preparation step for "mm/page_alloc: restrict max order of merging on isolated pageblock", which fixes the freepage counting problem for freepages of more than pageblock order. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Tang Chen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Heesub Shin <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Ritesh Harjani <[email protected]> Cc: Gioh Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm/page_alloc: add freepage on isolate pageblock to correct buddy list | Joonsoo Kim | 1 | -5/+8
In free_pcppages_bulk(), we use the cached migratetype of a freepage to determine the type of buddy list the freepage will be added to. This information is stored when the freepage is added to the pcp list, so if isolation of this freepage's pageblock begins after it is stored, the cached information can be stale. In other words, it holds the original migratetype rather than MIGRATE_ISOLATE.
This stale information causes two problems. One is that we can't keep these freepages from being allocated. Although the pageblock is isolated, the freepage is added to the normal buddy list, so it can be allocated without any restriction. The other problem is incorrect freepage accounting. Freepages on an isolated pageblock should not be counted as free pages.
Following is the code snippet in free_pcppages_bulk().
    /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
    __free_one_page(page, page_to_pfn(page), zone, 0, mt);
    trace_mm_page_pcpu_drain(page, 0, mt);
    if (likely(!is_migrate_isolate_page(page))) {
        __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
        if (is_migrate_cma(mt))
            __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
    }
As the snippet shows, the current code already handles the second problem, incorrect freepage accounting, by re-fetching the pageblock migratetype through is_migrate_isolate_page(page). But because this re-fetched information isn't used for __free_one_page(), the first problem is not solved. This patch solves it by re-fetching the pageblock migratetype before __free_one_page() and passing it to __free_one_page().
In addition to moving the re-fetch up, this patch applies an optimization: the migratetype is re-fetched only if there is an isolated pageblock. Pageblock isolation is a rare event, so the re-fetch is avoided in the common case. This patch also corrects the migratetype in the tracepoint output.
Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Tang Chen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Heesub Shin <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Ritesh Harjani <[email protected]> Cc: Gioh Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13 | mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype | Joonsoo Kim | 2 | -2/+11
Before describing the bugs themselves, I first explain the definition of freepage.
1. pages on buddy list are counted as freepage.
2. pages on isolate migratetype buddy list are *not* counted as freepage.
3. pages on cma buddy list are counted as CMA freepage, too.
Now, I describe the problems and the related patches.
Patch 1: There are race conditions in getting the pageblock migratetype that result in misplacement of freepages on the buddy list, an incorrect freepage count and unavailability of freepages.
Patch 2: Freepages on the pcp list could have stale cached information for determining which migratetype buddy list they go to. This causes misplacement of freepages on the buddy list and an incorrect freepage count.
Patch 4: Merging between freepages on pageblocks of different migratetypes causes a freepage accounting problem. This patch fixes it.
Without patchset [3], the above problems don't happen in my CMA allocation test, because CMA reserved pages aren't used at all. So there is no chance for the above race. With patchset [3], I did a simple CMA allocation test and got the result below:
- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) in the background
- 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
- Result: more than 5000 freepages are missing from the count
With patchset [3] and this patchset, I found that no freepages are missing from the count, so I conclude that the problems are solved. These problems also occur in my simple memory offlining test.
This patch (of 4):
There are two paths to reach the core free function of the buddy allocator, __free_one_page(): one is free_one_page()->__free_one_page() and the other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page(). Each path has a race condition causing serious problems. This patch focuses on the first type of freepath; the following patch solves the problem in the second type of freepath.
In the first type of freepath, we get the migratetype of the page being freed without holding the zone lock, so it can be racy. There are two cases of this race.
1. pages are added to the isolate buddy list after the original migratetype is restored
    CPU1                                          CPU2
    get migratetype => return MIGRATE_ISOLATE
    call free_one_page() with MIGRATE_ISOLATE
                                                  grab the zone lock
                                                  unisolate pageblock
                                                  release the zone lock
    grab the zone lock
    call __free_one_page() with MIGRATE_ISOLATE
    freepage goes into isolate buddy list,
    although pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore until the next isolation attempt on this pageblock, because the freepage is on the isolate buddy list. The other is that freepage accounting can be wrong due to merging between different buddy lists. Freepages on the isolate buddy list aren't counted as freepages, but ones on the normal buddy list are. If a merge happens, the buddy freepage on the normal buddy list is inevitably moved to the isolate buddy list without any consideration of freepage accounting, so the count can become incorrect.
2. pages are added to the normal buddy list while the pageblock is isolated.
This is similar to the above case and may also cause two problems. One is that we can't keep these freepages from being allocated. Although the pageblock is isolated, the freepage is added to the normal buddy list, so it can be allocated without any restriction. The other problem is the same as in case 1, that is, incorrect freepage accounting.
This race condition can be prevented by checking the migratetype again while holding the zone lock. Because that is a somewhat heavy operation and isn't needed in the common case, we want to avoid rechecking as much as possible. So this patch introduces a new variable, nr_isolate_pageblock, in struct zone to check whether there is an isolated pageblock. With this, we can avoid re-checking the migratetype in the common case and do it only if there is an isolated pageblock or the migratetype is MIGRATE_ISOLATE.
This solve above mentioned problems. Changes from v3: Add one more check in free_one_page() that checks whether migratetype is MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Tang Chen <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Heesub Shin <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Ritesh Harjani <[email protected]> Cc: Gioh Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
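The "recheck only when it can matter" idea from this commit can be sketched as follows. All names (`zone_sketch`, `mt_for_free_sketch`, the two-value enum, the `rechecks` counter) are hypothetical, and the zone lock is only mentioned in comments; the sketch just shows which frees pay for the re-read.

```c
#include <assert.h>

enum mt_sketch { MT_MOVABLE_SKETCH, MT_ISOLATE_SKETCH };

/* Hypothetical stand-in for the relevant parts of struct zone. */
struct zone_sketch {
    unsigned long nr_isolate_pageblock; /* isolated pageblocks in this zone */
    enum mt_sketch block_mt;            /* true migratetype of the page's pageblock */
    unsigned long rechecks;             /* how often the slow re-read ran */
};

/* The comparatively heavy re-read; in the kernel this happens under the zone lock. */
static enum mt_sketch reread_mt_sketch(struct zone_sketch *z)
{
    z->rechecks++;
    return z->block_mt;
}

/*
 * Decide which migratetype to free the page with. 'cached' was read
 * earlier without the lock and may be stale; it is re-read only if the
 * zone has isolated pageblocks or the cached value itself says ISOLATE.
 */
static enum mt_sketch mt_for_free_sketch(struct zone_sketch *z,
                                         enum mt_sketch cached)
{
    if (z->nr_isolate_pageblock || cached == MT_ISOLATE_SKETCH)
        return reread_mt_sketch(z);
    return cached;
}
```

The second condition covers case 1 above: the pageblock may already have been un-isolated (so the counter is back to zero) while the caller still holds a stale MIGRATE_ISOLATE value.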
2014-11-13 | mm/compaction: skip the range until proper target pageblock is met | Joonsoo Kim | 1 | -0/+10
Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in the migration scanner") has a side-effect that changes the iteration range calculation. Before the change, block_end_pfn is calculated using start_pfn, but now it blindly adds pageblock_nr_pages to the previous value. This causes the problem that isolation_start_pfn is larger than block_end_pfn when we isolate the page with more than pageblock order. In this case, isolation would fail due to an invalid range parameter. To prevent this, this patch implements skipping the range until a proper target pageblock is met. Without this patch, CMA with more than pageblock order always fails but with this patch it will succeed. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
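The range adjustment described here can be sketched as a pure function on page frame numbers. The function and parameter names are hypothetical and the kernel's loop is structured differently; the sketch only shows how a scan position that has run past the current block end is given a new, valid end.

```c
#include <assert.h>

/* Round x up to the next multiple of a (a > 0). */
static unsigned long align_up_sketch(unsigned long x, unsigned long a)
{
    return (x + a - 1) / a * a;
}

/*
 * Given the current scan position 'pfn' and a block end that was
 * advanced blindly by one pageblock, return a block end that actually
 * lies beyond pfn and does not exceed end_pfn.
 */
static unsigned long fix_block_end_sketch(unsigned long pfn,
                                          unsigned long block_end_pfn,
                                          unsigned long end_pfn,
                                          unsigned long pageblock_nr_pages)
{
    assert(pageblock_nr_pages > 0);
    if (block_end_pfn > end_pfn)
        block_end_pfn = end_pfn;
    /* pfn can pass block_end_pfn after isolating a page larger than a pageblock. */
    if (pfn >= block_end_pfn) {
        block_end_pfn = align_up_sketch(pfn + 1, pageblock_nr_pages);
        if (block_end_pfn > end_pfn)
            block_end_pfn = end_pfn;
    }
    return block_end_pfn;
}
```

For example, with 512-page pageblocks, a scan position of 2048 and a stale block end of 1024 yield a new block end of 2560, so the start of the range is again below its end.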
2014-11-13 | Fix thinko in iov_iter_single_seg_count | Paul Mackerras | 1 | -2/+2
The branches of the if (i->type & ITER_BVEC) statement in iov_iter_single_seg_count() are the wrong way around; if ITER_BVEC is clear then we use i->bvec, when we should be using i->iov. This fixes it. In my case, the symptom that this caused was that a KVM guest doing filesystem operations on a virtual disk would result in one of qemu's threads on the host going into an infinite loop in generic_perform_write(). The loop would hit the copied == 0 case and call iov_iter_single_seg_count() to reduce the number of bytes to try to process, but because of the error, iov_iter_single_seg_count() would just return i->count and the loop made no progress and continued forever. Cc: [email protected] # 3.16+ Signed-off-by: Paul Mackerras <[email protected]> Signed-off-by: Al Viro <[email protected]>
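The corrected branch order can be sketched in user space as below. The structures and flag values are hypothetical stand-ins (`seg_iter_sketch`, `ITER_BVEC_SKETCH`, and so on), not the kernel's definitions; the point is only that the bvec array is consulted when the BVEC flag is set and the iovec array otherwise.

```c
#include <assert.h>
#include <stddef.h>

enum { ITER_IOVEC_SKETCH = 0, ITER_BVEC_SKETCH = 2 };

struct iovec_sketch { void *base; size_t len; };
struct bvec_sketch { void *page; size_t len; size_t offset; };

/* Hypothetical stand-in for struct iov_iter. */
struct seg_iter_sketch {
    int type;
    size_t iov_offset;  /* bytes consumed in the current segment */
    size_t count;       /* bytes remaining overall */
    size_t nr_segs;
    union {
        const struct iovec_sketch *iov;
        const struct bvec_sketch *bvec;
    } u;
};

static size_t min_sz_sketch(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Bytes available in the current segment, capped by the total count. */
static size_t single_seg_count_sketch(const struct seg_iter_sketch *i)
{
    if (i->nr_segs == 1)
        return i->count;
    if (i->type & ITER_BVEC_SKETCH)
        return min_sz_sketch(i->count, i->u.bvec->len - i->iov_offset);
    return min_sz_sketch(i->count, i->u.iov->len - i->iov_offset);
}
```

A caller that retries a failed copy with this smaller length makes progress segment by segment; with the branches swapped, the length would be computed from the wrong array and the retry could be no smaller than the original request.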
2014-11-13 | zbud, zswap: change module author email | Seth Jennings | 2 | -2/+2
Old email no longer viable. Signed-off-by: Seth Jennings <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2014-11-07 | Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs | Linus Torvalds | 1 | -3/+3
Pull xfs fixes from Dave Chinner: "This update fixes a warning in the new pagecache_isize_extended() and updates some related comments, another fix for zero-range misbehaviour, and an unfortunately large set of fixes for regressions in the bulkstat code.
The bulkstat fixes are large but necessary. I wouldn't normally push such a rework for a -rcX update, but right now xfsdump can silently create incomplete dumps on 3.17 and it's possible that even xfsrestore won't notice that the dumps were incomplete. Hence we need to get this update into 3.17-stable kernels ASAP.
In more detail, the refactoring work I committed in 3.17 has exposed a major hole in our QA coverage. With both xfsdump (the major user of bulkstat) and xfsrestore silently ignoring missing files in the dump/restore process, incomplete dumps were going unnoticed if they were being triggered. Many of the dump/restore filesets were so small that they didn't even have a chance of triggering the loop iteration bugs we introduced in 3.17, so we didn't exercise the code sufficiently, either.
We have already taken steps to improve QA coverage in xfstests to avoid this happening again, and I've done a lot of manual verification of dump/restore on very large data sets (tens of millions of inodes) over the past week to verify this patch set results in bulkstat behaving the same way as it does on 3.16.
Unfortunately, the fixes are not exactly simple - in tracking down the problem historic API warts were discovered (e.g. xfsdump has been working around a 20 year old bug in the bulkstat API for the past 10 years) and so that complicated the process of diagnosing and fixing the problems. i.e. we had to fix bugs in the code as well as discover and re-introduce the userspace visible API bugs that we unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.
Summary:
- incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in 3.17"
* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: track bulkstat progress by agino
  xfs: bulkstat error handling is broken
  xfs: bulkstat main loop logic is a mess
  xfs: bulkstat chunk-formatter has issues
  xfs: bulkstat chunk formatting cursor is broken
  xfs: bulkstat btree walk doesn't terminate
  mm: Fix comment before truncate_setsize()
  xfs: rework zero range to prevent invalid i_size updates
  mm: Remove false WARN_ON from pagecache_isize_extended()
  xfs: Check error during inode btree iteration in xfs_bulkstat()
  xfs: bulkstat doesn't release AGI buffer on error
2014-11-07VFS: Rename do_fallocate() to vfs_fallocate()Anna Schumaker1-1/+1
This function needs to be exported so it can be used by the NFSD module when responding to the new ALLOCATE and DEALLOCATE operations in NFS v4.2. Christoph Hellwig suggested renaming the function to stay consistent with how other vfs functions are named. Signed-off-by: Anna Schumaker <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2014-11-07mm: Fix comment before truncate_setsize()Jan Kara1-2/+3
XFS doesn't always hold i_mutex when calling truncate_setsize() and it uses a different lock to serialize truncates and writes. So fix the comment before truncate_setsize(). Reported-by: Jan Beulich <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2014-11-03Merge branch 'fixes-for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mappingLinus Torvalds1-24/+44
Pull CMA and DMA-mapping fixes from Marek Szyprowski: "This contains important fixes for recently introduced highmem support for default contiguous memory region used for dma-mapping subsystem" * 'fixes-for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: mm, cma: make parameters order consistent in func declaration and definition mm: cma: Use %pa to print physical addresses mm: cma: Ensure that reservations never cross the low/high mem boundary mm: cma: Always consider a 0 base address reservation as dynamic mm: cma: Don't crash on allocation if CMA area can't be activated
2014-10-30mm: Remove false WARN_ON from pagecache_isize_extended()Jan Kara1-1/+0
The WARN_ON checking whether i_mutex is held in pagecache_isize_extended() was wrong because some filesystems (e.g. XFS) use different locks for serialization of truncates / writes. So just remove the check. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2014-10-29mm/balloon_compaction: fix deflation when compaction is disabledKonstantin Khlebnikov1-0/+2
If CONFIG_BALLOON_COMPACTION=n, balloon_page_insert() does not link pages with the balloon and doesn't set the PagePrivate flag. As a result, balloon_page_dequeue() cannot get any pages because it thinks that all of them are isolated. Without balloon compaction nobody can isolate ballooned pages. It's safe to remove this check. Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management"). Signed-off-by: Konstantin Khlebnikov <[email protected]> Reported-by: Matt Mullins <[email protected]> Cc: <[email protected]> [3.17] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29mm/slab_common: don't check for duplicate cache namesMikulas Patocka1-10/+0
The SLUB cache merges caches with the same size and alignment and there was a long-standing bug with this behavior: - create the cache named "foo" - create the cache named "bar" (which is merged with "foo") - delete the cache named "foo" (but it stays allocated because "bar" uses it) - create the cache named "foo" again - it fails because the name "foo" is already used That bug was fixed in commit 694617474e33 ("slab_common: fix the check for duplicate slab names") by not warning on duplicate cache names when the SLUB subsystem is used. Recently, cache merging was implemented with the SLAB subsystem too, in 12220dea07f1 ("mm/slab: support slab merge"). Therefore we need to stop checking for duplicate names even for the SLAB subsystem. This patch fixes the bug by removing the check. Signed-off-by: Mikulas Patocka <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
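The four-step sequence above can be reproduced with a small userspace model. This is only a sketch under stated assumptions: `toy_cache`, `toy_cache_create()` and `toy_cache_destroy()` are hypothetical names, merging is decided on size alone, and nothing here is the kernel's slab_common code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CACHES 8

/* Hypothetical toy cache: one slot may be shared by several creators. */
struct toy_cache {
    const char *name;   /* name given by the first creator */
    size_t size;        /* object size; the only merge criterion in this model */
    int refcount;       /* number of creators sharing the slot; 0 means free */
};

static struct toy_cache caches[MAX_CACHES];

/* Returns the slot index on success, or -1 on failure. */
static int toy_cache_create(const char *name, size_t size, bool check_dup_names)
{
    int i, free_slot = -1;

    if (check_dup_names) {
        /* The check that the patch removes: refuse a name that is still registered. */
        for (i = 0; i < MAX_CACHES; i++)
            if (caches[i].refcount > 0 && strcmp(caches[i].name, name) == 0)
                return -1;
    }

    for (i = 0; i < MAX_CACHES; i++) {
        if (caches[i].refcount > 0 && caches[i].size == size) {
            caches[i].refcount++;   /* merge with the existing cache */
            return i;
        }
        if (caches[i].refcount == 0 && free_slot < 0)
            free_slot = i;
    }

    if (free_slot < 0)
        return -1;
    caches[free_slot] = (struct toy_cache){ name, size, 1 };
    return free_slot;
}

/* Drops one reference; the slot stays allocated while other creators use it. */
static void toy_cache_destroy(int idx)
{
    if (caches[idx].refcount > 0)
        caches[idx].refcount--;
}
```

In this model, creating "foo" and "bar" with the same size, destroying "foo", and creating "foo" again fails while the name check is enabled and succeeds once it is disabled.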
2014-10-29mm: rmap: split out page_remove_file_rmap()Johannes Weiner1-32/+46
page_remove_rmap() has too many branches on PageAnon() and is hard to follow. Move the file part into a separate function. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29mm: memcontrol: fix missed end-writeback page accountingJohannes Weiner3-68/+79
Commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") changed page migration to uncharge the old page right away. The page is locked, unmapped, truncated, and off the LRU, but it could race with writeback ending, which then doesn't unaccount the page properly:
  test_clear_page_writeback()              migration
                                             wait_on_page_writeback()
    TestClearPageWriteback()
                                             mem_cgroup_migrate()
                                               clear PCG_USED
    mem_cgroup_update_page_stat()
      if (PageCgroupUsed(pc))
        decrease memcg pages under writeback
    release pc->mem_cgroup->move_lock
The per-page statistics interface is heavily optimized to avoid a function call and a lookup_page_cgroup() in the file unmap fast path, which means it doesn't verify whether a page is still charged before clearing PageWriteback() and it has to do it in the stat update later. Rework it so that it looks up the page's memcg once at the beginning of the transaction and then uses it throughout. The charge will be verified before clearing PageWriteback() and migration can't uncharge the page as long as that is still set. The RCU lock will protect the memcg past uncharge. As far as losing the optimization goes, the following test results are from a microbenchmark that maps, faults, and unmaps a 4GB sparse file three times in a nested fashion, so that there are two negative passes that don't account but still go through the new transaction overhead.
There is no actual difference:
  old: 33.195102545 seconds time elapsed ( +- 0.01% )
  new: 33.199231369 seconds time elapsed ( +- 0.03% )
The time spent in page_remove_rmap()'s callees still adds up to the same, but the time spent in the function itself seems reduced:
       # Children  Self   Command        Shared Object      Symbol
  old:   0.12%     0.11%  filemapstress  [kernel.kallsyms]  [k] page_remove_rmap
  new:   0.12%     0.08%  filemapstress  [kernel.kallsyms]  [k] page_remove_rmap
Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: <[email protected]> [3.17.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29mm: page-writeback: inline account_page_dirtied() into single callerJohannes Weiner1-19/+4
A follow-up patch would have changed the call signature. To save the trouble, just fold it instead. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: <[email protected]> [3.17.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()Yasuaki Ishimatsu1-5/+0
When hot adding the same memory after hot removal, the following messages are shown: WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426() ... Call Trace: dump_stack+0x46/0x58 warn_slowpath_common+0x81/0xa0 warn_slowpath_null+0x1a/0x20 free_area_init_node+0x3fe/0x426 hotadd_new_pgdat+0x90/0x110 add_memory+0xd4/0x200 acpi_memory_device_add+0x1aa/0x289 acpi_bus_attach+0xfd/0x204 acpi_bus_attach+0x178/0x204 acpi_bus_scan+0x6a/0x90 acpi_device_hotplug+0xe8/0x418 acpi_hotplug_work_fn+0x1f/0x2b process_one_work+0x14e/0x3f0 worker_thread+0x11b/0x510 kthread+0xe1/0x100 ret_from_fork+0x7c/0xb0 The detailed explanation is as follows: When hot removing memory, pgdat is set to 0 in try_offline_node(). But if the pgdat is allocated by the bootmem allocator, the clearing step is skipped. And when hot adding the same memory, the uninitialized pgdat is reused. But free_area_init_node() checks whether pgdat is set to zero. As a result, free_area_init_node() hits WARN_ON(). This patch clears pgdat which is allocated by the bootmem allocator in try_offline_node(). Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Zhen <[email protected]> Cc: Wang Nan <[email protected]> Cc: Tang Chen <[email protected]> Reviewed-by: Toshi Kani <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29mm, thp: fix collapsing of hugepages on madviseDavid Rientjes2-9/+10
If an anonymous mapping is not allowed to fault thp memory and then madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never collapse this memory into thp memory. This occurs because the madvise(2) handler for thp, hugepage_madvise(), clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags until the final action of madvise_behavior(). This causes khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when the vma had previously had VM_NOHUGEPAGE set. Fix this by passing the correct vma flags to the khugepaged mm slot handler. There's no chance khugepaged can run on this vma until after madvise_behavior() returns since we hold mm->mmap_sem. It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags in hugepage_madvise(), but I didn't want to introduce special case behavior into madvise_behavior(). I think it's best to just let it always set vma->vm_flags itself. Signed-off-by: David Rientjes <[email protected]> Reported-by: Suleiman Souhlal <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
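The stale-flags problem described above can be modelled in a few lines. This is a sketch only: the `toy_` names and the flag values are hypothetical and do not match the kernel's real VM_* constants or function signatures.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag values; the kernel's real VM_* constants differ. */
#define TOY_VM_HUGEPAGE   0x1UL
#define TOY_VM_NOHUGEPAGE 0x2UL

struct toy_vma {
    unsigned long vm_flags;
};

/* Stand-in for the khugepaged registration check: refuse NOHUGEPAGE vmas. */
static bool toy_khugepaged_enter(unsigned long vm_flags)
{
    return !(vm_flags & TOY_VM_NOHUGEPAGE);
}

/* Before the fix: the check reads vma->vm_flags, which has not been updated yet. */
static bool toy_madvise_hugepage_old(struct toy_vma *vma, unsigned long *vm_flags)
{
    *vm_flags &= ~TOY_VM_NOHUGEPAGE;    /* only the caller's on-stack copy changes */
    *vm_flags |= TOY_VM_HUGEPAGE;
    return toy_khugepaged_enter(vma->vm_flags);
}

/* After the fix: the updated flags are passed to the check explicitly. */
static bool toy_madvise_hugepage_new(struct toy_vma *vma, unsigned long *vm_flags)
{
    (void)vma;                          /* vma->vm_flags is written later by the caller */
    *vm_flags &= ~TOY_VM_NOHUGEPAGE;
    *vm_flags |= TOY_VM_HUGEPAGE;
    return toy_khugepaged_enter(*vm_flags);
}
```

For a vma that starts with only the NOHUGEPAGE flag set, the old variant reports that khugepaged would skip it, while the new variant registers it.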
2014-10-29mm: free compound page with correct orderYu Zhao1-2/+2
A compound page should be freed by put_page() or free_pages() with the correct order. Not doing so leaks the tail pages. The compound order can be obtained from compound_order() or, in our case, by using HPAGE_PMD_ORDER. Some people would argue the latter is faster but I prefer the former, which is more general. This bug was observed not just on our servers (the worst case we saw is 11G leaked on a 48G machine) but also on our workstations running an Ubuntu-based distro. $ cat /proc/vmstat | grep thp_zero_page_alloc thp_zero_page_alloc 55 thp_zero_page_alloc_failed 0 This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked. Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page") Signed-off-by: Yu Zhao <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Bob Liu <[email protected]> Cc: <[email protected]> [3.8+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
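The leak formula in the message can be evaluated directly. A minimal sketch, assuming the 2M huge page and 4K base page sizes quoted above; `leaked_bytes()` is a hypothetical helper, and the guard for a zero counter is an added assumption to avoid unsigned wrap-around.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_BYTES (2ULL * 1024 * 1024)  /* 2M, as quoted in the message */
#define BASE_PAGE_BYTES (4ULL * 1024)         /* 4K, as quoted in the message */

/* (thp_zero_page_alloc - 1) * (2M - 4K), taken from the commit message. */
static uint64_t leaked_bytes(uint64_t thp_zero_page_alloc)
{
    if (thp_zero_page_alloc == 0)   /* assumption: no allocations, nothing leaked */
        return 0;
    return (thp_zero_page_alloc - 1) * (HUGE_PAGE_BYTES - BASE_PAGE_BYTES);
}
```

With the counter value 55 from the vmstat output above, this gives 54 * 2093056 = 113025024 bytes, roughly 108 MiB.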
2014-10-29mm/compaction.c: avoid premature range skip in isolate_migratepages_rangeJoonsoo Kim1-0/+3
Commit edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()") commonizes the isolate_migratepages variants and makes them use isolate_migratepages_block(). isolate_migratepages_block() can stop once enough pages are isolated, but there is no code in isolate_migratepages_range() to handle this case. As a result, even if isolate_migratepages_block() returns prematurely without checking all pages in the range, isolate_migratepages_block() is called repeatedly on the following pageblock and some pages in the previous range are never checked. CMA then fails frequently. To fix this problem, this patch lets isolate_migratepages_range() know that enough pages have been isolated and stops the isolation in that case. Note that isolate_migratepages() has no such problem, because it always stops the isolation after just one call of isolate_migratepages_block(). Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Zhang Yanfei <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
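The skipped-pages effect can be illustrated with a toy model of the range walk. This is a sketch under stated assumptions: the `toy_` functions, the 8-page pageblock, and the limit of 4 isolated pages are all hypothetical, and every page is assumed to be isolatable.

```c
#include <assert.h>

#define TOY_PAGEBLOCK_PAGES 8UL   /* hypothetical pageblock size */
#define TOY_ISOLATE_LIMIT   4U    /* hypothetical "enough pages" threshold */

/*
 * Scans [pfn, block_end) and isolates every page, but stops early once
 * enough pages are isolated. Returns the pfn where scanning stopped.
 */
static unsigned long toy_isolate_block(unsigned long pfn, unsigned long block_end,
                                       unsigned int *nr_isolated)
{
    while (pfn < block_end) {
        (*nr_isolated)++;
        pfn++;
        if (*nr_isolated >= TOY_ISOLATE_LIMIT)
            break;
    }
    return pfn;
}

/*
 * Walks [start, end) one pageblock at a time. Returns the pfn at which the
 * caller should resume and stores the number of never-scanned pages in *skipped.
 */
static unsigned long toy_isolate_range(unsigned long start, unsigned long end,
                                       int fixed, unsigned long *skipped)
{
    unsigned long pfn = start;
    unsigned int nr_isolated = 0;

    *skipped = 0;
    while (pfn < end) {
        unsigned long block_end = (pfn / TOY_PAGEBLOCK_PAGES + 1) * TOY_PAGEBLOCK_PAGES;
        unsigned long stopped;

        if (block_end > end)
            block_end = end;
        stopped = toy_isolate_block(pfn, block_end, &nr_isolated);

        /* Fixed behaviour: stop here so the caller can resume at 'stopped'. */
        if (fixed && nr_isolated >= TOY_ISOLATE_LIMIT)
            return stopped;

        /* Old behaviour: jump to the next pageblock regardless. */
        *skipped += block_end - stopped;
        pfn = block_end;
    }
    return pfn;
}
```

Over a 16-page range, the old walk reports the whole range as done while leaving pages unscanned; the fixed walk stops at the first early return with nothing skipped.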
2014-10-29cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.Wang Nan1-0/+1
Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but the corresponding kmemleak_free() is missing, which causes kmemleak to be wrongly disabled after memory offlining. The log is pasted at the end of this commit message. This patch adds kmemleak_free() to free_page_cgroup(). During page offlining, this patch removes the corresponding entries in the kmemleak rbtree. After that, the freed memory can be allocated again by other subsystems without killing kmemleak. bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak Offlined Pages 32768 kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing) CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x46/0x58 create_object+0x266/0x2c0 kmemleak_alloc+0x26/0x50 kmem_cache_alloc+0xd3/0x160 __sigqueue_alloc+0x49/0xd0 __send_signal+0xcb/0x410 send_signal+0x45/0x90 __group_send_sig_info+0x13/0x20 do_notify_parent+0x1bb/0x260 do_exit+0x767/0xa40 do_group_exit+0x44/0xa0 SyS_exit_group+0x17/0x20 system_call_fastpath+0x16/0x1b kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xffff880016900000 (size 524288): kmemleak: comm "swapper/0", pid 0, jiffies 4294667296 kmemleak: min_count = 0 kmemleak: count = 0 kmemleak: flags = 0x1 kmemleak: checksum = 0 kmemleak: backtrace: log_early+0x63/0x77 kmemleak_alloc+0x4b/0x50 init_section_page_cgroup+0x7f/0xf5 page_cgroup_init+0xc5/0xd0 start_kernel+0x333/0x408 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0xf5/0xfc Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations) Signed-off-by: Wang Nan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: <[email protected]> [3.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-10-29percpu: off by one in BUG_ON()Dan Carpenter1-1/+1
The unit_map[] array has "nr_cpu_ids" number of elements. It's allocated a few lines earlier in the function. So this test should be >= instead of >. Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
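The off-by-one can be shown in isolation. A minimal sketch, assuming a made-up `TOY_NR_CPU_IDS` constant standing in for `nr_cpu_ids` and two hypothetical helper functions; it is not the code in the percpu allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nr_cpu_ids: the number of elements in unit_map[]. */
#define TOY_NR_CPU_IDS 4

/* Old test (cpu > N): lets cpu == N through, one past the last valid index. */
static bool index_ok_old(size_t cpu)
{
    return !(cpu > TOY_NR_CPU_IDS);
}

/* Corrected test (cpu >= N): valid indices are 0 .. N - 1. */
static bool index_ok_new(size_t cpu)
{
    return !(cpu >= TOY_NR_CPU_IDS);
}
```

With four elements, index 4 passes the old test but is rejected by the corrected one, while index 3 is accepted by both.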
2014-10-28zap_pte_range: update addr when forcing flush after TLB batching failureWill Deacon1-0/+1
When unmapping a range of pages in zap_pte_range, the page being unmapped is added to an mmu_gather_batch structure for asynchronous freeing. If we run out of space in the batch structure before the range has been completely unmapped, then we break out of the loop, force a TLB flush and free the pages that we have batched so far. If there are further pages to unmap, then we resume the loop where we left off. Unfortunately, we forget to update addr when we break out of the loop, which causes us to truncate the range being invalidated as the end address is exclusive. When we re-enter the loop at the same address, the page has already been freed and the pte_present test will fail, meaning that we do not reconsider the address for invalidation. This patch fixes the problem by incrementing addr by the PAGE_SIZE before breaking out of the loop on batch failure. Signed-off-by: Will Deacon <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
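The effect of the missing increment can be seen in a toy model of the loop. This is a sketch only: `toy_unmap_until_batch_full()`, the 4096-byte page size, and the batch capacity are hypothetical, and the real zap_pte_range() does considerably more work per page.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* hypothetical page size for the model */

/*
 * Walks [addr, end) one page at a time, "batching" each page. When the
 * batch fills up, the loop stops and the function returns the exclusive
 * end address of the range that will be flushed.
 */
static unsigned long toy_unmap_until_batch_full(unsigned long addr, unsigned long end,
                                                size_t batch_capacity, int fixed)
{
    size_t batched = 0;

    for (; addr != end; addr += TOY_PAGE_SIZE) {
        batched++;                        /* the page at addr is now in the batch */
        if (batched == batch_capacity) {
            if (fixed)
                addr += TOY_PAGE_SIZE;    /* cover the page that was just batched */
            break;
        }
    }
    return addr;                          /* flushed range is [start, addr) */
}
```

With a batch capacity of three pages, the old loop returns an end address that covers only two pages, so the third page is freed without being invalidated; the fixed loop covers all three.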