aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2011-01-20memcg: correctly order reading PCG_USED and pc->mem_cgroupJohannes Weiner1-18/+9
The placement of the read-side barrier is confused: the writer first sets pc->mem_cgroup, then PCG_USED. The read-side barrier has to be between testing PCG_USED and reading pc->mem_cgroup. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20mm: fix truncate_setsize() commentJan Kara1-6/+5
Contrary to what the comment says, truncate_setsize() should be called *before* filesystem truncated blocks. Signed-off-by: Jan Kara <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20memcg: fix rmdir, force_empty with THPKAMEZAWA Hiroyuki1-11/+26
Now, when THP is enabled, memcg's rmdir() function is broken because move_account() for THP page is not supported. This will cause account leak or -EBUSY issue at rmdir(). This patch fixes the issue by supporting move_account() THP pages. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20memcg: fix LRU accounting with THPKAMEZAWA Hiroyuki1-4/+18
memory cgroup's LRU stat should take care of size of pages because Transparent Hugepage inserts hugepage into LRU. If this value is the number wrong, memory reclaim will not work well. Note: only head page of THP's huge page is linked into LRU. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20memcg: fix USED bit handling at uncharge in THPKAMEZAWA Hiroyuki2-40/+53
Now, under THP: at charge: - PageCgroupUsed bit is set to all page_cgroup on a hugepage. ....set to 512 pages. at uncharge - PageCgroupUsed bit is unset on the head page. So, some pages will remain with "Used" bit. This patch fixes that Used bit is set only to the head page. Used bits for tail pages will be set at splitting if necessary. This patch adds this lock order: compound_lock() -> page_cgroup_move_lock(). [[email protected]: fix warning] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20memcg: modify accounting function for supporting THP betterKAMEZAWA Hiroyuki1-13/+12
mem_cgroup_charge_statisics() was designed for charging a page but now, we have transparent hugepage. To fix problems (in following patch) it's required to change the function to get the number of pages as its arguments. The new function gets following as argument. - type of page rather than 'pc' - size of page which is accounted. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20mm: compaction: prevent division-by-zero during user-requested compactionJohannes Weiner1-0/+11
Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim"), compaction skipped calculating the fragmentation index of a zone when compaction was explicitely requested through the procfs knob. However, when compaction_suitable was introduced, it did not come with an extra check for order == -1, set on explicit compaction requests, and passed this order on to the fragmentation index calculation, where it overshifts the number of requested pages, leading to a division by zero. This patch makes sure that order == -1 is recognized as the flag it is rather than passing it along as valid order parameter. [[email protected]: add comment, per Mel] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20mm/vmscan.c: remove duplicate include of compaction.hJesper Juhl1-1/+0
Signed-off-by: Jesper Juhl <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20memblock: fix memblock_is_region_memory()Tomi Valkeinen1-4/+4
memblock_is_region_memory() uses reserved memblocks to search for the given region, while it should use the memory memblocks. I encountered the problem with OMAP's framebuffer ram allocation. Normally the ram is allocated dynamically, and this function is not called. However, if we want to pass the framebuffer from the bootloader to the kernel (to retain the boot image), this function is used to check the validity of the kernel parameters for the framebuffer ram area. Signed-off-by: Tomi Valkeinen <[email protected]> Acked-by: Yinghai Lu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-20thp: keep highpte mapped until it is no longer neededJohannes Weiner1-1/+2
Two users reported THP-related crashes on 32-bit x86 machines. Their oops reports indicated an invalid pte, and subsequent code inspection showed that the highpte is actually used after unmap. The fix is to unmap the pte only after all operations against it are finished. Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Ilya Dryomov <[email protected]> Reported-by: werner <[email protected]> Cc: Andrea Arcangeli <[email protected]> Tested-by: Ilya Dryomov <[email protected]> Tested-by: Steven Rostedt <[email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-17Revert "mm: simplify code of swap.c"Linus Torvalds1-54/+47
This reverts commit d8505dee1a87b8d41b9c4ee1325cd72258226fbc. Chris Mason ended up chasing down some page allocation errors and pages stuck waiting on the IO scheduler, and was able to narrow it down to two commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock contention") and d8505dee1a87 ("mm: simplify code of swap.c"). This reverts the second one. Reported-and-debugged-by: Chris Mason <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: linux-mm <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-17Revert "mm: batch activate_page() to reduce lock contention"Linus Torvalds3-92/+13
This reverts commit 744ed1442757767ffede5008bb13e0805085902e. Chris Mason ended up chasing down some page allocation errors and pages stuck waiting on the IO scheduler, and was able to narrow it down to two commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock contention") and d8505dee1a87 ("mm: simplify code of swap.c"). This reverts the first of them. Reported-and-debugged-by: Chris Mason <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: linux-mm <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-16fix non-x86 build failure in pmdp_get_and_clearAndrea Arcangeli1-7/+4
pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as BUG() and they were defined only to diminish the risk of build issues on not-x86 archs and to be consistent with the generic pte methods previously defined in include/asm-generic/pgtable.h. But they are causing more trouble than they were supposed to solve, so it's simpler not to define them when THP is off. This is also correcting the export of pmdp_splitting_flush which is currently unused (x86 isn't using the generic implementation in mm/pgtable-generic.c and no other arch needs that [yet]). Signed-off-by: Andrea Arcangeli <[email protected]> Sam Ravnborg <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-15mm/slab.c: make local symbols staticH Hartley Sweeten1-3/+3
Local symbols should be static. Signed-off-by: H Hartley Sweeten <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Matt Mackall <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2011-01-15Merge branch 'slub/hotplug' into slab/urgentPekka Enberg2-2/+6
2011-01-13Merge branch 'release' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...
2011-01-13memcg: fix memory migration of shmem swapcacheDaisuke Nishimura2-4/+3
In the current implementation mem_cgroup_end_migration() decides whether the page migration has succeeded or not by checking "oldpage->mapping". But if we are tring to migrate a shmem swapcache, the page->mapping of it is NULL from the begining, so the check would be invalid. As a result, mem_cgroup_end_migration() assumes the migration has succeeded even if it's not, so "newpage" would be freed while it's not uncharged. This patch fixes it by passing mem_cgroup_end_migration() the result of the page migration. Signed-off-by: Daisuke Nishimura <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: Minchan Kim <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: use [kv]zalloc[_node] rather than [kv]malloc+memsetJesper Juhl1-6/+3
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then followed by memset() to zero the memory. This can be more efficiently achieved by using kzalloc() and vzalloc(). There's also one situation where we can use kzalloc_node() - this is what's new in this version of the patch. Signed-off-by: Jesper Juhl <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: fix deadlock between cpuset and memcgDaisuke Nishimura1-35/+49
Commit b1dd693e ("memcg: avoid deadlock between move charge and try_charge()") can cause another deadlock about mmap_sem on task migration if cpuset and memcg are mounted onto the same mount point. After the commit, cgroup_attach_task() has sequence like: cgroup_attach_task() ss->can_attach() cpuset_can_attach() mem_cgroup_can_attach() down_read(&mmap_sem) (1) ss->attach() cpuset_attach() mpol_rebind_mm() down_write(&mmap_sem) (2) up_write(&mmap_sem) cpuset_migrate_mm() do_migrate_pages() down_read(&mmap_sem) up_read(&mmap_sem) mem_cgroup_move_task() mem_cgroup_clear_mc() up_read(&mmap_sem) We can cause deadlock at (2) because we've already aquire the mmap_sem at (1). But the commit itself is necessary to fix deadlocks which have existed before the commit like: Ex.1) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | down_write(&mmap_sem) mc.moving_task = current | .. mem_cgroup_precharge_mc() | __mem_cgroup_try_charge() mem_cgroup_count_precharge() | prepare_to_wait() down_read(&mmap_sem) | if (mc.moving_task) -> cannot aquire the lock | -> true | schedule() | -> move charge should wake it up Ex.2) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | mc.moving_task = current | mem_cgroup_precharge_mc() | mem_cgroup_count_precharge() | down_read(&mmap_sem) | .. | up_read(&mmap_sem) | | down_write(&mmap_sem) mem_cgroup_move_task() | .. mem_cgroup_move_charge() | __mem_cgroup_try_charge() down_read(&mmap_sem) | prepare_to_wait() -> cannot aquire the lock | if (mc.moving_task) | -> true | schedule() | -> move charge should wake it up This patch fixes all of these problems by: 1. revert the commit. 2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge() has released the mmap_sem. 3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel all extra charges, wake up all waiters, and retry trylock. Signed-off-by: Daisuke Nishimura <[email protected]> Reported-by: Ben Blum <[email protected]> Cc: Miao Xie <[email protected]> Cc: David Rientjes <[email protected]> Cc: Paul Menage <[email protected]> Cc: Hiroyuki Kamezawa <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list()Minchan Kim1-1/+0
Signed-off-by: Minchan Kim <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: fix unit mismatch in memcg oom limit calculationJohannes Weiner1-2/+3
Adding the number of swap pages to the byte limit of a memory control group makes no sense. Convert the pages to bytes before adding them. The only user of this code is the OOM killer, and the way it is used means that the error results in a higher OOM badness value. Since the cgroup limit is the same for all tasks in the cgroup, the error should have no practical impact at the moment. But let's not wait for future or changing users to trip over it. Signed-off-by: Johannes Weiner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: add lock to synchronize page accounting and migrationKAMEZAWA Hiroyuki1-2/+7
Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page accounting and migration code. This reworks the locking scheme of _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable. 1. If pages are being migrated from a memcg, then updates to that memcg page statistics are protected by grabbing PCG_MOVE_LOCK using move_lock_page_cgroup(). In an upcoming commit, memcg dirty page accounting will be updating memcg page accounting (specifically: num writeback pages) from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt by disabling irq on the local processor when grabbing the PCG_MOVE_LOCK. 2. lock for update_page_stat is used only for avoiding race with move_account(). So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem is between mem_cgroup_update_page_stat() and mem_cgroup_move_account_page(). Trade-off: * Changing lock_page_cgroup() to always disable IRQ (or local_bh) has some impacts on performance and I think it's bad to disable IRQ when it's not necessary. * adding a new lock makes move_account() slower. Score is here. Performance Impact: moving a 8G anon process. Before: real 0m0.792s user 0m0.000s sys 0m0.780s After: real 0m0.854s user 0m0.000s sys 0m0.842s This score is bad but planned patches for optimization can reduce this impact. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Greg Thelen <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Daisuke Nishimura <[email protected]> Cc: Andrea Righi <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13memcg: create extensible page stat update routinesGreg Thelen2-11/+9
Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines: * mem_cgroup_inc_page_stat() * mem_cgroup_dec_page_stat() As before, only the file_mapped statistic is managed. However, these more general interfaces allow for new statistics to be more easily added. New statistics are added with memcg dirty page accounting. Signed-off-by: Greg Thelen <[email protected]> Signed-off-by: Andrea Righi <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: batch activate_page() to reduce lock contentionShaohua Li3-13/+92
The zone->lru_lock is heavily contented in workload where activate_page() is frequently used. We could do batch activate_page() to reduce the lock contention. The batched pages will be added into zone list when the pool is full or page reclaim is trying to drain them. For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes, processes shared map to the file. Each process read access the whole file and then exit. The process exit will do unmap_vmas() and cause a lot of activate_page() call. In such workload, we saw about 58% total time reduction with below patch. Other workloads with a lot of activate_page also benefits a lot too. I tested some microbenchmarks: case-anon-cow-rand-mt 0.58% case-anon-cow-rand -3.30% case-anon-cow-seq-mt -0.51% case-anon-cow-seq -5.68% case-anon-r-rand-mt 0.23% case-anon-r-rand 0.81% case-anon-r-seq-mt -0.71% case-anon-r-seq -1.99% case-anon-rx-rand-mt 2.11% case-anon-rx-seq-mt 3.46% case-anon-w-rand-mt -0.03% case-anon-w-rand -0.50% case-anon-w-seq-mt -1.08% case-anon-w-seq -0.12% case-anon-wx-rand-mt -5.02% case-anon-wx-seq-mt -1.43% case-fork 1.65% case-fork-sleep -0.07% case-fork-withmem 1.39% case-hugetlb -0.59% case-lru-file-mmap-read-mt -0.54% case-lru-file-mmap-read 0.61% case-lru-file-mmap-read-rand -2.24% case-lru-file-readonce -0.64% case-lru-file-readtwice -11.69% case-lru-memcg -1.35% case-mmap-pread-rand-mt 1.88% case-mmap-pread-rand -15.26% case-mmap-pread-seq-mt 0.89% case-mmap-pread-seq -69.72% case-mmap-xread-rand-mt 0.71% case-mmap-xread-seq-mt 0.38% The most significent are: case-lru-file-readtwice -11.69% case-mmap-pread-rand -15.26% case-mmap-pread-seq -69.72% which use activate_page a lot. others are basically variations because each run has slightly difference. [[email protected]: coding-style fixes] Signed-off-by: Shaohua Li <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: simplify code of swap.cShaohua Li1-47/+54
Clean up code and remove duplicate code. Next patch will use pagevec_lru_move_fn introduced here too. Signed-off-by: Shaohua Li <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/page_alloc.c: don't cache `current' in a localAndrew Morton1-14/+10
It's old-fashioned and unneeded. akpm:/usr/src/25> size mm/page_alloc.o text data bss dec hex filename 39884 1241317 18808 1300009 13d629 mm/page_alloc.o (before) 39838 1241317 18808 1299963 13d5fb mm/page_alloc.o (after) Acked-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: fix hugepage migrationHugh Dickins1-17/+6
2.6.37 added an unmap_and_move_huge_page() for memory failure recovery, but its anon_vma handling was still based around the 2.6.35 conventions. Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma, drop_anon_vma in the same way as we're now changing unmap_and_move(). I don't particularly like to propose this for stable when I've not seen its problems in practice nor tested the solution: but it's clearly out of synch at present. Signed-off-by: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Jun'ichi Nomura" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: <[email protected]> [2.6.37, 2.6.36] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: fix migration hangs on anon_vma lockHugh Dickins1-29/+19
Increased usage of page migration in mmotm reveals that the anon_vma locking in unmap_and_move() has been deficient since 2.6.36 (or even earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a ("mm: fix hang on anon_vma->root->lock") missed the issue here: the anon_vma to which we get a reference may already have been freed back to its slab (it is in use when we check page_mapped, but that can change), and so its anon_vma->root may be switched at any moment by reuse in anon_vma_prepare. Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's not: just rely on page_lock_anon_vma() to do all the hard thinking for us, then we don't need any rcu read locking over here. In removing the rcu_unlock label: since PageAnon is a bit in page->mapping, it's impossible for a !page->mapping page to be anon; but insert VM_BUG_ON in case the implementation ever changes. [[email protected]: coding-style fixes] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Jun'ichi Nomura" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: <[email protected]> [2.6.37, 2.6.36] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13ksm: drain pagevecs to lruHugh Dickins1-0/+12
It was hard to explain the page counts which were causing new LTP tests of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally. Signed-off-by: Hugh Dickins <[email protected]> Reported-by: CAI Qian <[email protected]> Cc:Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13hugetlb: fix handling of parse errors in sysfsEric B Munson1-4/+2
When parsing changes to the huge page pool sizes made from userspace via the sysfs interface, bogus input values are being covered up by nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0 when strict_strtoul returns an error. This can cause an infinite loop in the nr_hugepages_store code. This patch changes the return value for these functions to -EINVAL when strict_strtoul returns an error. Signed-off-by: Eric B Munson <[email protected]> Reported-by: CAI Qian <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13hugetlb: do not allow pagesize >= MAX_ORDER pool adjustmentEric B Munson1-2/+21
Huge pages with order >= MAX_ORDER must be allocated at boot via the kernel command line, they cannot be allocated or freed once the kernel is up and running. Currently we allow values to be written to the sysfs and sysctl files controling pool size for these huge page sizes. This patch makes the store functions for nr_hugepages and nr_overcommit_hugepages return -EINVAL when the pool for a page size >= MAX_ORDER is changed. [[email protected]: avoid multiple return paths in nr_hugepages_store_common()] [[email protected]: add checking in hugetlb_overcommit_handler()] Signed-off-by: Eric B Munson <[email protected]> Reported-by: CAI Qian <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13hugetlb: check the return value of string conversion in sysctl handlerMichal Hocko1-6/+12
proc_doulongvec_minmax may fail if the given buffer doesn't represent a valid number. If we provide something invalid we will initialize the resulting value (nr_overcommit_huge_pages in this case) to a random value from the stack. The issue was introduced by a3d0c6aa when the default handler has been replaced by the helper function where we do not check the return value. Reproducer: echo "" > /proc/sys/vm/nr_overcommit_hugepages [[email protected]: correctly propagate proc_doulongvec_minmax return code] Signed-off-by: Michal Hocko <[email protected]> Cc: CAI Qian <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc()Andrew Morton1-1/+1
As it stands this code will degenerate into a busy-wait if the calling task has signal_pending(). Cc: Rolf Eike Beer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/dmapool.c: take lock only once in dma_pool_free()Rolf Eike Beer1-8/+6
dma_pool_free() scans for the page to free in the pool list holding the pool lock. Then it releases the lock basically to acquire it immediately again. Modify the code to only take the lock once. This will do some additional loops and computations with the lock held in if memory debugging is activated. If it is not activated the only new operations with this lock is one if and one substraction. Signed-off-by: Rolf Eike Beer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/page_alloc.c: simplify calculation of combined index of adjacent buddy listsKyongHo Cho1-15/+10
The previous approach of calucation of combined index was page_idx & ~(1 << order)) but we have same result with page_idx & buddy_idx This reduces instructions slightly as well as enhances readability. [[email protected]: coding-style fixes] [[email protected]: fix used-unintialised warning] Signed-off-by: KyongHo Cho <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13brk: fix min_brk lower bound computation for COMPAT_BRKJiri Kosina1-1/+9
Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still be overriden by randomize_va_space sysctl. If this is the case, the min_brk computation in sys_brk() implementation is wrong, as it solely takes into account COMPAT_BRK setting, assuming that brk start is not randomized. But that might not be the case if randomize_va_space sysctl has been set to '2' at the time the binary has been loaded from disk. In such case, the check has to be done in a same way as in !CONFIG_COMPAT_BRK case. In addition to that, the check for the COMPAT_BRK case introduced back in a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower bound") is slightly wrong -- the lower bound shouldn't be mm->end_code, but mm->end_data instead, as that's where the legacy applications expect brk section to start (i.e. immediately after last global variable). [[email protected]: fix comment] Signed-off-by: Jiri Kosina <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common()Jesper Juhl1-1/+3
The NODEMASK_ALLOC macro may dynamically allocate memory for its second argument ('nodes_allowed' in this context). In nr_hugepages_store_common() we may abort early if strict_strtoul() fails, but in that case we do not free the memory already allocated to 'nodes_allowed', causing a memory leak. This patch closes the leak by freeing the memory in the error path. [[email protected]: use NODEMASK_FREE, per Minchan Kim] Signed-off-by: Jesper Juhl <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13mm: migration: use rcu_dereference_protected when dereferencing the radix ↵Mel Gorman1-2/+2
tree slot during file page migration migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page migraton"). The point of the RCU protection there is part of getting a stable reference to anon_vma and is only held for anon pages as file pages are locked which is sufficient protection against freeing. However, while a file page's mapping is being migrated, the radix tree is double checked to ensure it is the expected page. This uses radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held triggering the following warning. [ 173.674290] =================================================== [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ] [ 173.676016] --------------------------------------------------- [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection! [ 173.676016] [ 173.676016] other info that might help us debug this: [ 173.676016] [ 173.676016] [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0 [ 173.676016] 1 lock held by hugeadm/2899: [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab [ 173.676016] [ 173.676016] stack backtrace: [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild [ 173.676016] Call Trace: [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86 [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39 [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107 [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107 [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa This patch introduces radix_tree_deref_slot_protected() which calls rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock that is protecting this dereference. Holding the tree lock protects against parallel updaters of the radix tree meaning that rcu_dereference_protected is allowable. [[email protected]: remove unneeded casts] Signed-off-by: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Milton Miller <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: <[email protected]> [2.6.37.early] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: add compound_trans_head() helperAndrea Arcangeli1-12/+3
Cleanup some code with common compound_trans_head helper. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Avi Kivity <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: KSM on THPAndrea Arcangeli1-9/+58
This makes KSM full operational with THP pages. Subpages are scanned while the hugepage is still in place and delivering max cpu performance, and only if there's a match and we're going to deduplicate memory, the single hugepages with the subpage match is split. There will be no false sharing between ksmd and khugepaged. khugepaged won't collapse 2m virtual regions with KSM pages inside. ksmd also should only split pages when the checksum matches and we're likely to split an hugepage for some long living ksm page (usual ksm heuristic to avoid sharing pages that get de-cowed). Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: khugepaged: make khugepaged aware about madviseAndrea Arcangeli2-5/+20
MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after mmap and before touching the memory. While this is enough for most usages, it's little effort to make madvise more dynamic at runtime on an existing mapping by making khugepaged aware about madvise. MADV_HUGEPAGE: register in khugepaged immediately without waiting a page fault (that may not ever happen if all pages are already mapped and the "enabled" knob was set to madvise during the initial page faults). MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop collapsing pages where not needed. [[email protected]: tweak comment] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Michael Kerrisk <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: madvise(MADV_NOHUGEPAGE)Andrea Arcangeli2-12/+33
Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the feature isn't built into the kernel. Never silently return success. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: compound_trans_orderAndrea Arcangeli2-12/+12
Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: fix memory-failure hugetlbfs vs THP collisionAndrea Arcangeli2-2/+2
hugetlbfs was changed to allow memory failure to migrate the hugetlbfs pages and that broke THP as split_huge_page was then called on hugetlbfs pages too. compound_head/order was also run unsafe on THP pages that can be splitted at any time. All compound_head() invocations in memory-failure.c that are run on pages that aren't pinned and that can be freed and reused from under us (while compound_head is running) are buggy because compound_head can return a dangling pointer, but I'm not fixing this as this is a generic memory-failure bug not specific to THP but it applies to hugetlbfs too, so I can fix it later after THP is merged upstream. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: add debug checks for mapcount related invariantsAndrea Arcangeli1-2/+4
Add debug checks for invariants that if broken could lead to mapcount vs page_mapcount debug checks to trigger later in split_huge_page. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: scale nr_rotated to balance memory pressureRik van Riel1-2/+3
Make sure we scale up nr_rotated when we encounter a referenced transparent huge page. This ensures pageout scanning balance is not distorted when there are huge pages on the LRU. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: fix anon memory statistics with transparent hugepagesRik van Riel3-6/+17
Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU statistics, so the Active(anon) and Inactive(anon) statistics in /proc/meminfo are correct. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: disable transparent hugepages by default on small systemsRik van Riel1-0/+8
On small systems, the extra memory used by the anti-fragmentation memory reserve and simply because huge pages are smaller than large pages can easily outweigh the benefits of less TLB misses. A less obvious concern is if run on a NUMA machine with asymmetric node sizes and one of them is very small. The reserve could make the node unusable. In case of the crashdump kernel, OOMs have been observed due to the anti-fragmentation memory reserve taking up a large fraction of the crashdump image. This patch disables transparent hugepages on systems with less than 1GB of RAM, but the hugepage subsystem is fully initialized so administrators can enable THP through /sys if desired. Signed-off-by: Rik van Riel <[email protected]> Acked-by: Avi Kiviti <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: use compaction for all allocation ordersAndrea Arcangeli1-1/+1
It makes no sense not to enable compaction for small order pages as we don't want to end up with bad order 2 allocations and good and graceful order 9 allocations. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-01-13thp: use compaction in kswapd for GFP_ATOMIC order > 0Andrea Arcangeli2-15/+46
This takes advantage of memory compaction to properly generate pages of order > 0 if regular page reclaim fails and priority level becomes more severe and we don't reach the proper watermarks. Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>