aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2010-08-27x86, memblock: Replace e820_/_early string with memblock_Yinghai Lu1-2/+2
1.include linux/memblock.h directly. so later could reduce e820.h reference. 2 this patch is done by sed scripts mainly -v2: use MEMBLOCK_ERROR instead of -1ULL or -1UL Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27x86: Use memblock to replace early_resYinghai Lu3-46/+18
1. replace find_e820_area with memblock_find_in_range 2. replace reserve_early with memblock_x86_reserve_range 3. replace free_early with memblock_x86_free_range. 4. NO_BOOTMEM will switch to use memblock too. 5. use _e820, _early wrap in the patch, in following patch, will replace them all 6. because memblock_x86_free_range support partial free, we can remove some special care 7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill() so adjust some calling later in setup.c::setup_arch() -- corruption_check and mptable_update -v2: Move reserve_brk() early Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range() that could happen We have more then 128 RAM entry in E820 tables, and memblock_x86_fill() could use memblock_find_in_range() to find a new place for memblock.memory.region array. and We don't need to use extend_brk() after fill_memblock_area() So move reserve_brk() early before fill_memblock_area(). -v3: Move find_smp_config early To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable in right place. -v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in memblock.reserved already.. use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later. -v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit active_region for 32bit does include high pages need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped() -v6: Use current_limit instead -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L -v8: Set memblock_can_resize early to handle EFI with more RAM entries -v9: update after kmemleak changes in mainline Suggested-by: David S. Miller <[email protected]> Suggested-by: Benjamin Herrenschmidt <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27memblock: Add find_memory_core_early()Yinghai Lu1-0/+36
According to node range in early_node_map[] with __memblock_find_in_range to find free range. Will be used by memblock_x86_find_in_range_node() memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27bootmem, x86: Add weak version of reserve_bootmem_genericYinghai Lu1-0/+6
It will be used memblock_x86_to_bootmem converting It is an wrapper for reserve_bootmem, and x86 64bit is using special one. Also clean up that version for x86_64. We don't need to take care of numa path for that, bootmem can handle it how Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27memblock: Add memblock_free/reserve_reserved_regions()Yinghai Lu1-0/+24
So we can avoid export memblock_reserved_init_regions() Suggested by Ben. -v2: use __init_memblock attribute Signed-off-by: Yinghai Lu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27percpu: fix a mismatch between code and commentNamhyung Kim1-1/+1
When pcpu_build_alloc_info() searches best_upa value, it ignores current value if the number of waste units exceeds 1/3 of the number of total cpus. But the comment on the code says that it will ignore if wastage is over 25%. Modify the comment. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2010-08-27percpu: fix a memory leak in pcpu_extend_area_map()Huang Shijie1-1/+3
The original code did not free the old map. This patch fixes it. tj: use @old as memcpy source instead of @chunk->map, and indentation and description update Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected]
2010-08-27writeback: do not lose wakeup events when forking bdi threadsArtem Bityutskiy1-2/+5
This patch fixes the following issue: INFO: task mount.nfs4:1120 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000 ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000 ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8 00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0 Call Trace: [<ffffffff813bc747>] schedule_timeout+0x34/0xf1 [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130 [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130 [<ffffffff8104159c>] ? default_wake_function+0x0/0xf [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e [<ffffffff81105798>] sync_filesystem+0x47/0x4b [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2 [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs] [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41 [<ffffffff810e7fd6>] deactivate_super+0x40/0x45 [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed [<ffffffff810fc73b>] release_mounts+0x9a/0xb0 [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs] [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs] [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs] [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196 [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8 [<ffffffff810fdf68>] do_mount+0x771/0x7e8 [<ffffffff810fe062>] sys_mount+0x83/0xbd [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b The reason of this hang was a race condition: when the flusher thread is forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it visible in 'bdi->wb.task'. The bdi thread runs, does all works, and goes sleep. 'bdi->wb.task' is still NULL. And this is a dangerous time window. If at this time someone queues a work for this bdi, he does not see the bdi thread and wakes up the forker thread instead! But the forker has already forked this bdi thread, but just did not make it visible yet! The result is that we lose the wake up event for this bdi thread and the NFS4 code waits forever. To fix the problem, we should use 'ktrhead_create()' for creating bdi threads, then make them visible in 'bdi->wb.task', and only after this wake them up. This is exactly what this patch does. Signed-off-by: Artem Bityutskiy <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2010-08-25Merge branch '2.6.36-fixes' of ↵Linus Torvalds1-16/+10
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev * '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev: xfs: do not discard page cache data on EAGAIN xfs: don't do memory allocation under the CIL context lock xfs: Reduce log force overhead for delayed logging xfs: dummy transactions should not dirty VFS state xfs: ensure f_ffree returned by statfs() is non-negative xfs: handle negative wbc->nr_to_write during sync writeback writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 xfs: fix untrusted inode number lookup xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE xfs: unlock items before allowing the CIL to commit
2010-08-24guard page for stacks that grow upwardsLuck, Tony2-7/+11
pa-risc and ia64 have stacks that grow upwards. Check that they do not run into other mappings. By making VM_GROWSUP 0x0 on architectures that do not ever use it, we can avoid some unpleasant #ifdefs in check_stack_guard_page(). Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-24writeback: write_cache_pages doesn't terminate at nr_to_write <= 0Dave Chinner1-16/+10
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have been. Enabling writeback tracing showed: flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 Writeback is not terminating in background writeback if ->writepage is returning with wbc->nr_to_write == 0, resulting in sub-optimal single page writeback on XFS. Fix the write_cache_pages loop to terminate correctly when this situation occurs and so prevent this sub-optimal background writeback pattern. This improves sustained sequential buffered write performance from around 250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM. Cc:<[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2010-08-23x86, mm: Avoid unnecessary TLB flushShaohua Li1-1/+1
In x86, access and dirty bits are set automatically by CPU when CPU accesses memory. When we go into the code path of below flush_tlb_fix_spurious_fault(), we already set dirty bit for pte and don't need flush tlb. This might mean tlb entry in some CPUs hasn't dirty bit set, but this doesn't matter. When the CPUs do page write, they will automatically check the bit and no software involved. On the other hand, flush tlb in below position is harmful. Test creates CPU number of threads, each thread writes to a same but random address in same vma range and we measure the total time. Under a 4 socket system, original time is 1.96s, while with the patch, the time is 0.8s. Under a 2 socket system, there is 20% time cut too. perf shows a lot of time are taking to send ipi/handle ipi for tlb flush. Signed-off-by: Shaohua Li <[email protected]> LKML-Reference: <[email protected]> Acked-by: Suresh Siddha <[email protected]> Cc: Andrea Archangeli <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-22mm: exporting account_page_dirtyMichael Rubin1-0/+1
This allows code outside of the mm core to safely manipulate page state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. This has happened once already. Signed-off-by: Michael Rubin <[email protected]> Signed-off-by: Sage Weil <[email protected]>
2010-08-22Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slab: fix object alignment slub: add missing __percpu markup in mm/slub_def.h
2010-08-21mm: make stack guard page logic use vm_prev pointerLinus Torvalds1-4/+11
Like the mlock() change previously, this makes the stack guard check code use vma->vm_prev to see what the mapping below the current stack is, rather than have to look it up with find_vma(). Also, accept an abutting stack segment, since that happens naturally if you split the stack with mlock or mprotect. Tested-by: Ian Campbell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-21mm: make the mlock() stack guard page checks stricterLinus Torvalds1-5/+16
If we've split the stack vma, only the lowest one has the guard page. Now that we have a doubly linked list of vma's, checking this is trivial. Tested-by: Ian Campbell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-21mm: make the vma list be doubly linkedLinus Torvalds2-6/+22
It's a really simple list, and several of the users want to go backwards in it to find the previous vma. So rather than have to look up the previous entry with 'find_vma_prev()' or something similar, just make it doubly linked instead. Tested-by: Ian Campbell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-20oom: __task_cred() need rcu_read_lock()KOSAKI Motohiro1-1/+1
dump_tasks() needs to hold the RCU read lock around its access of the target task's UID. To this end it should use task_uid() as it only needs that one thing from the creds. The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the target process replacing its credentials on another CPU. Then, this patch change to call rcu_read_lock() explicitly. =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- mm/oom_kill.c:410 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 4 locks held by kworker/1:2/651: #0: (events){+.+.+.}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0 #1: (moom_work){+.+...}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0 #2: (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>] out_of_memory+0x164/0x3f0 #3: (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>] find_lock_task_mm+0x2e/0x70 Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: David Howells <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-20oom: fix tasklist_lock leakKOSAKI Motohiro1-3/+6
Commit 0aad4b3124 ("oom: fold __out_of_memory into out_of_memory") introduced a tasklist_lock leak. Then it caused following obvious danger warnings and panic. ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ rsyslogd/1422 is leaving the kernel with locks still held! 1 lock held by rsyslogd/1422: #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0 BUG: scheduling while atomic: rsyslogd/1422/0x00000002 INFO: lockdep is turned off. This patch fixes it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-20oom: fix NULL pointer dereferenceKOSAKI Motohiro1-3/+2
Commit b940fd7035 ("oom: remove unnecessary code and cleanup") added an unnecessary NULL pointer dereference. remove it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-20lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged()Jan Kara1-1/+2
When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows index and tree traversal code goes astray reading memory until it hits unreadable memory. Check for overflow and exit in that case. Signed-off-by: Jan Kara <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Nick Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-17shmem: put_super must percpu_counter_destroyHugh Dickins1-2/+6
list_add() corruption messages reported from shmem_fill_super()'s recently introduced percpu_counter_init(): shmem_put_super() needs to remember to percpu_counter_destroy(). And also check error from percpu_counter_init(). Reported-bisected-and-tested-by: Tetsuo Handa <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-15mm: fix up some user-visible effects of the stack guard pageLinus Torvalds1-0/+8
This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depends on the exact deails of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <[email protected] Reported-by: Henrique de Moraes Holschuh <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2010-08-14mm/page-writeback: fix non-kernel-doc function commentsRandy Dunlap1-2/+2
Remove leading /** from non-kernel-doc function comments to prevent kernel-doc warnings. Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-14mm: fix page table unmap for stack guard page properlyLinus Torvalds1-7/+6
We do in fact need to unmap the page table _before_ doing the whole stack guard page logic, because if it is needed (mainly 32-bit x86 with PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it will do a kmap_atomic/kunmap_atomic. And those kmaps will create an atomic region that we cannot do allocations in. However, the whole stack expand code will need to do anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an atomic region. Now, a better model might actually be to do the anon_vma_prepare() when _creating_ a VM_GROWSDOWN segment, and not have to worry about any of this at page fault time. But in the meantime, this is the straightforward fix for the issue. See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details. Reported-by: Wylda <[email protected]> Reported-by: Sedat Dilek <[email protected]> Reported-by: Mike Pagano <[email protected]> Reported-by: François Valenduc <[email protected]> Tested-by: Ed Tomlinson <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Greg KH <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2010-08-13NOMMU: Remove an extraneous no_printk()David Howells1-5/+0
Remove an extraneous no_printk() in mm/nommu.c that got missed when the function got generalised from several things that used it in commit 12fdff3fc248 ("Add a dummy printk function for the maintenance of unused printks"). Without this, the following error is observed: mm/nommu.c:41: error: conflicting types for 'no_printk' include/linux/kernel.h:314: error: previous definition of 'no_printk' was here Reported-by: Michal Simek <[email protected]> Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-13mm: fix missing page table unmap for stack guard page failure caseLinus Torvalds1-1/+3
.. which didn't show up in my tests because it's a no-op on x86-64 and most other architectures. But we enter the function with the last-level page table mapped, and should unmap it at exit. Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12mm: keep a guard page below a grow-down stack segmentLinus Torvalds1-0/+23
This is a rather minimally invasive patch to solve the problem of the user stack growing into a memory mapped area below it. Whenever we fill the first page of the stack segment, expand the segment down by one page. Now, admittedly some odd application might _want_ the stack to grow down into the preceding memory mapping, and so we may at some point need to make this a process tunable (some people might also want to have more than a single page of guarding), but let's try the minimal approach first. Tested with trivial application that maps a single page just below the stack, and then starts recursing. Without this, we will get a SIGSEGV _after_ the stack has smashed the mapping. With this patch, we'll get a nice SIGBUS just as the stack touches the page just above the mapping. Requested-by: Keith Packard <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12Merge branch 'hwpoison' of ↵Linus Torvalds4-38/+260
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: hugetlb: add missing unlock in avoidcopy path in hugetlb_cow() hwpoison: rename CONFIG HWPOISON, hugetlb: support hwpoison injection for hugepage HWPOISON, hugetlb: detect hwpoison in hugetlb code HWPOISON, hugetlb: isolate corrupted hugepage HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage HWPOISON, hugetlb: enable error handling path for hugepage hugetlb, rmap: add reverse mapping for hugepage hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h Fix up trivial conflicts in mm/memory-failure.c
2010-08-12Merge branch 'stable/xen-swiotlb-0.8.6' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: x86: Detect whether we should use Xen SWIOTLB. pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions. swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough. xen/mmu: inhibit vmap aliases rather than trying to clear them out vmap: add flag to allow lazy unmap to be disabled at runtime xen: Add xen_create_contiguous_region xen: Rename the balloon lock xen: Allow unprivileged Xen domains to create iomap pages xen: use _PAGE_IOMAP in ioremap to do machine mappings Fix up trivial conflicts (adding both xen swiotlb and xen pci platform driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and include/xen/xen-ops.h
2010-08-12writeback: add comment to the dirty limit functionsWu Fengguang1-3/+28
Document global_dirty_limits() and bdi_dirty_limit(). Signed-off-by: Wu Fengguang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12writeback: avoid unnecessary calculation of bdi dirty thresholdsWu Fengguang2-38/+40
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12writeback: balance_dirty_pages(): reduce calls to global_page_stateWu Fengguang1-62/+33
Reducing the number of times balance_dirty_pages calls global_page_state reduces the cache references and so improves write performance on a variety of workloads. 'perf stats' of simple fio write tests shows the reduction in cache access. Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2 with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10 times, dropping the fasted & slowest values then taking the average & standard deviation average (s.d.) in millions (10^6) 2.6.31-rc8 648.6 (14.6) +patch 620.1 (16.5) Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads the counters to apply the dirty_threshold and moving this check up into balance_dirty_pages where it has already read the counters. Also by rearrange the for loop to only contain one copy of the limit tests allows the pdflush test after the loop to use the local copies of the counters rather than rereading them. In the common case with no throttling it now calls global_page_state 5 fewer times and bdi_stat 2 fewer. Fengguang: This patch slightly changes behavior by replacing clip_bdi_dirty_limit() with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly accurate we don't need to do routinely clip. A simple dirty limit check would be enough. The check is necessary because, in principle we should throttle everything calling balance_dirty_pages() when we're over the total limit, as said by Peter. We now set and clear dirty_exceeded not only based on bdi dirty limits, but also on the global dirty limit. The global limit check is added in place of clip_bdi_dirty_limit() for safety and not intended as a behavior change. The bdi limits should be tight enough to keep all dirty pages under the global limit at most time; occasional small exceeding should be OK though. The change makes the logic more obvious: the global limit is the ultimate goal and shall be always imposed. We may now start background writeback work based on outdated conditions. That's safe because the bdi flush thread will (and have to) double check the states. It reduces overall overheads because the test based on old states still have good chance to be right. [[email protected]] fix uninitialized dirty_exceeded Signed-off-by: Richard Kennedy <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Cc: Jan Kara <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12mm: fix fatal kernel-doc errorRandy Dunlap1-1/+1
Fix a fatal kernel-doc error due to a #define coming between a function's kernel-doc notation and the function signature. (kernel-doc cannot handle this) Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: convert to use zone_to_nid() from bare zone->zone_pgdat->node_idKOSAKI Motohiro1-3/+3
We have zone_to_nid(). this patch convert all existing users of zone->zone_pgdat->node_id. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nishimura Daisuke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: remove nid and zid argument from mem_cgroup_soft_limit_reclaim()KOSAKI Motohiro2-8/+4
mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument. but nid and zid can be calculated from zone. So remove it. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Nishimura Daisuke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: mem_cgroup_shrink_node_zone() doesn't need sc.nodemaskKOSAKI Motohiro2-6/+2
Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly. thus it doesn't need to initialize sc.nodemask because shrink_zone() doesn't use it at all. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Nishimura Daisuke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: kill unnecessary initialization in mem_cgroup_shrink_node_zone()KOSAKI Motohiro1-2/+0
sc.nr_reclaimed and sc.nr_scanned have already been initialized few lines above "struct scan_control sc = {}" statement. So, This patch remove this unnecessary code. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nishimura Daisuke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: sc.nr_to_reclaim should be initializedKOSAKI Motohiro1-0/+1
Currently, mem_cgroup_shrink_node_zone() initialize sc.nr_to_reclaim as 0. It mean shrink_zone() only scan 32 pages and immediately return even if it doesn't reclaim any pages. This patch fixes it. Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Nishimura Daisuke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: avoid css_get()KAMEZAWA Hiroyuki1-43/+76
Now, memory cgroup increments css(cgroup subsys state)'s reference count per a charged page. And the reference count is kept until the page is uncharged. But this has 2 bad effect. 1. Because css_get/put calls atomic_inc()/dec, heavy call of them on large smp will not scale well. 2. Because css's refcnt cannot be in a state as "ready-to-release", cgroup's notify_on_release handler can't work with memcg. 3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small. This has been a problem since the 1st merge of memcg. This is a trial to remove css's refcnt per a page. Even if we remove refcnt, pre_destroy() does enough synchronization as - check res->usage == 0. - check no pages on LRU. This patch removes css's refcnt per page. Even after this patch, at the 1st look, it seems css_get() is still called in try_charge(). But the logic is. - If a memcg of mm->owner is cached one, consume_stock() will work. At success, return immediately. - If consume_stock returns false, css_get() is called and go to slow path which may be blocked. At the end of slow path, css_put() is called and restart from the start if necessary. So, in the fast path, we don't call css_get() and can avoid access to shared counter. This patch can make the most possible case fast. Here is a result of multi-threaded page fault benchmark. [Before] 25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c 9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave 8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm <=====(*) 7.83% multi-fault-all [kernel.kallsyms] [k] down_read_trylock 5.38% multi-fault-all [kernel.kallsyms] [k] __css_put 5.29% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask 4.92% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq 4.24% multi-fault-all [kernel.kallsyms] [k] up_read 3.53% multi-fault-all [kernel.kallsyms] [k] css_put 2.11% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault 1.76% multi-fault-all [kernel.kallsyms] [k] __rmqueue 1.64% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge [After] 28.41% multi-fault-all [kernel.kallsyms] [k] clear_page_c 10.08% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq 9.58% multi-fault-all [kernel.kallsyms] [k] down_read_trylock 9.38% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave 5.86% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask 5.65% multi-fault-all [kernel.kallsyms] [k] up_read 2.82% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault 2.64% multi-fault-all [kernel.kallsyms] [k] mem_cgroup_add_lru_list 2.48% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch removes css_tryget() in it. (But yes, this is an extreme case.) Signed-off-by: Daisuke Nishimura <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: use find_lock_task_mm() in memory cgroups oomKAMEZAWA Hiroyuki2-4/+8
When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: remove mem from arg of charge_commonDaisuke Nishimura1-9/+8
mem_cgroup_charge_common() is always called with @mem = NULL, so it's meaningless. This patch removes it. Signed-off-by: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: remove redundant codeDaisuke Nishimura1-10/+0
- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we don't have to call them in task_in_mem_cgroup(). - *mz is not used in __mem_cgroup_uncharge_common(). - we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration() after we've cleared PCG_MIGRATION of @oldpage. - remove empty comment. - remove redundant empty line in mem_cgroup_cache_charge(). Signed-off-by: Daisuke Nishimura <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: clean up waiting move acctKAMEZAWA Hiroyuki1-22/+29
Now, for checking a memcg is under task-account-moving, we do css_tryget() against mc.to and mc.from. But this is just complicating things. This patch makes the check easier. This patch adds a spinlock to move_charge_struct and guard modification of mc.to and mc.from. By this, we don't have to think about complicated races arount this not-critical path. [[email protected]: don't crash on a null memcg being passed] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Balbir Singh <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11memcg: clean up try_charge main loopKAMEZAWA Hiroyuki1-100/+148
mem_cgroup_try_charge() has a big loop in it and seems to be hard to read. Most of routines are for slow path. This patch moves codes out from the loop and make it clear what's done. Summary: - refactoring a function to detect a memcg is under acccount move or not. - refactoring a function to wait for the end of moving task acct. - refactoring a main loop('s slow path) as a function and make it clear why we retry or quit by return code. - add fatal_signal_pending() check for bypassing charge loops. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()Naoya Horiguchi1-1/+3
This patch fixes possible deadlock in hugepage lock_page() by adding missing unlock_page(). libhugetlbfs test will hit this bug when the next patch in this patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied. Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Jun'ichi Nomura <[email protected]> Acked-by: Fengguang Wu <[email protected]> Signed-off-by: Andi Kleen <[email protected]>
2010-08-11hwpoison: rename CONFIGNaoya Horiguchi1-2/+2
CONFIG_HUGETLBFS controls hugetlbfs interface code. OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code. So we should use CONFIG_HUGETLB_PAGE here. Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andi Kleen <[email protected]>
2010-08-11HWPOISON, hugetlb: support hwpoison injection for hugepageNaoya Horiguchi2-6/+11
This patch enables hwpoison injection through debug/hwpoison interfaces, with which we can test memory error handling for free or reserved hugepages (which cannot be tested by madvise() injector). [AK: Export PageHuge too for the injection module] Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andrew Morton <[email protected]> Acked-by: Fengguang Wu <[email protected]> Signed-off-by: Andi Kleen <[email protected]>
2010-08-11HWPOISON, hugetlb: detect hwpoison in hugetlb codeNaoya Horiguchi1-0/+40
This patch enables to block access to hwpoisoned hugepage and also enables to block unmapping for it. Dependency: "HWPOISON, hugetlb: enable error handling path for hugepage" Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andrew Morton <[email protected]> Acked-by: Fengguang Wu <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andi Kleen <[email protected]>
2010-08-11HWPOISON, hugetlb: isolate corrupted hugepageNaoya Horiguchi2-8/+36
If error hugepage is not in-use, we can fully recovery from error by dequeuing it from freelist, so return RECOVERY. Otherwise whether or not we can recovery depends on user processes, so return DELAYED. Dependency: "HWPOISON, hugetlb: enable error handling path for hugepage" Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andrew Morton <[email protected]> Acked-by: Fengguang Wu <[email protected]> Signed-off-by: Andi Kleen <[email protected]>