path: root/mm
Age | Commit message | Author | Files | Lines
2010-11-25 | mm: remove call to find_vma in pagewalk for non-hugetlbfs | David Sterba | 1 | -2/+3
Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in walk_page_range()") introduces a check if a vma is a hugetlbfs one and later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is left behind and its result is not used anywhere else in the function. The side-effect of caching vma for @addr inside walk->mm is neither utilized in walk_page_range() nor in called functions. Signed-off-by: David Sterba <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Cc: Andy Whitcroft <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Matt Mackall <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-11-25 | mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run() | KAMEZAWA Hiroyuki | 1 | -9/+5
During memory hotplug, build_all_zonelists() may be called under stop_machine_run(). In this function, setup_zone_pageset() is called. This is a bug because it will do page allocation under stop_machine_run().

Here is a report from Alok Kataria.

BUG: sleeping function called from invalid context at kernel/mutex.c:94
in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
Call Trace:
 [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
 [<ffffffff81468245>] mutex_lock+0x24/0x50
 [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
 [<ffffffff81048888>] ? load_balance+0xbe/0x60e
 [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
 [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
 [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
 [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
 [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
 [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
 [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
 [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
 [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
 [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
 [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
 [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
 [<ffffffff81065f29>] kthread+0x7f/0x87
 [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81065eaa>] ? kthread+0x0/0x87
 [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
Policy zone: Normal

This patch tries to fix the issue by moving setup_zone_pageset() out from stop_machine_run(). It obviously does not need to be called under stop_machine_run().

[[email protected]: remove unneeded local]
Reported-by: Alok Kataria <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Petr Vandrovec <[email protected]>
Cc: Pekka Enberg <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-11-25 | cgroups: make swap accounting default behavior configurable | Michal Hocko | 1 | -2/+19
Swap accounting can be configured by the CONFIG_CGROUP_MEM_RES_CTLR_SWAP configuration option, and then it is turned on by default. There is a boot option (noswapaccount) which can disable this feature.

This makes it hard for distributors to enable the configuration option, as this feature leads to bigger memory consumption, and this is a no-go for a general purpose distribution kernel. On the other hand, swap accounting may be very useful for some workloads.

This patch adds a new configuration option which controls the default behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected then the feature is turned on by default.

It also adds a new boot parameter swapaccount[=1|0] which enhances the original noswapaccount parameter semantics by means of enable/disable logic (defaults to 1 if no value is provided, to stay consistent with noswapaccount).

The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well).

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
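The parameter semantics described in this message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel's parser: `parse_swapaccount()` and `SWAP_ACCOUNT_DEFAULT` are hypothetical names, and the macro merely stands in for the CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED default.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED (assumed enabled). */
#define SWAP_ACCOUNT_DEFAULT true

/*
 * Hypothetical userspace model of the boot parameter semantics.
 * Returns whether swap accounting ends up enabled after seeing one
 * command-line token; unrelated tokens leave the default in place.
 */
static bool parse_swapaccount(const char *arg)
{
	if (strcmp(arg, "noswapaccount") == 0)
		return false;            /* original switch: always disables */
	if (strcmp(arg, "swapaccount") == 0)
		return true;             /* no value given: defaults to 1 */
	if (strcmp(arg, "swapaccount=1") == 0)
		return true;
	if (strcmp(arg, "swapaccount=0") == 0)
		return false;
	return SWAP_ACCOUNT_DEFAULT;     /* token is not about swap accounting */
}
```

With this model, `swapaccount=0` and `noswapaccount` are equivalent, which is the consistency the message asks for.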
2010-11-25 | memcg: avoid deadlock between move charge and try_charge() | Daisuke Nishimura | 1 | -17/+26
__mem_cgroup_try_charge() can be called under down_write(&mmap_sem) (e.g. mlock does it). This means it can cause a deadlock if it races with move charge:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot acquire the lock    |    -> true
                                        |      schedule()

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot acquire the lock      |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()

To avoid this deadlock, we do all the move charge work (both can_attach() and attach()) under one mmap_sem section. After this patch, we set/clear mc.moving_task outside mc.lock, because we use the lock only to check mc.from/to.

Signed-off-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-11-25 | memcg: fix false positive VM_BUG on non-SMP | Kirill A. Shutemov | 1 | -1/+1
Fix this: kernel BUG at mm/memcontrol.c:2155! invalid opcode: 0000 [#1] last sysfs file: Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0 EIP is at mem_cgroup_move_account+0xe2/0xf0 EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000 ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000) Stack: 00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000 08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50 c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340 Call Trace: [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130 [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130 [<c106c75e>] ? walk_page_range+0xee/0x1d0 [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90 [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130 [<c1072570>] ? mem_cgroup_move_task+0x0/0x90 [<c1042616>] ? cgroup_attach_task+0x136/0x200 [<c1042878>] ? cgroup_tasks_write+0x48/0xc0 [<c1041e9e>] ? cgroup_file_write+0xde/0x220 [<c101398d>] ? do_page_fault+0x17d/0x3f0 [<c108a79d>] ? alloc_fd+0x2d/0xd0 [<c1041dc0>] ? cgroup_file_write+0x0/0x220 [<c1077ba2>] ? vfs_write+0x92/0xc0 [<c1077c81>] ? sys_write+0x41/0x70 [<c1140e3d>] ? syscall_call+0x7/0xb Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3 EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30 ---[ end trace 7daa1582159b6532 ]--- lock_page_cgroup and unlock_page_cgroup are implemented using bit_spinlock. bit_spinlock doesn't touch the bit if we are on non-SMP machine, so we can't use the bit to check whether the lock was taken. Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead of PageCgroupLocked to fix it. 
[[email protected]: s/is_page_cgroup_locked/page_is_cgroup_locked/] Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
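Why testing the lock bit gives a false positive on non-SMP can be shown with a small single-threaded model. This is a sketch under stated assumptions, not the kernel's bit_spinlock code: every name below is hypothetical, the `smp` field replaces the compile-time configuration so both cases can be exercised, and the use of a preemption count for the non-SMP "is locked" answer is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

#define LOCK_BIT 0

/* Hypothetical model; smp selects which kind of build is being modelled. */
struct lock_model {
	unsigned long flags;    /* word that holds the lock bit */
	int preempt_count;      /* modelled preemption-disable depth */
	bool smp;
};

static void model_bit_spin_lock(struct lock_model *m)
{
	m->preempt_count++;                     /* both builds disable preemption */
	if (m->smp)
		m->flags |= 1UL << LOCK_BIT;    /* only SMP builds touch the bit */
}

static void model_bit_spin_unlock(struct lock_model *m)
{
	if (m->smp)
		m->flags &= ~(1UL << LOCK_BIT);
	m->preempt_count--;
}

/* The broken style of check: look at the bit directly. */
static bool model_lock_bit_set(const struct lock_model *m)
{
	return (m->flags & (1UL << LOCK_BIT)) != 0;
}

/* The style of check the fix switches to: ask the lock primitive. */
static bool model_is_locked(const struct lock_model *m)
{
	return m->smp ? model_lock_bit_set(m) : m->preempt_count > 0;
}
```

In the non-SMP case the lock is held but `model_lock_bit_set()` still reports false, which is the situation that made the VM_BUG_ON fire.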
2010-11-25 | nommu: yield CPU while disposing VM | Steven J. Magnani | 1 | -0/+1
Depending on processor speed, page size, and the amount of memory a process is allowed to amass, cleanup of a large VM may freeze the system for many seconds. This can result in a watchdog timeout. Make sure other tasks receive some service when cleaning up large VMs. Signed-off-by: Steven J. Magnani <[email protected]> Cc: Greg Ungerer <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-11-14 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 | Linus Torvalds | 1 | -1/+2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Fix slub_lock down/up imbalance
2010-11-14 | slub: Fix slub_lock down/up imbalance | Pavel Emelyanov | 1 | -1/+2
There are two places, that do not release the slub_lock. Respective bugs were introduced by sysfs changes ab4d5ed5 (slub: Enable sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 ( slub: Allow removal of slab caches during boot). Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Pavel Emelyanov <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-11-12 | radix-tree: fix RCU bug | Nick Piggin | 1 | -16/+10
Salman Qazi describes the following radix-tree bug:

In the following case, we can get a deadlock:

0. The radix tree contains two items, one has the index 0.
1. The reader (in this case find_get_pages) takes the rcu_read_lock.
2. The reader acquires slot(s) for item(s) including the index 0 item.
3. The non-zero index item is deleted, and as a consequence the other item is moved to the root of the tree. The place where it used to be is queued for deletion after the readers finish.
3b. The zero item is deleted, removing it from the direct slot; it remains in the rcu-delayed indirect node.
4. The reader looks at the index 0 slot, and finds that the page has 0 ref count.
5. The reader looks at it again, hoping that the item will either be freed or the ref count will increase. This never happens, as the slot it is looking at will never be updated. Also, this slot can never be reclaimed because the reader is holding rcu_read_lock and is in an infinite loop.

The fix is to generalize the "indirect" pointer case, which already requires a slot lookup retry, into a "retry the lookup" bit.

Signed-off-by: Nick Piggin <[email protected]>
Reported-by: Salman Qazi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-11-12 | vmscan: avoid setting zone congested if no page dirty | Shaohua Li | 1 | -1/+1
nr_dirty and nr_congested are increased only when the page is dirty. So if all pages are clean, both of them will be zero. In this case, we should not mark the zone congested.

Signed-off-by: Shaohua Li <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
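The condition described above can be sketched as a small predicate. This is an illustration only: `should_mark_congested()` is a hypothetical helper, and the real check in vmscan may involve further conditions that the message does not mention.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: a zone is treated as congested only when every
 * dirty page seen was also congested AND at least one dirty page was
 * seen. Without the non-zero test, 0 == 0 would wrongly count as
 * "all dirty pages are congested".
 */
static bool should_mark_congested(unsigned long nr_dirty,
				  unsigned long nr_congested)
{
	return nr_dirty != 0 && nr_dirty == nr_congested;
}
```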
2010-11-12 | mm/vfs: revalidate page->mapping in do_generic_file_read() | Dave Hansen | 1 | -0/+3
70 hours into some stress tests of a 2.6.32-based enterprise kernel, we ran into a NULL dereference in here: int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc, unsigned long from) { ----> struct inode *inode = page->mapping->host; It looks like page->mapping was the culprit. (xmon trace is below). After closer examination, I realized that do_generic_file_read() does a find_get_page(), and eventually locks the page before calling block_is_partially_uptodate(). However, it doesn't revalidate the page->mapping after the page is locked. So, there's a small window between the find_get_page() and ->is_partially_uptodate() where the page could get truncated and page->mapping cleared. We _have_ a reference, so it can't get reclaimed, but it certainly can be truncated. I think the correct thing is to check page->mapping after the trylock_page(), and jump out if it got truncated. This patch has been running in the test environment for a month or so now, and we have not seen this bug pop up again. 
xmon info: 1f:mon> e cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770] pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100 lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770 sp: c0000002ae36f9f0 msr: 8000000000009032 dar: 0 dsisr: 40000000 current = 0xc000000378f99e30 paca = 0xc000000000f66300 pid = 21946, comm = bash 1f:mon> r R00 = 0025c0500000006d R16 = 0000000000000000 R01 = c0000002ae36f9f0 R17 = c000000362cd3af0 R02 = c000000000e8cd80 R18 = ffffffffffffffff R03 = c0000000031d0f88 R19 = 0000000000000001 R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0 R05 = 0000000000000000 R21 = c0000002ae36fa68 R06 = 0000000000000000 R22 = 0000000000000000 R07 = 0000000000000001 R23 = c0000002ae36fbb0 R08 = 0000000000000002 R24 = 0000000000000000 R09 = 0000000000000000 R25 = c000000362cd3a80 R10 = 0000000000000000 R26 = 0000000000000002 R11 = c0000000001e7b60 R27 = 0000000000000000 R12 = 0000000042000484 R28 = 0000000000000001 R13 = c000000000f66300 R29 = c0000003bb97b9b8 R14 = 0000000000000001 R30 = c000000000e28a08 R15 = 000000000000ffff R31 = c0000000031d0f88 pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100 lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770 msr = 8000000000009032 cr = 22000488 ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300 dar = 0000000000000000 dsisr = 40000000 1f:mon> t [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770 [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable) [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160 [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0 [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0 [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 00000080a840bc54 SP (fffca15df30) is in userspace 1f:mon> di c0000000001e7a6c c0000000001e7a6c e9290000 ld r9,0(r9) c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100 c0000000001e7a74 
e9440008 ld r10,8(r4) c0000000001e7a78 78a80020 clrldi r8,r5,32 c0000000001e7a7c 3c000001 lis r0,1 c0000000001e7a80 812900a8 lwz r9,168(r9) c0000000001e7a84 39600001 li r11,1 c0000000001e7a88 7c080050 subf r0,r8,r0 c0000000001e7a8c 7f805040 cmplw cr7,r0,r10 c0000000001e7a90 7d6b4830 slw r11,r11,r9 c0000000001e7a94 796b0020 clrldi r11,r11,32 c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100 c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11 c0000000001e7aa0 7d004214 add r8,r0,r8 c0000000001e7aa4 79080020 clrldi r8,r8,32 c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100 Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Johannes Weiner <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Minchan Kim <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
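The pattern the message argues for (take the page lock, then re-check page->mapping before using it) can be sketched in userspace. This is a simplified model under stated assumptions, not do_generic_file_read() itself: the structs, the `read_step` enum and `after_lookup()` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct address_space and struct page. */
struct mapping_model { int id; };
struct page_model {
	struct mapping_model *mapping;  /* NULL once the page is truncated */
	bool locked;
};

enum read_step {
	STEP_WAIT_FOR_PAGE,     /* could not take the lock without blocking */
	STEP_RETRY_LOOKUP,      /* page was truncated; look it up again */
	STEP_CHECK_UPTODATE     /* safe to dereference page->mapping */
};

static bool trylock_page_model(struct page_model *page)
{
	if (page->locked)
		return false;
	page->locked = true;
	return true;
}

/* Decide what to do with a page that was found but is not yet up to date. */
static enum read_step after_lookup(struct page_model *page)
{
	if (!trylock_page_model(page))
		return STEP_WAIT_FOR_PAGE;
	/* Revalidate: truncation may have cleared mapping before we locked. */
	if (page->mapping == NULL) {
		page->locked = false;
		return STEP_RETRY_LOOKUP;
	}
	return STEP_CHECK_UPTODATE;
}
```

Only the `STEP_CHECK_UPTODATE` path may go on to call something like ->is_partially_uptodate(), because that is the only path on which the mapping was observed non-NULL while the page lock was held.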
2010-11-12 | memcg: null dereference on allocation failure | Dan Carpenter | 1 | -7/+9
The original code had a null dereference if alloc_percpu() failed. This was introduced in commit 711d3d2c9bc3 ("memcg: cpu hotplug aware percpu count updates") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-11-09 | perf_events: Fix perf_counter_mmap() hook in mprotect() | Pekka Enberg | 1 | -1/+1
As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap event hooks to mprotect()") is fundamentally wrong as mprotect_fixup() can free 'vma' due to merging. Fix the problem by moving perf_event_mmap() hook to mprotect_fixup(). Note: there's another successful return path from mprotect_fixup() if old flags equal to new flags. We don't, however, need to call perf_event_mmap() there because 'perf' already knows the VMA is executable. Reported-by: Dave Jones <[email protected]> Analyzed-by: Linus Torvalds <[email protected]> Cc: Ingo Molnar <[email protected]> Reviewed-by: Peter Zijlstra <[email protected]> Signed-off-by: Pekka Enberg <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-11-03 | vmstat: fix offset calculation on void* | Wu Fengguang | 1 | -1/+1
Fix regression introduced by commit 79da826aee6 ("writeback: report dirty thresholds in /proc/vmstat"). The incorrect pointer arithmetic can result in problems like this: BUG: unable to handle kernel paging request at 07c06d16 IP: [<c050c336>] strnlen+0x6/0x20 Call Trace: [<c050a249>] ? string+0x39/0xe0 [<c042be6b>] ? __wake_up_common+0x4b/0x80 [<c050afcc>] ? vsnprintf+0x1ec/0x380 [<c04b380e>] ? seq_printf+0x2e/0x60 [<c04829a6>] ? vmstat_show+0x26/0x30 [<c04b3bb6>] ? seq_read+0xa6/0x380 [<c04b3b10>] ? seq_read+0x0/0x380 [<c04d5d2f>] ? proc_reg_read+0x5f/0x90 [<c049c4a1>] ? vfs_read+0xa1/0x140 [<c04d5cd0>] ? proc_reg_read+0x0/0x90 [<c049c981>] ? sys_read+0x41/0x70 [<c0402bd0>] ? sysenter_do_call+0x12/0x26 Reported-by: Tetsuo Handa <[email protected]> Cc: Michael Rubin <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
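The class of bug named in the subject can be illustrated generically. GNU C allows arithmetic on `void *` and treats the pointed-to size as 1, so an offset meant in elements silently becomes an offset in bytes. The sketch below is not the vmstat code: both helpers are hypothetical, and the byte-wise one uses `char *` so that it stays valid in portable C while behaving like GNU `void *` arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise offset: advances pos * sizeof(unsigned long) bytes. */
static unsigned long *slot_typed(void *base, size_t pos)
{
	return (unsigned long *)base + pos;
}

/*
 * Byte-wise offset: what `base + pos` does on a void * under the GNU
 * extension. The result usually points into the middle of an element.
 */
static void *slot_bytewise(void *base, size_t pos)
{
	return (char *)base + pos;
}
```

Reading an `unsigned long` through the byte-wise result yields a value stitched together from neighbouring elements, which is consistent with the wild pointer seen in the strnlen() fault above.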
2010-11-02 | Release page reference during page fault retry | Michel Lespinasse | 1 | -1/+3
This slipped by when unifying the filemap and swap versions of lock_page_or_retry()... Signed-off-by: Michel Lespinasse <[email protected]> Acked-by: Rik van Riel <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-30 | audit mmap | Al Viro | 2 | -0/+4
Normal syscall audit doesn't catch the 5th argument of a syscall. It also doesn't catch the contents of userland structures pointed to by syscall arguments, so for both the old and the new mmap(2) ABI it doesn't record the descriptor we are mapping. For the old one it also misses flags.

Signed-off-by: Al Viro <[email protected]>
2010-10-29 | convert get_sb_nodev() users | Al Viro | 1 | -5/+5
Signed-off-by: Al Viro <[email protected]>
2010-10-28 | numa: fix slab_node(MPOL_BIND) | Eric Dumazet | 1 | -1/+1
When a node contains only HighMem memory, slab_node(MPOL_BIND) dereferences a NULL pointer. [ This code seems to go back all the way to commit 19770b32609b: "mm: filter based on a nodemask as well as a gfp_mask". Which was back in April 2008, and it got merged into 2.6.26. - Linus ] Signed-off-by: Eric Dumazet <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Andrew Morton <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300 | Linus Torvalds | 1 | -1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits) MN10300: Save frame pointer in thread_info struct rather than global var MN10300: Change "Matsushita" to "Panasonic". MN10300: Create a defconfig for the ASB2364 board MN10300: Update the ASB2303 defconfig MN10300: ASB2364: Add support for SMSC911X and SMC911X MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA MN10300: Generic time support MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support MN10300: Map userspace atomic op regs as a vmalloc page MN10300: And Panasonic AM34 subarch and implement SMP MN10300: Delete idle_timestamp from irq_cpustat_t MN10300: Make various interrupt priority settings configurable MN10300: Optimise do_csum() MN10300: Implement atomic ops using atomic ops unit MN10300: Make the FPU operate in non-lazy mode under SMP MN10300: SMP TLB flushing MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control MN10300: Make the use of PIDR to mark TLB entries controllable MN10300: Rename __flush_tlb*() to local_flush_tlb*() MN10300: AM34 erratum requires MMUCTR read and write on exception entry ...
2010-10-27 | fuse: use release_pages() | Miklos Szeredi | 1 | -0/+1
Replace iterated page_cache_release() with release_pages(), which is faster and shorter. Needs release_pages() to be exported to modules. Suggested-by: Andrew Morton <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: generic filestat update interface | KAMEZAWA Hiroyuki | 1 | -7/+18
This patch extracts the core logic from mem_cgroup_update_file_mapped() as mem_cgroup_update_file_stat() and adds a wrapper. As a planned future update, memory cgroup has to count dirty pages to implement dirty_ratio/limit. And more, the number of dirty pages is required to kick flusher thread to start writeback. (Now, no kick.) This patch is preparation for it and makes other statistics implementation clearer. Just a clean up. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Balbir Singh <[email protected]> Reviewed-by: Greg Thelen <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: cpu hotplug aware quick acount_move detection | KAMEZAWA Hiroyuki | 1 | -7/+30
An event counter MEM_CGROUP_ON_MOVE is used for a quick check of whether a file stat update can be done asynchronously or not. Currently, it uses a percpu counter and for_each_possible_cpu to update it.

This patch replaces for_each_possible_cpu with for_each_online_cpu and adds the necessary synchronization logic for CPU hotplug.

[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: cpu hotplug aware percpu count updates | KAMEZAWA Hiroyuki | 1 | -9/+93
Currently, memcgroup's per-cpu counter uses for_each_possible_cpu() to get the value. It's better to use for_each_online_cpu() and a cpu hotplug handler.

This patch only handles the statistics counter. MEM_CGROUP_ON_MOVE will be handled in another patch.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: use for_each_mem_cgroup | KAMEZAWA Hiroyuki | 1 | -87/+83
In memory cgroup management, we sometimes have to walk through a subhierarchy of cgroups to gather information, or lock something, etc.

Currently, the mem_cgroup_walk_tree() function is provided to do that. It calls the given callback function for each cgroup found. But the bad thing is that it has to pass a fixed-style function and a "void*" argument, and it adds much type casting to memcontrol.c.

To make the code clean, this patch replaces walk_tree() with

  for_each_mem_cgroup_tree(iter, root)

an iterator-style call. The good point is that an iterator call doesn't have to assume what kind of function is called under it. A bad point is that it may cause a reference-count leak if a caller uses "break" from the loop by mistake.

I think the benefit is larger. The modified code seems straightforward and easy to read because we don't have mysterious callbacks and pointer casts.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: avoid lock in updating file_mapped (Was fix race in file_mapped accouting flag management | KAMEZAWA Hiroyuki | 1 | -14/+85
When accounting file events per memory cgroup, we need to find the memory cgroup via page_cgroup->mem_cgroup. Currently, we use lock_page_cgroup() to guarantee that pc->mem_cgroup is not overwritten while we make use of it.

But, considering the context in which page_cgroups for files are accessed, we can use an alternative lightweight form of mutual exclusion in most cases.

When handling file caches, the only race we have to take care of is account "moving", in other words, overwriting page_cgroup->mem_cgroup. (See the comment in the patch.)

Unlike charge/uncharge, "move" happens infrequently. It happens only on rmdir() and on task moving (with special settings).

This patch adds a race checker for file-cache-status accounting vs. account moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added. The routine for account move:

1. Increment it before starting to move.
2. Call synchronize_rcu().
3. Decrement it after the end of moving.

With this, the file-status-counting routine can check whether it needs to call lock_page_cgroup(). In most cases, it doesn't need to call it.

The following is perf data from a process which does mmap()/munmap() of 32MB of file cache in a minute.

Before patch:
    28.25%     mmap  mmap               [.] main
    22.64%     mmap  [kernel.kallsyms]  [k] page_fault
     9.96%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.67%     mmap  [kernel.kallsyms]  [k] filemap_fault
     3.50%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     2.99%     mmap  [kernel.kallsyms]  [k] __do_fault
     2.76%     mmap  [kernel.kallsyms]  [k] find_get_page

After patch:
    30.00%     mmap  mmap               [.] main
    23.78%     mmap  [kernel.kallsyms]  [k] page_fault
     5.52%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.81%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     3.26%     mmap  [kernel.kallsyms]  [k] find_get_page
     3.18%     mmap  [kernel.kallsyms]  [k] __do_fault
     3.03%     mmap  [kernel.kallsyms]  [k] filemap_fault
     2.40%     mmap  [kernel.kallsyms]  [k] handle_mm_fault
     2.40%     mmap  [kernel.kallsyms]  [k] do_page_fault

This patch reduces memcg's cost to some extent. (mem_cgroup_update_file_mapped is called by both map and unmap.)

Note: It seems some more improvements are required, but I have no idea what. Maybe removing the set/unset of the flag is required.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | memcg: fix race in file_mapped accouting flag management | KAMEZAWA Hiroyuki | 1 | -1/+2
Presently memory cgroup accounts file-mapped by a counter and a flag. The counter works in the same way as zone_stat, but the FileMapped flag only exists in memcg (for helping move_account).

This flag can be updated wrongly in one case. Assume CPU0 and CPU1, with a thread mapping a page on CPU0 and another thread unmapping it on CPU1.

    CPU0                            CPU1
                                    rmv rmap (mapcount 1->0)
    add rmap (mapcount 0->1)
    lock_page_cgroup()
    memcg counter+1                 (some delay)
    set MAPPED FLAG.
    unlock_page_cgroup()
                                    lock_page_cgroup()
                                    memcg counter-1
                                    clear MAPPED flag

In the above sequence the counter is properly updated but the flag is not. This means that representing a state by a flag which is maintained by a counter needs some special care.

To handle this, when clearing the flag, this patch checks mapcount directly and clears the flag only when mapcount == 0. (If mapcount > 0, someone will make it zero later and the flag will be cleared then.)

The reverse case, dec-after-inc, cannot be a problem because page_table_lock() works well for it. (In other words, to produce the above sequence, 2 processes have to touch the same page at once with map/unmap.)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
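The rule described in this message (clear the flag only when mapcount is zero) can be replayed in a single-threaded model of the CPU0/CPU1 interleaving. This is a sketch under stated assumptions: the struct, its field names and `update_file_mapped()` are hypothetical, and the locking is implied by calling the function once per critical section.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-page state involved in the race. */
struct page_model {
	int mapcount;           /* rmap count, changed outside the memcg lock */
	long memcg_counter;     /* memcg file-mapped counter */
	bool file_mapped;       /* memcg FileMapped flag */
};

/* One critical section of the accounting update; val is +1 or -1. */
static void update_file_mapped(struct page_model *pg, int val)
{
	pg->memcg_counter += val;
	if (val > 0)
		pg->file_mapped = true;
	else if (pg->mapcount == 0)     /* the fix: consult mapcount first */
		pg->file_mapped = false;
}
```

Replaying the sequence from the message (CPU1 unmaps, CPU0 maps and accounts, CPU1 accounts last) leaves the page mapped with the flag still set, whereas an unconditional clear in the last step would drop the flag on a mapped page.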
2010-10-27 | mm,x86: fix kmap_atomic_push vs ioremap_32.c | Peter Zijlstra | 1 | -1/+5
It appears i386 uses kmap_atomic infrastructure regardless of CONFIG_HIGHMEM which results in a compile error when highmem is disabled. Cure this by providing the needed few bits for both CONFIG_HIGHMEM and CONFIG_X86_32. Signed-off-by: Peter Zijlstra <[email protected]> Reported-by: Chris Wilson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-27 | MN10300: Save frame pointer in thread_info struct rather than global var | David Howells | 1 | -1/+1
Save the current exception frame pointer in the thread_info struct rather than in a global variable as the latter makes SMP tricky, especially when preemption is also enabled. This also replaces __frame with current_frame() and rearranges header file inclusions to make it all compile. Signed-off-by: David Howells <[email protected]> Acked-by: Akira Takeuchi <[email protected]>
2010-10-26 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 | Linus Torvalds | 2 | -6/+7
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  split invalidate_inodes()
  fs: skip I_FREEING inodes in writeback_sb_inodes
  fs: fold invalidate_list into invalidate_inodes
  fs: do not drop inode_lock in dispose_list
  fs: inode split IO and LRU lists
  fs: switch bdev inode bdi's correctly
  fs: fix buffer invalidation in invalidate_list
  fsnotify: use dget_parent
  smbfs: use dget_parent
  exportfs: use dget_parent
  fs: use RCU read side protection in d_validate
  fs: clean up dentry lru modification
  fs: split __shrink_dcache_sb
  fs: improve DCACHE_REFERENCED usage
  fs: use percpu counter for nr_dentry and nr_dentry_unused
  fs: simplify __d_free
  fs: take dcache_lock inside __d_path
  fs: do not assign default i_ino in new_inode
  fs: introduce a per-cpu last_ino allocator
  new helper: ihold()
  ...
2010-10-26 | kernel: remove PF_FLUSHER | Peter Zijlstra | 1 | -1/+1
PF_FLUSHER is only ever set, not tested, remove it. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26 | use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages | Jan Beulich | 1 | -1/+1
After all that's what they are intended for. Signed-off-by: Jan Beulich <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26 | replace nested max/min macros with {max,min}3 macro | Hagen Paul Pfeifer | 1 | -1/+1
Use the new {max,min}3 macros to save some cycles and bytes on the stack. This patch substitutes trivial nested macros with their counterpart. Signed-off-by: Hagen Paul Pfeifer <[email protected]> Cc: Joe Perches <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hartley Sweeten <[email protected]> Cc: Russell King <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Herbert Xu <[email protected]> Cc: Roland Dreier <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
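A three-way min/max macro of the kind referred to here can be sketched as follows. These are not the kernel's definitions: `min3_sketch` and `max3_sketch` are hypothetical names, and the code relies on the GNU C statement-expression and `__typeof__` extensions (gcc or clang).

```c
#include <assert.h>

/*
 * Each argument is evaluated exactly once into a local temporary, and
 * all three are compared in one flat expression instead of nesting one
 * two-way macro inside another.
 */
#define min3_sketch(x, y, z) ({                                 \
	__typeof__(x) _a = (x);                                 \
	__typeof__(y) _b = (y);                                 \
	__typeof__(z) _c = (z);                                 \
	_a < _b ? (_a < _c ? _a : _c) : (_b < _c ? _b : _c); })

#define max3_sketch(x, y, z) ({                                 \
	__typeof__(x) _a = (x);                                 \
	__typeof__(y) _b = (y);                                 \
	__typeof__(z) _c = (z);                                 \
	_a > _b ? (_a > _c ? _a : _c) : (_b > _c ? _b : _c); })
```

A nested form such as min(a, min(b, c)) creates one set of temporaries per macro level, which is presumably where the stack and cycle savings mentioned in the message come from.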
2010-10-26 | mm: do_migrate_range: reduce list_empty() check | Bob Liu | 1 | -12/+9
Simplify the code to reduce the number of list_empty(&source) checks. Signed-off-by: Bob Liu <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Wu Fengguang <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: do_migrate_range: exit loop if not_managed is trueBob Liu1-4/+6
If not_managed is true, all pages will be put back on the LRU, so break out of the loop early and skip isolating the remaining pages. Signed-off-by: Bob Liu <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Wu Fengguang <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: page_isolation: codeclean fix comment and rm unneeded val initBob Liu1-2/+1
__test_page_isolated_in_pageblock() returns 1 if all pages in the range are isolated, so fix the comment. Variable `pfn' will be initialised in the following loop so remove it. Signed-off-by: Bob Liu <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: fix is_mem_section_removable() page_order BUG_ON checkKAMEZAWA Hiroyuki1-1/+1
page_order() is called by memory hotplug's user interface to check whether a section is removable (is_mem_section_removable()). It calls page_order() without holding zone->lock. So, even if the caller does if (PageBuddy(page)) ret = page_order(page) ... the caller may hit BUG_ON(). There are 2 ways to fix this: 1. add zone->lock. 2. remove BUG_ON(). is_mem_section_removable() is used for some "advice" and doesn't need to be 100% accurate. is_removable() can be called via a user program, and we don't want to hold this important lock for long at a user's request. So, this patch removes BUG_ON(). Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Wu Fengguang <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm/hugetlb.c: add missing spin_lock() to hugetlb_cow()Dean Nelson1-1/+4
Add missing spin_lock() of the page_table_lock before an error return in hugetlb_cow(). Callers of hugetlb_cow() expect it to be held upon return. Signed-off-by: Dean Nelson <[email protected]> Cc: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
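The locking contract behind this fix can be sketched in user space. This is a minimal illustration, assuming a function that is entered with a lock held, drops it around some work, and must hold it again on every return path. `cow_sketch` and its `fail_alloc` parameter are hypothetical, and a pthread mutex stands in for page_table_lock.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in: called with *lock held, and callers expect
 * *lock to be held again on return, whether the call succeeds or fails.
 */
static int cow_sketch(pthread_mutex_t *lock, int fail_alloc, void **out)
{
    void *copy;

    /* Drop the lock while doing work that may take a while. */
    pthread_mutex_unlock(lock);

    copy = fail_alloc ? NULL : malloc(64);
    if (!copy) {
        /* The error path must re-take the lock before returning. */
        pthread_mutex_lock(lock);
        return -ENOMEM;
    }

    pthread_mutex_lock(lock);
    *out = copy;
    return 0;
}
```

Without the re-lock on the error path, the caller would go on to unlock a lock it no longer holds.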
2010-10-26mm: fix error reporting in move_pages() syscallGleb Natapov1-2/+2
The vma returned by find_vma does not necessarily include the target address. If this happens the code tries to follow a page outside of any vma and returns ENOENT instead of EFAULT. Signed-off-by: Gleb Natapov <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
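The lookup rule behind this bug can be shown with a simplified user-space model. The sketch assumes the lookup returns the first region that ends above the address, so the result may start above it too. `struct region`, `find_region` and `check_addr` are hypothetical names, not kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers the range [start, end). */
struct region {
    unsigned long start;
    unsigned long end;
};

/*
 * Mimics the lookup rule: return the first region whose end lies above
 * addr. The regions must be sorted by address. The result is not
 * guaranteed to contain addr.
 */
static const struct region *find_region(const struct region *r, size_t n,
                                        unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < r[i].end)
            return &r[i];
    return NULL;
}

/* The extra start check turns "address in a hole" into -EFAULT. */
static int check_addr(const struct region *r, size_t n, unsigned long addr)
{
    const struct region *hit = find_region(r, n, addr);

    if (!hit || addr < hit->start)
        return -EFAULT;
    return 0;
}
```

With regions [0x1000, 0x2000) and [0x4000, 0x5000), looking up 0x3000 returns the second region even though the address lies in the gap, which is why the start check is needed.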
2010-10-26/proc/swaps: support pollingKay Sievers1-1/+48
System management wants to subscribe to changes in swap configuration. Make /proc/swaps pollable like /proc/mounts. [[email protected]: document proc_poll_event] Signed-off-by: Kay Sievers <[email protected]> Acked-by: Greg KH <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
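A user-space consumer of this feature might look roughly like the sketch below. It assumes that a change is reported as an exceptional condition (POLLERR or POLLPRI), as poll(2) users of /proc/mounts see it; the exact event bits for /proc/swaps are an assumption here. `wait_for_change` is a hypothetical helper.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path and wait up to timeout_ms for a
 * change notification.
 * Returns 1 if an event was reported, 0 on timeout, -1 on error.
 */
static int wait_for_change(const char *path, int timeout_ms)
{
    struct pollfd pfd;
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    pfd.fd = fd;
    pfd.events = POLLERR | POLLPRI; /* assumed notification bits */
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    close(fd);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

A management daemon would typically keep the descriptor open, re-read the file after each event, and call poll() again, rather than reopening it as this one-shot sketch does.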
2010-10-26mm: add vzalloc() and vzalloc_node() helpersDave Young2-3/+92
Add vzalloc() and vzalloc_node() to encapsulate the vmalloc-then-memset-zero operation. Use __GFP_ZERO to zero fill the allocated memory. Signed-off-by: Dave Young <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: Greg Ungerer <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
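The pattern being encapsulated can be shown with a user-space analogy. In this sketch malloc() and memset() stand in for the open-coded vmalloc-then-memset sequence, and calloc() stands in for an allocator that returns zeroed memory directly. `alloc_then_zero` and `zalloc_sketch` are hypothetical names, not the kernel helpers.

```c
#include <stdlib.h>
#include <string.h>

/* Open-coded pattern at each call site: allocate, then zero by hand. */
static void *alloc_then_zero(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);
    return p;
}

/*
 * Helper form: one call that returns zero-filled memory. calloc() here
 * plays the role of asking the allocator itself for zeroed memory.
 */
static void *zalloc_sketch(size_t size)
{
    return calloc(1, size);
}
```

Call sites shrink from two statements plus a NULL check to a single call, and the zeroing can no longer be forgotten.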
2010-10-26mm/memory_hotplug.c: make scan_lru_pages() staticAndrew Morton1-1/+1
Reported-by: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26vmstat: include compaction.h when CONFIG_COMPACTIONNamhyung Kim1-0/+2
This removes following warning from sparse: mm/vmstat.c:466:5: warning: symbol 'fragmentation_index' was not declared. Should it be static? [[email protected]: move the include to top-of-file] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26vmalloc: annotate lock context change on s_start/stop()Namhyung Kim1-0/+2
s_start() and s_stop() grab/release vmlist_lock but were missing proper annotations. Add them. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26vmalloc: rename temporary variable in __insert_vmap_area()Namhyung Kim1-4/+4
Rename redundant 'tmp' to fix following sparse warnings: mm/vmalloc.c:296:34: warning: symbol 'tmp' shadows an earlier one mm/vmalloc.c:293:24: originally declared here Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26rmap: make anon_vma_chain_free() staticNamhyung Kim1-1/+1
Make anon_vma_chain_free() static. It is called only in rmap.c and the corresponding alloc function is already static. Signed-off-by: Namhyung Kim <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26rmap: wrap page_check_address() using __cond_lock()Namhyung Kim1-1/+1
The page_check_address() conditionally grabs *@ptlp in case of returning non-NULL. Renaming it and wrapping it with __cond_lock() removes the following warnings from sparse: mm/rmap.c:472:9: warning: context imbalance in 'page_mapped_in_vma' - unexpected unlock mm/rmap.c:524:9: warning: context imbalance in 'page_referenced_one' - unexpected unlock mm/rmap.c:706:9: warning: context imbalance in 'page_mkclean_one' - unexpected unlock mm/rmap.c:1066:9: warning: context imbalance in 'try_to_unmap_one' - unexpected unlock Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26rmap: annotate lock context change on page_[un]lock_anon_vma()Namhyung Kim1-1/+3
The page_lock_anon_vma() conditionally grabs RCU and anon_vma lock but page_unlock_anon_vma() releases them unconditionally. This leads sparse to complain about context imbalance. Annotate them. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: wrap follow_pte() using __cond_lock()Namhyung Kim1-1/+12
The follow_pte() conditionally grabs *@ptlp in case of returning 0. Renaming it and wrapping it with __cond_lock() removes the following warnings: mm/memory.c:2337:9: warning: context imbalance in 'do_wp_page' - unexpected unlock mm/memory.c:3142:19: warning: context imbalance in 'handle_mm_fault' - different lock contexts for basic block Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: add lock release annotation on do_wp_page()Namhyung Kim1-0/+1
The do_wp_page() releases @ptl but was missing proper annotation. Add it. This removes following warnings from sparse: mm/memory.c:2337:9: warning: context imbalance in 'do_wp_page' - unexpected unlock mm/memory.c:3142:19: warning: context imbalance in 'handle_mm_fault' - different lock contexts for basic block Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-26mm: wrap get_locked_pte() using __cond_lock()Namhyung Kim1-1/+1
The get_locked_pte() conditionally grabs 'ptl' in case of returning non-NULL. This leads sparse to complain about context imbalance. Rename and wrap it using __cond_lock() to make sparse happy. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
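The rename-and-wrap pattern used in these __cond_lock() commits can be sketched in user space. In the sketch the raw function keeps the real logic, and a macro with the original name tells the checker that the lock is held only when the call succeeds. `cond_lock_sketch`, `struct slot` and the two slot helpers are hypothetical; the `__CHECKER__` branch is an assumption about how sparse expects such an annotation to be written.

```c
#include <pthread.h>

#ifdef __CHECKER__
/* Under the static checker: record an acquire only when c is true. */
# define cond_lock_sketch(x, c) ((c) ? ({ __context__(x, 1); 1; }) : 0)
#else
/* Under a normal compiler the annotation disappears. */
# define cond_lock_sketch(x, c) (c)
#endif

/* Hypothetical object guarded by its own lock. */
struct slot {
    pthread_mutex_t lock;
    int present;
    int value;
};

/*
 * Renamed raw function: returns 1 with s->lock held if the slot is
 * present, or 0 with the lock already released.
 */
static int slot_lock_if_present_raw(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    if (s->present)
        return 1;
    pthread_mutex_unlock(&s->lock);
    return 0;
}

/* Wrapper with the original name: same result, plus the annotation. */
#define slot_lock_if_present(s) \
    cond_lock_sketch(&(s)->lock, slot_lock_if_present_raw(s))
```

Callers keep using the original name and unlock only on the success path, which is the shape the checker can then verify.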