aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2010-07-19Merge branch 'kmemleak' of ↵Linus Torvalds2-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm * 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm: kmemleak: Add support for NO_BOOTMEM configurations kmemleak: Annotate false positive in init_section_page_cgroup()
2010-07-16slub: Use kmem_cache flags to detect if slab is in debugging mode.Christoph Lameter1-21/+12
The cacheline with the flags is reachable from the hot paths after the percpu allocator changes went in. So there is no need anymore to put a flag into each slab page. Get rid of the SlubDebug flag and use the flags in kmem_cache instead. Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-16slub: Allow removal of slab caches during bootChristoph Lameter1-0/+7
If a slab cache is removed before we have setup sysfs then simply skip over the sysfs handling. Cc: Benjamin Herrenschmidt <[email protected]> Cc: Roland Dreier <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-16slub: Check kasprintf results in kmem_cache_init()Christoph Lameter1-3/+6
Small allocations may fail during slab bringup which is fatal. Add a BUG_ON() so that we fail immediately rather than failing later during sysfs processing. Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-16SLUB: Constants need ULChristoph Lameter1-2/+2
UL suffix is missing in some constants. Conform to how slab.h uses constants. Acked-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-16slub: Use a constant for a unspecified node.Christoph Lameter1-7/+7
kmalloc_node() and friends can be passed a constant -1 to indicate that no choice was made for the node from which the object needs to come. Use NUMA_NO_NODE instead of -1. CC: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-16SLOB: Free objects to their own listBob Liu1-1/+8
SLOB has alloced smaller objects from their own list in reduce overall external fragmentation and increase repeatability, free to their own list also. This is /proc/meminfo result in my test machine: without this patch: === MemTotal: 1030720 kB MemFree: 750012 kB Buffers: 15496 kB Cached: 160396 kB SwapCached: 0 kB Active: 105024 kB Inactive: 145604 kB Active(anon): 74816 kB Inactive(anon): 2180 kB Active(file): 30208 kB Inactive(file): 143424 kB Unevictable: 16 kB .... with this patch: === MemTotal: 1030720 kB MemFree: 751908 kB Buffers: 15492 kB Cached: 160280 kB SwapCached: 0 kB Active: 102720 kB Inactive: 146140 kB Active(anon): 73168 kB Inactive(anon): 2180 kB Active(file): 29552 kB Inactive(file): 143960 kB Unevictable: 16 kB ... The result shows an improvement of 1 MB! And when I tested it on a embeded system with 64 MB, I found this path is never called during kernel bootup. Acked-by: Matt Mackall <[email protected]> Signed-off-by: Bob Liu <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2010-07-14lmb: rename to memblockYinghai Lu3-0/+546
via following scripts FILES=$(find * -type f | grep -vE 'oprofile|[^K]config') sed -i \ -e 's/lmb/memblock/g' \ -e 's/LMB/MEMBLOCK/g' \ $FILES for N in $(find . -name lmb.[ch]); do M=$(echo $N | sed 's/lmb/memblock/g') mv $N $M done and remove some wrong change like lmbench and dlmb etc. also move memblock.c from lib/ to mm/ Suggested-by: Ingo Molnar <[email protected]> Acked-by: "H. Peter Anvin" <[email protected]> Acked-by: Benjamin Herrenschmidt <[email protected]> Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
2010-07-09x86, ioremap: Fix incorrect physical address handling in PAE modeKenji Kaneshige1-1/+1
Current x86 ioremap() doesn't handle physical address higher than 32-bit properly in X86_32 PAE mode. When physical address higher than 32-bit is passed to ioremap(), higher 32-bits in physical address is cleared wrongly. Due to this bug, ioremap() can map wrong address to linear address space. In my case, 64-bit MMIO region was assigned to a PCI device (ioat device) on my system. Because of the ioremap()'s bug, wrong physical address (instead of MMIO region) was mapped to linear address space. Because of this, loading ioatdma driver caused unexpected behavior (kernel panic, kernel hangup, ...). Signed-off-by: Kenji Kaneshige <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>
2010-07-08Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2-15/+5
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: writeback: simplify the write back thread queue writeback: split writeback_inodes_wb writeback: remove writeback_inodes_wbc fs-writeback: fix kernel-doc warnings splice: check f_mode for seekable file splice: direct_splice_actor() should not use pos in sd
2010-07-06writeback: simplify the write back thread queueChristoph Hellwig1-11/+3
First remove items from work_list as soon as we start working on them. This means we don't have to track any pending or visited state and can get rid of all the RCU magic freeing the work items - we can simply free them once the operation has finished. Second use a real completion for tracking synchronous requests - if the caller sets the completion pointer we complete it, otherwise use it as a boolean indicator that we can free the work item directly. Third unify struct wb_writeback_args and struct bdi_work into a single data structure, wb_writeback_work. Previous we set all parameters into a struct wb_writeback_args, copied it into struct bdi_work, copied it again on the stack to use it there. Instead of just allocate one structure dynamically or on the stack and use it all the way through the stack. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2010-07-06writeback: remove writeback_inodes_wbcChristoph Hellwig2-4/+2
This was just an odd wrapper around writeback_inodes_wb. Removing this also allows to get rid of the bdi member of struct writeback_control which was rather out of place there. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2010-07-05Merge commit 'v2.6.35-rc4' into perf/coreIngo Molnar3-8/+10
Merge reason: Pick up the latest perf fixes Signed-off-by: Ingo Molnar <[email protected]>
2010-07-01Merge branch 'linus' into core/rcuIngo Molnar4-14/+40
Conflicts: fs/fs-writeback.c Merge reason: Resolve the conflict Note, i picked the version from Linus's tree, which effectively reverts the fs-writeback.c bits of: b97181f: fs: remove all rcu head initializations, except on_stack initializations As the upstream changes to this file changed this code heavily and the first attempt to resolve the conflict resulted in a non-booting kernel. It's safer to re-try this portion of the commit cleanly. Signed-off-by: Ingo Molnar <[email protected]>
2010-06-29mempolicy: fix dangling reference to tmpfs superblock mpolLee Schermerhorn1-4/+5
My patch to "Factor out duplicate put/frees in mpol_shared_policy_init() to a common return path"; and Dan Carpenter's fix thereto both left a dangling reference to the incoming tmpfs superblock mempolicy structure. A similar leak was introduced earlier when the nodemask was moved offstack to the scratch area despite the note in the comment block regarding the incoming ref. Move the remaining 'put of the incoming "mpol" to the common exit path to drop the reference. Signed-off-by: Lee Schermerhorn <[email protected]> Acked-by: Dan Carpenter <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-06-29memcg: fix wake up in oom wait queueKAMEZAWA Hiroyuki1-1/+3
OOM-waitqueue should be waken up when oom_disable is canceled. This is a fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status"). How to test: Create a cgroup A... 1. set memory.limit and memory.memsw.limit to be small value 2. echo 1 > /cgroup/A/memory.oom_control, this disables oom-kill. 3. run a program which must cause OOM. A program executed in 3 will sleep by oom_waiqueue in memcg. Then, how to wake it up is problem. 1. echo 0 > /cgroup/A/memory.oom_control (enable OOM-killer) 2. echo big mem > /cgroup/A/memory.memsw.limit_in_bytes(allow more swap) etc.. Without the patch, a task in slept can not be waken up. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-06-29Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-3/+2
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Don't count_vm_events for discard bio in submit_bio. cfq: fix recursive call in cfq_blkiocg_update_completion_stats() cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n cfq: Don't allow queue merges for queues that have no process references block: fix DISCARD_BARRIER requests cciss: set SCSI max cmd len to 16, as default is wrong cpqarray: fix two more wrong section type cpqarray: fix wrong __init type on pci probe function drbd: Fixed a race between disk-attach and unexpected state changes writeback: fix pin_sb_for_writeback writeback: add missing requeue_io in writeback_inodes_wb writeback: simplify and split bdi_start_writeback writeback: simplify wakeup_flusher_threads writeback: fix writeback_inodes_wb from writeback_inodes_sb writeback: enforce s_umount locking in writeback_inodes_sb writeback: queue work on stack in writeback_inodes_sb writeback: fix writeback completion notifications
2010-06-28Merge branch 'linus' into perf/coreThomas Gleixner1-6/+30
Reason: Further changes conflict with upstream fixes Signed-off-by: Thomas Gleixner <[email protected]>
2010-06-27percpu: allow limited allocation before slab is onlineTejun Heo1-12/+40
This patch updates percpu allocator such that it can serve limited amount of allocation before slab comes online. This is primarily to allow slab to depend on working percpu allocator. Two parameters, PERCPU_DYNAMIC_EARLY_SIZE and SLOTS, determine how much memory space and allocation map slots are reserved. If this reserved area is exhausted, WARN_ON_ONCE() will trigger and allocation will fail till slab comes online. The following changes are made to implement early alloc. * pcpu_mem_alloc() now checks slab_is_available() * Chunks are allocated using pcpu_mem_alloc() * Init paths make sure ai->dyn_size is at least as large as PERCPU_DYNAMIC_EARLY_SIZE. * Initial alloc maps are allocated in __initdata and copied to kmalloc'd areas once slab is online. Signed-off-by: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]>
2010-06-27percpu: make @dyn_size always mean min dyn_size in first chunk init functionsTejun Heo1-25/+10
In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum size. There's no use case for forcing 0 and the upcoming early alloc support always requires non-zero dynamic size. Make @dyn_size always mean minimum dyn_size. While at it, make pcpu_build_alloc_info() static which doesn't have any external caller as suggested by David Rientjes. Signed-off-by: Tejun Heo <[email protected]> Cc: David Rientjes <[email protected]>
2010-06-18percpu: fix first chunk match in per_cpu_ptr_to_phys()Tejun Heo1-3/+28
per_cpu_ptr_to_phys() determines whether the passed in @addr belongs to the first_chunk or not by just matching the address against the address range of the base unit (unit0, used by cpu0). When an adress from another cpu was passed in, it will always determine that the address doesn't belong to the first chunk even when it does. This makes the function return a bogus physical address which may lead to crash. This problem was discovered by Cliff Wickman while investigating a crash during kdump on a SGI UV system. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Cliff Wickman <[email protected]> Tested-by: Cliff Wickman <[email protected]> Cc: [email protected]
2010-06-18Merge commit 'v2.6.35-rc3' into perf/coreIngo Molnar3-27/+41
Merge reason: Go from -rc1 base to -rc3 base, merge in fixes.
2010-06-17percpu: fix trivial bugs in pcpu_build_alloc_info()Pavel V. Panteleev1-3/+2
Fix the following two trivial bugs in pcpu_build_alloc_info() * we should memset group_cnt to 0 by size of group_cnt, not size of group_map (both are of the same size, so the bug isn't dangerous) * we can delete useless variable group_cnt_max. Signed-off-by: Pavel V. Panteleev <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2010-06-14mm: remove all rcu head initializationsPaul E. McKenney2-2/+0
Remove all rcu head inits. We don't care about the RCU head state before passing it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can keep track of objects on stack. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Matt Mackall <[email protected]> Cc: Andrew Morton <[email protected]>
2010-06-11writeback: simplify and split bdi_start_writebackChristoph Hellwig1-3/+2
bdi_start_writeback now never gets a superblock passed, so we can just remove that case. And to further untangle the code and flatten the call stack split it into two trivial helpers for it's two callers. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2010-06-09Merge branch 'perf/core' of ↵Ingo Molnar3-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core
2010-06-09tracing: Remove kmemtrace ftrace pluginLi Zefan3-3/+3
We have been resisting new ftrace plugins and removing existing ones, and kmemtrace has been superseded by kmem trace events and perf-kmem, so we remove it. Signed-off-by: Li Zefan <[email protected]> Acked-by: Pekka Enberg <[email protected]> Acked-by: Eduard - Gabriel Munteanu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Steven Rostedt <[email protected]> [ remove kmemtrace from the makefile, handle slob too ] Signed-off-by: Frederic Weisbecker <[email protected]>
2010-06-09perf: Add non-exec mmap() trackingEric B Munson1-1/+5
Add the capacility to track data mmap()s. This can be used together with PERF_SAMPLE_ADDR for data profiling. Signed-off-by: Anton Blanchard <[email protected]> [Updated code for stable perf ABI] Signed-off-by: Eric B Munson <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Steven Rostedt <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-06-08writeback: limit write_cache_pages integrity scanning to current EOFDave Chinner1-0/+15
sync can currently take a really long time if a concurrent writer is extending a file. The problem is that the dirty pages on the address space grow in the same direction as write_cache_pages scans, so if the writer keeps ahead of writeback, the writeback will not terminate until the writer stops adding dirty pages. For a data integrity sync, we only need to write the pages dirty at the time we start the writeback, so we can stop scanning once we get to the page that was at the end of the file at the time the scan started. This will prevent operations like copying a large file preventing sync from completing as it will not write back pages that were dirtied after the sync was started. This does not impact the existing integrity guarantees, as any dirty page (old or new) within the EOF range at the start of the scan will still be captured. This patch will not prevent sync from blocking on large writes into holes. That requires more complex intervention while this patch only addresses the common append-case of this sync holdoff. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-06-08writeback: pay attention to wbc->nr_to_write in write_cache_pagesDave Chinner1-10/+5
If a filesystem writes more than one page in ->writepage, write_cache_pages fails to notice this and continues to attempt writeback when wbc->nr_to_write has gone negative - this trace was captured from XFS: wbc_writeback_start: towrt=1024 wbc_writepage: towrt=1024 wbc_writepage: towrt=0 wbc_writepage: towrt=-1 wbc_writepage: towrt=-5 wbc_writepage: towrt=-21 wbc_writepage: towrt=-85 This has adverse effects on filesystem writeback behaviour. write_cache_pages() needs to terminate after a certain number of pages are written, not after a certain number of calls to ->writepage are made. This is a regression introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add no_nrwrite_index_update writeback control flag"), but cannot be reverted directly due to subsequent bug fixes that have gone in on top of it. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-06-04Merge branch 'for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Minix: Clean up left over label fix truncate inode time modification breakage fix setattr error handling in sysfs, configfs fcntl: return -EFAULT if copy_to_user fails wrong type for 'magic' argument in simple_fill_super() fix the deadlock in qib_fs mqueue doesn't need make_bad_inode()
2010-06-04Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-2/+2
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits) block: make blk_init_free_list and elevator_init idempotent block: avoid unconditionally freeing previously allocated request_queue pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface pipe: change the privilege required for growing a pipe beyond system max pipe: adjust minimum pipe size to 1 page block: disable preemption before using sched_clock() cciss: call BUG() earlier Preparing 8.3.8rc2 drbd: Reduce verbosity drbd: use drbd specific ratelimit instead of global printk_ratelimit drbd: fix hang on local read errors while disconnected drbd: Removed the now empty w_io_error() function drbd: removed duplicated #includes drbd: improve usage of MSG_MORE drbd: need to set socket bufsize early to take effect drbd: improve network latency, TCP_QUICKACK drbd: Revert "drbd: Create new current UUID as late as possible" brd: support discard Revert "writeback: fix WB_SYNC_NONE writeback from umount" Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync" ...
2010-06-04vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failureKOSAKI Motohiro1-13/+16
Greg Thelen reported recent Johannes's stack diet patch makes kernel hang. His test is following. mount -t cgroup none /cgroups -o memory mkdir /cgroups/cg1 echo $$ > /cgroups/cg1/tasks dd bs=1024 count=1024 if=/dev/null of=/data/foo echo $$ > /cgroups/tasks echo 1 > /cgroups/cg1/memory.force_empty Actually, This OOM hard to try logic have been corrupted since following two years old patch. commit a41f24ea9fd6169b147c53c2392e2887cc1d9247 Author: Nishanth Aravamudan <[email protected]> Date: Tue Apr 29 00:58:25 2008 -0700 page allocator: smarter retry of costly-order allocations Original intention was "return success if the system have shrinkable zones though priority==0 reclaim was failure". But the above patch changed to "return nr_reclaimed if .....". Oh, That forgot nr_reclaimed may be 0 if priority==0 reclaim failure. And Johannes's patch 0aeb2339e54e ("vmscan: remove all_unreclaimable scan control") made it more corrupt. Originally, priority==0 reclaim failure on memcg return 0, but this patch changed to return 1. It totally confused memcg. This patch fixes it completely. Reported-by: Greg Thelen <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Tested-by: Greg Thelen <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-06-04fix truncate inode time modification breakageNick Piggin1-2/+3
mtime and ctime should be changed only if the file size has actually changed. Patches changing ext2 and tmpfs from vmtruncate to new truncate sequence has caused regressions where they always update timestamps. There is some strange cases in POSIX where truncate(2) must not update times unless the size has acutally changed, see 6e656be89. This area is all still rather buggy in different ways in a lot of filesystems and needs a cleanup and audit (ideally the vfs will provide a simple attribute or call to direct all filesystems exactly which attributes to change). But coming up with the best solution will take a while and is not appropriate for rc anyway. So fix recent regression for now. Signed-off-by: Nick Piggin <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-06-01Merge branch 'master' into for-linusJens Axboe27-761/+2574
Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <[email protected]>
2010-06-01Revert "writeback: fix WB_SYNC_NONE writeback from umount"Jens Axboe1-2/+2
This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <[email protected]>
2010-05-30Merge branch 'slub/urgent' of ↵Linus Torvalds1-22/+11
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slub/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLUB: Allow full duplication of kmalloc array for 390 slub: move kmem_cache_node into it's own cacheline
2010-05-30Merge branch 'for-linus' of ↵Linus Torvalds2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: mm: export generic_pipe_buf_*() to modules fuse: support splice() reading from fuse device fuse: allow splice to move pages mm: export remove_from_page_cache() to modules mm: export lru_cache_add_*() to modules fuse: support splice() writing to fuse device fuse: get page reference for readpages fuse: use get_user_pages_fast() fuse: remove unneeded variable
2010-05-27tmpfs: convert to use the new truncate convention[email protected]1-21/+22
Cc: Christoph Hellwig <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Nick Piggin <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-05-27fs: introduce new truncate sequence[email protected]1-5/+5
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used. simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away. simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache). To implement the new truncate sequence: - filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate. - vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code. - convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous). - inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode. - make use of the better opportunity to handle errors with the new sequence. Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation). Cc: Christoph Hellwig <[email protected]> Acked-by: Jan Kara <[email protected]> Signed-off-by: Nick Piggin <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-05-27rename the generic fsync implementationsChristoph Hellwig1-1/+1
We don't name our generic fsync implementations very well currently. The no-op implementation for in-memory filesystems currently is called simple_sync_file which doesn't make too much sense to start with, the the generic one for simple filesystems is called simple_fsync which can lead to some confusion. This patch renames the generic file fsync method to generic_file_fsync to match the other generic_file_* routines it is supposed to be used with, and the no-op implementation to noop_fsync to make it obvious what to expect. In addition add some documentation for both methods. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-05-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds1-5/+31
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits) Btrfs: add more error checking to btrfs_dirty_inode Btrfs: allow unaligned DIO Btrfs: drop verbose enospc printk Btrfs: Fix block generation verification race Btrfs: fix preallocation and nodatacow checks in O_DIRECT Btrfs: avoid ENOSPC errors in btrfs_dirty_inode Btrfs: move O_DIRECT space reservation to btrfs_direct_IO Btrfs: rework O_DIRECT enospc handling Btrfs: use async helpers for DIO write checksumming Btrfs: don't walk around with task->state != TASK_RUNNING Btrfs: do aio_write instead of write Btrfs: add basic DIO read/write support direct-io: do not merge logically non-contiguous requests direct-io: add a hook for the fs to provide its own submit_bio function fs: allow short direct-io reads to be completed via buffered IO Btrfs: Metadata ENOSPC handling for balance Btrfs: Pre-allocate space for data relocation Btrfs: Metadata ENOSPC handling for tree log Btrfs: Metadata reservation for orphan inodes Btrfs: Introduce global metadata reservation ...
2010-05-27numa: slab: use numa_mem_id() for slab local memory nodeLee Schermerhorn1-21/+22
Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Nick Piggin <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Whitney <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27numa: introduce numa_mem_id()- effective local memory node idLee Schermerhorn1-1/+44
Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes. Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them. Archs can override definitions of: numa_mem_id() - returns node number of "local memory" node set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem' cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line. [[email protected]: fix build] Signed-off-by: Lee Schermerhorn <[email protected]> Signed-off-by: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Nick Piggin <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Whitney <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27numa: add generic percpu var numa_node_id() implementationLee Schermerhorn1-0/+5
Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Nick Piggin <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eric Whitney <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27slab: convert cpu notifier to return encapsulate errno valueAkinobu Mita1-1/+1
By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for slab. Signed-off-by: Akinobu Mita <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Matt Mackall <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27cpusets: new round-robin rotor for SLAB allocationsJack Steiner1-1/+1
We have observed several workloads running on multi-node systems where memory is assigned unevenly across the nodes in the system. There are numerous reasons for this but one is the round-robin rotor in cpuset_mem_spread_node(). For example, a simple test that writes a multi-page file will allocate pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it allocates on odd nodes & skips even nodes). An example is shown below. The program "lfile" writes a file consisting of 10 pages. The program then mmaps the file & uses get_mempolicy(..., MPOL_F_NODE) to determine the nodes where the file pages were allocated. The output is shown below: # ./lfile allocated on nodes: 2 4 6 0 1 2 6 0 2 There is a single rotor that is used for allocating both file pages & slab pages. Writing the file allocates both a data page & a slab page (buffer_head). This advances the RR rotor 2 nodes for each page allocated. A quick confirmation seems to confirm this is the cause of the uneven allocation: # echo 0 >/dev/cpuset/memory_spread_slab # ./lfile allocated on nodes: 6 7 8 9 0 1 2 3 4 5 This patch introduces a second rotor that is used for slab allocations. Signed-off-by: Jack Steiner <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Paul Menage <[email protected]> Cc: Jack Steiner <[email protected]> Cc: Robin Holt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27memcg: clean up memory thresholdsKirill A. Shutemov1-85/+66
Introduce struct mem_cgroup_thresholds. It helps to reduce number of checks of thresholds type (memory or mem+swap). [[email protected]: repair comment] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Phil Carmody <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Paul Menage <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27cgroups: make cftype.unregister_event() void-returningKirill A. Shutemov1-24/+41
Since we are unable to handle an error returned by cftype.unregister_event() properly, let's make the callback void-returning. mem_cgroup_unregister_event() has been rewritten to be a "never fail" function. On mem_cgroup_usage_register_event() we save old buffer for thresholds array and reuse it in mem_cgroup_usage_unregister_event() to avoid allocation. Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Phil Carmody <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Paul Menage <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27memcg: fix mis-accounting of file mapped racy with migration[email protected]2-39/+98
FILE_MAPPED per memcg of migrated file cache is not properly updated, because our hook in page_add_file_rmap() can't know to which memcg FILE_MAPPED should be counted. Basically, this patch is for fixing the bug but includes some big changes to fix up other messes. Now, at migrating mapped file, events happen in following sequence. 1. allocate a new page. 2. get memcg of an old page. 3. charge ageinst a new page before migration. But at this point, no changes to new page's page_cgroup, no commit for the charge. (IOW, PCG_USED bit is not set.) 4. page migration replaces radix-tree, old-page and new-page. 5. page migration remaps the new page if the old page was mapped. 6. Here, the new page is unlocked. 7. memcg commits the charge for newpage, Mark the new page's page_cgroup as PCG_USED. Because "commit" happens after page-remap, we can count FILE_MAPPED at "5", because we should avoid to trust page_cgroup->mem_cgroup. if PCG_USED bit is unset. (Note: memcg's LRU removal code does that but LRU-isolation logic is used for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is not on LRU or page_cgroup->mem_cgroup is NULL.) We can lose file_mapped accounting information at 5 because FILE_MAPPED is updated only when mapcount changes 0->1. So we should catch it. BTW, historically, above implemntation comes from migration-failure of anonymous page. Because we charge both of old page and new page with mapcount=0, we can't catch - the page is really freed before remap. - migration fails but it's freed before remap or .....corner cases. New migration sequence with memcg is: 1. allocate a new page. 2. mark PageCgroupMigration to the old page. 3. charge against a new page onto the old page's memcg. (here, new page's pc is marked as PageCgroupUsed.) 4. page migration replaces radix-tree, page table, etc... 5. At remapping, new page's page_cgroup is now makrked as "USED" We can catch 0->1 event and FILE_MAPPED will be properly updated. And we can catch SWAPOUT event after unlock this and freeing this page by unmap() can be caught. 7. Clear PageCgroupMigration of the old page. So, FILE_MAPPED will be correctly updated. Then, for what MIGRATION flag is ? Without it, at migration failure, we may have to charge old page again because it may be fully unmapped. "charge" means that we have to dive into memory reclaim or something complated. So, it's better to avoid charge it again. Before this patch, __commit_charge() was working for both of the old/new page and fixed up all. But this technique has some racy condtion around FILE_MAPPED and SWAPOUT etc... Now, the kernel use MIGRATION flag and don't uncharge old page until the end of migration. I hope this change will make memcg's page migration much simpler. This page migration has caused several troubles. Worth to add a flag for simplification. Reviewed-by: Daisuke Nishimura <[email protected]> Tested-by: Daisuke Nishimura <[email protected]> Reported-by: Daisuke Nishimura <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>