aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2010-03-06includecheck fix for kernel/params.cJaswinder Singh Rajput1-1/+0
Fix the following 'make includecheck' warning: kernel/params.c: linux/string.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <[email protected]> Cc: AndrĂ© Goddard Rosa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06splice: comparing unsigned int < 0Dan Carpenter1-2/+3
"ret" needs to be signed or the error handling for splice_to_pipe() won't work correctly. Signed-off-by: Dan Carpenter <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06lkdtm: add debugfs access and loosen KPROBE tiesSimon Kagstrom3-85/+430
Add adds a debugfs interface and additional failure modes to LKDTM to provide similar functionality to the provoke-crash driver submitted here: http://lwn.net/Articles/371208/ Crashes can now be induced either through module parameters (as before) or through the debugfs interface as in provoke-crash. The patch also provides a new "direct" interface, where KPROBES are not used, i.e., the crash is invoked directly upon write to the debugfs file. When built without KPROBES configured, only this mode is available. Signed-off-by: Simon Kagstrom <[email protected]> Cc: M. Mohan Kumar <[email protected]> Cc: Americo Wang <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "Eric W. Biederman" <[email protected]>, Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06eisa: fix coding style for eisa bus codeThadeu Lima de Souza Cascardo1-112/+128
Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]> Cc: Kay Sievers <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Alan Cox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06drivers/misc/iwmc3200top/main.c: eliminate useless codeJulia Lawall1-2/+0
The variable priv is initialized twice to the same (side effect-free) expression. Drop one initialization. A simplified version of the semantic match that finds this problem is: (http://coccinelle.lip6.fr/) // <smpl> @forall@ idexpression *x; identifier f!=ERR_PTR; @@ x = f(...) ... when != x ( x = f(...,<+...x...+>,...) | * x = f(...) ) // </smpl> Signed-off-by: Julia Lawall <[email protected]> Cc: Tomas Winkler <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06init/main.c: make setup_max_cpus static for !SMPH Hartley Sweeten1-1/+1
The only in tree external users of the symbol setup_max_cpus are in arch/x86/. The files ./kernel/alternative.c, ./kernel/visws_quirks.c, and ./mm/kmemcheck/kmemcheck.c are all guarded by CONFIG_SMP being defined. For this case the symbol is an unsigned int and declared as an extern in include/linux/smp.h. When CONFIG_SMP is not defined the symbol setup_max_cpus is a constant value that is only used in init/main.c. Make the symbol static for this case. Signed-off-by: H Hartley Sweeten <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06smp: fix documentation in include/linux/smp.hRakib Mullick1-1/+1
smp: Fix documentation. Fix documentation in include/linux/smp.h: smp_processor_id() Signed-off-by: Rakib Mullick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06nodemask.h: remove macro any_online_nodeH Hartley Sweeten3-15/+4
The macro any_online_node() is prone to producing sparse warnings due to the local symbol 'node'. Since all the in-tree users are really requesting the first online node (the mask argument is either NODE_MASK_ALL or node_online_map) just use the first_online_node macro and remove the any_online_node macro since there are no users. Signed-off-by: H Hartley Sweeten <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Acked-by: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Milton Miller <[email protected]> Cc: Nathan Fontenot <[email protected]> Cc: Geoff Levand <[email protected]> Cc: Grant Likely <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Neil Brown <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: David S. Miller <[email protected]> Cc: Benny Halevy <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Ricardo Labiaga <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06fs: use rlimit helpersJiri Slaby8-12/+12
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06cpumask: let num_*_cpus() function always return unsigned valuesHeiko Carstens1-4/+4
Dependent on CONFIG_SMP the num_*_cpus() functions return unsigned or signed values. Let them always return unsigned values to avoid strange casts. Fixes at least one warning: kernel/kprobes.c: In function 'register_kretprobe': kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Heiko Carstens <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ananth N Mavinakayanahalli <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06init/initramfs.c: fix "symbol shadows an earlier one" noiseH Hartley Sweeten1-6/+6
The symbol 'count' is a local global variable in this file. The function clean_rootfs() should use a different symbol name to prevent "symbol shadows an earlier one" noise. Signed-off-by: H Hartley Sweeten <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06init/main.c: improve usability in case of init binary failureAndreas Mohr2-1/+51
- new Documentation/init.txt file describing various forms of failure trying to load the init binary after kernel bootup - extend the init/main.c init failure message to direct to Documentation/init.txt Signed-off-by: Andreas Mohr <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06kernel/cpu.c: delete deprecated definition in cpu_up()Chen Gong1-1/+1
Additional_cpus is only supported for IA64 now. X86_64 should not be included. Signed-off-by: Chen Gong <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06MFGPT: move clocksource menuRandy Dunlap3-11/+9
Move the CS5535 MFGPT hrtimer kconfig option to be with the other MFGPT options. This makes it easier to find and also removes it from the main "Device Drivers" menu, where it should not have been. Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Andres Salomon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06um: tell git to ignore generated filesWANG Cong2-0/+4
Tell git to ignore the generated files under um, except: include/shared/kern_constants.h include/shared/user_constants.h which will be moved to include/generated. Signed-off-by: WANG Cong <[email protected]> Cc: Al Viro <[email protected]> Cc: Jeff Dike <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06uml: line.c: avoid NULL pointer dereferenceAlexander Beregalov1-2/+2
Assign tty only if line is not NULL. [[email protected]: simplification] Signed-off-by: Alexander Beregalov <[email protected]> Cc: Jeff Dike <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06cris v32: typo in crisv32_arbiter_unwatch()?Roel Kluin1-1/+1
With id 1 the wrong bp was unwatched. Signed-off-by: Roel Kluin <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06cryptocop: fix assertion in create_output_descriptors()Roel Kluin1-1/+1
size_t desc_len cannot be less than 0, test before the subtraction. Signed-off-by: Roel Kluin <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06cris: convert to use arch_gettimeoffset()john stultz2-66/+8
Convert cris to use GENERIC_TIME via the arch_getoffset() infrastructure, reducing the amount of arch specific code we need to maintain. Signed-off-by: John Stultz <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06cpuidle menu: remove 8 bytes of padding on 64 bit buildsRichard Kennedy1-1/+1
Reorder struct menu_device to remove 8 bytes of padding on 64 bit builds. Size drops from 136 to 128 bytes, so possibly needing one fewer cache lines. Signed-off-by: Richard Kennedy <[email protected]> Cc: Arjan van de Ven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06alpha: PTR_ERR overwrites -EINVAL in syscall osf_mountRoel Kluin1-1/+2
The initial -EINVAL value is overwritten by `retval = PTR_ERR(name)'. If this isn't an error pointer and typenr is not 1, 6 or 9, then this retval, a pointer cast to a long, is returned. Signed-off-by: Roel Kluin <[email protected]> Acked-by: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06frv: remove pci_dma_sync_single() and pci_dma_sync_sg()FUJITA Tomonori1-37/+0
No architecture except for frv has pci_dma_sync_single() and pci_dma_sync_sg(). The APIs are deprecated. Signed-off-by: FUJITA Tomonori <[email protected]> Acked-by: David S. Miller <[email protected]> Acked-by: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: add comment on swap_duplicate's error codeHugh Dickins1-1/+5
swap_duplicate()'s loop appears to miss out on returning the error code from __swap_duplicate(), except when that's -ENOMEM. In fact this is intentional: prior to -ENOMEM for swap_count_continuation, swap_duplicate() was void (and the case only occurs when copy_one_pte() hits a corrupt pte). But that's surprising behaviour, which certainly deserves a comment. Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Huang Shijie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06nommu: get_user_pages(): pin last page on non-page-aligned startSteven J. Magnani1-2/+2
The noMMU version of get_user_pages() fails to pin the last page when the start address isn't page-aligned. The patch fixes this in a way that makes find_extend_vma() congruent to its MMU cousin. Signed-off-by: Steven J. Magnani <[email protected]> Acked-by: Paul Mundt <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: use the same log level for show_mem()Amerigo Wang1-7/+7
Use the same log level for printk's in show_mem(), so that those messages can be shown completely when using log level 6. Signed-off-by: WANG Cong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: add comment about deprecation of __GFP_NOFAILDavid Rientjes1-1/+2
__GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new users should be added. Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06vmscan: detect mapped file pages used only onceJohannes Weiner3-14/+36
The VM currently assumes that an inactive, mapped and referenced file page is in use and promotes it to the active list. However, every mapped file page starts out like this and thus a problem arises when workloads create a stream of such pages that are used only for a short time. By flooding the active list with those pages, the VM quickly gets into trouble finding eligible reclaim canditates. The result is long allocation latencies and eviction of the wrong pages. This patch reuses the PG_referenced page flag (used for unmapped file pages) to implement a usage detection that scales with the speed of LRU list cycling (i.e. memory pressure). If the scanner encounters those pages, the flag is set and the page cycled again on the inactive list. Only if it returns with another page table reference it is activated. Otherwise it is reclaimed as 'not recently used cache'. This effectively changes the minimum lifetime of a used-once mapped file page from a full memory cycle to an inactive list cycle, which allows it to occur in linear streams without affecting the stable working set of the system. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: OSAKI Motohiro <[email protected]> Cc: Lee Schermerhorn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06vmscan: drop page_mapping_inuse()Johannes Weiner1-23/+2
page_mapping_inuse() is a historic predicate function for pages that are about to be reclaimed or deactivated. According to it, a page is in use when it is mapped into page tables OR part of swap cache OR backing an mmapped file. This function is used in combination with page_referenced(), which checks for young bits in ptes and the page descriptor itself for the PG_referenced bit. Thus, checking for unmapped swap cache pages is meaningless as PG_referenced is not set for anonymous pages and unmapped pages do not have young ptes. The test makes no difference. Protecting file pages that are not by themselves mapped but are part of a mapped file is also a historic leftover for short-lived things like the exec() code in libc. However, the VM now does reference accounting and activation of pages at unmap time and thus the special treatment on reclaim is obsolete. This patch drops page_mapping_inuse() and switches the two callsites to use page_mapped() directly. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: OSAKI Motohiro <[email protected]> Cc: Lee Schermerhorn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06vmscan: factor out page reference checksJohannes Weiner1-13/+43
The used-once mapped file page detection patchset. It is meant to help workloads with large amounts of shortly used file mappings, like rtorrent hashing a file or git when dealing with loose objects (git gc on a bigger site?). Right now, the VM activates referenced mapped file pages on first encounter on the inactive list and it takes a full memory cycle to reclaim them again. When those pages dominate memory, the system no longer has a meaningful notion of 'working set' and is required to give up the active list to make reclaim progress. Obviously, this results in rather bad scanning latencies and the wrong pages being reclaimed. This patch makes the VM be more careful about activating mapped file pages in the first place. The minimum granted lifetime without another memory access becomes an inactive list cycle instead of the full memory cycle, which is more natural given the mentioned loads. This test resembles a hashing rtorrent process. Sequentially, 32MB chunks of a file are mapped into memory, hashed (sha1) and unmapped again. While this happens, every 5 seconds a process is launched and its execution time taken: python2.4 -c 'import pydoc' old: max=2.31s mean=1.26s (0.34) new: max=1.25s mean=0.32s (0.32) find /etc -type f old: max=2.52s mean=1.44s (0.43) new: max=1.92s mean=0.12s (0.17) vim -c ':quit' old: max=6.14s mean=4.03s (0.49) new: max=3.48s mean=2.41s (0.25) mplayer --help old: max=8.08s mean=5.74s (1.02) new: max=3.79s mean=1.32s (0.81) overall hash time (stdev): old: time=1192.30 (12.85) thruput=25.78mb/s (0.27) new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%) I also tested kernbench with regular IO streaming in the background to see whether the delayed activation of frequently used mapped file pages had a negative impact on performance in the presence of pressure on the inactive list. The patch made no significant difference in timing, neither for kernbench nor for the streaming IO throughput. The first patch submission raised concerns about the cost of the extra faults for actually activated pages on machines that have no hardware support for young page table entries. I created an artificial worst case scenario on an ARM machine with around 300MHz and 64MB of memory to figure out the dimensions involved. The test would mmap a file of 20MB, then 1. touch all its pages to fault them in 2. force one full scan cycle on the inactive file LRU -- old: mapping pages activated -- new: mapping pages inactive 3. touch the mapping pages again -- old and new: fault exceptions to set the young bits 4. force another full scan cycle on the inactive file LRU 5. touch the mapping pages one last time -- new: fault exceptions to set the young bits The test showed an overall increase of 6% in time over 100 iterations of the above (old: ~212sec, new: ~225sec). 13 secs total overhead / (100 * 5k pages), ignoring the execution time of the test itself, makes for about 25us overhead for every page that gets actually activated. Note: 1. File mapping the size of one third of main memory, _completely_ in active use across memory pressure - i.e., most pages referenced within one LRU cycle. This should be rare to non-existant, especially on such embedded setups. 2. Many huge activation batches. Those batches only occur when the working set fluctuates. If it changes completely between every full LRU cycle, you have problematic reclaim overhead anyway. 3. Access of activated pages at maximum speed: sequential loads from every single page without doing anything in between. In reality, the extra faults will get distributed between actual operations on the data. So even if a workload manages to get the VM into the situation of activating a third of memory in one go on such a setup, it will take 2.2 seconds instead 2.1 without the patch. Comparing the numbers (and my user-experience over several months), I think this change is an overall improvement to the VM. Patch 1 is only refactoring to break up that ugly compound conditional in shrink_page_list() and make it easy to document and add new checks in a readable fashion. Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not strictly related to #3, but it was in the original submission and is a net simplification, so I kept it. Patch 3 implements used-once detection of mapped file pages. This patch: Moving the big conditional into its own predicate function makes the code a bit easier to read and allows for better commenting on the checks one-by-one. This is just cleaning up, no semantics should have been changed. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: OSAKI Motohiro <[email protected]> Cc: Lee Schermerhorn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: document /sys/devices/system/node/nodeXMel Gorman1-0/+7
Add a bare description of what /sys/devices/system/node/nodeX is. Others will follow in time but right now, none of that tree is documented. The existence of this file might at least encourage people to document new entries. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: document /proc/pagetypeinfoMel Gorman1-1/+44
Add documentation for /proc/pagetypeinfo. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: suppress pfn range output for zones without pagesDavid Rientjes1-2/+6
free_area_init_nodes() emits pfn ranges for all zones on the system. There may be no pages on a higher zone, however, due to memory limitations or the use of the mem= kernel parameter. For example: Zone PFN ranges: DMA 0x00000001 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x00100000 The implementation copies the previous zone's highest pfn, if any, as the next zone's lowest pfn. If its highest pfn is then greater than the amount of addressable memory, the upper memory limit is used instead. Thus, both the lowest and highest possible pfn for higher zones without memory may be the same. The pfn range for zones without memory is now shown as "empty" instead. Signed-off-by: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/pm: force GFP_NOIO during suspend/hibernation and resumeRafael J. Wysocki5-5/+41
There are quite a few GFP_KERNEL memory allocations made during suspend/hibernation and resume that may cause the system to hang, because the I/O operations they depend on cannot be completed due to the underlying devices being suspended. Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in gfp_allowed_mask before suspend/hibernation and restoring the original values of these bits in gfp_allowed_mask durig the subsequent resume. [[email protected]: fix CONFIG_PM=n linkage] Signed-off-by: Rafael J. Wysocki <[email protected]> Reported-by: Maxim Levitsky <[email protected]> Cc: Sebastian Ott <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/swapfile.c: fix swapon size off-by-oneHugh Dickins1-13/+18
There's an off-by-one disagreement between mkswap and swapon about the meaning of swap_header last_page: mkswap (in all versions I've looked at: util-linux-ng and BusyBox and old util-linux; probably as far back as 1999) consistently means the offset (in page units) of the last page of the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3) strangely takes it to mean the size (in page units) of the swap area. This disagreement is the safe way round; but it's worrying people, and loses us one page of swap. The fix is not just to add one to nr_good_pages: we need to get maxpages (the size of the swap_map array) right before that; and though that is an unsigned long, be careful not to overflow the unsigned int p->max which later holds it (probably why header uses __u32 last_page instead of size). Why did we subtract one from the maximum swp_offset to calculate maxpages? Though it was probably me who made that change in 2.4.10, I don't get it: and now we should be adding one (without risk of overflow in this case). Fix the handling of swap_header badpages: it could have overrun the swap_map when very large swap area used on a more limited architecture. Remove pre-initializations of swap_header, nr_good_pages and maxpages: those date from when sys_swapon was supporting other versions of header. Reported-by: Nitin Gupta <[email protected]> Reported-by: Jarkko Lavinen <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: remove VM_LOCK_RMAP codeRik van Riel3-31/+0
When a VMA is in an inconsistent state during setup or teardown, the worst that can happen is that the rmap code will not be able to find the page. The mapping is in the process of being torn down (PTEs just got invalidated by munmap), or set up (no PTEs have been instantiated yet). It is also impossible for the rmap code to follow a pointer to an already freed VMA, because the rmap code holds the anon_vma->lock, which the VMA teardown code needs to take before the VMA is removed from the anon_vma chain. Hence, we should not need the VM_LOCK_RMAP locking at all. Signed-off-by: Rik van Riel <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06rmap: move exclusively owned pages to own anon_vma in do_wp_page()Rik van Riel3-0/+32
When the parent process breaks the COW on a page, both the original which is mapped at child and the new page which is mapped parent end up in that same anon_vma. Generally this won't be a problem, but for some workloads it could preserve the O(N) rmap scanning complexity. A simple fix is to ensure that, when a page which is mapped child gets reused in do_wp_page, because we already are the exclusive owner, the page gets moved to our own exclusive child's anon_vma. Signed-off-by: Rik van Riel <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06rmap: remove obsolete check from __page_check_anon_rmap()Rik van Riel1-3/+0
When an anonymous page is inherited from a parent process, the vma->anon_vma can differ from the page anon_vma. This can trip up __page_check_anon_rmap, which is indirectly called from do_swap_page(). Remove that obsolete check to prevent an oops. Signed-off-by: Rik van Riel <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: change anon_vma linking to fix multi-process server scalability issueRik van Riel14-85/+298
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [[email protected]: cleanups] Signed-off-by: Rik van Riel <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/memcontrol.c: fix "integer as NULL pointer" sparse warningThiago Farina1-1/+1
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06include/linux/fs.h: convert FMODE_* constants to hexAndrew Morton1-11/+11
It was tolerable until Eric went and added 8388608. Cc: Eric Paris <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOMWu Fengguang3-1/+18
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM. POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance: a 16K read will be carried out in 4 _sync_ 1-page reads. In other places, ra_pages==0 means - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs - some IO error happened where multi-page read IO won't help or should be avoided. POSIX_FADV_RANDOM actually want a different semantics: to disable the *heuristic* readahead algorithm, and to use a dumb one which faithfully submit read IO for whatever application requests. So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM. Note that the random hint is not likely to help random reads performance noticeably. And it may be too permissive on huge request size (its IO size is not limited by read_ahead_kb). In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall (NFS read) performance of the application increased by 313%! Tested-by: Quentin Barnes <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Steven Whitehouse <[email protected]> Cc: David Howells <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Chuck Lever <[email protected]> Cc: <[email protected]> [2.6.33.x] Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06vfs: take f_lock on modifying f_mode after open timeWu Fengguang2-0/+4
We'll introduce FMODE_RANDOM which will be runtime modified. So protect all runtime modification to f_mode with f_lock to avoid races. Signed-off-by: Wu Fengguang <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Chuck Lever <[email protected]> Cc: <[email protected]> [2.6.33.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/migrate.c: kill anon local variable from migrate_page_copyKOSAKI Motohiro1-4/+0
commit 01b1ae63c2 ("memcg: simple migration handling") removed mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local variable `anon' is now unused. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/mempolicy.c: fix indentation of the comments of do_migrate_pagesKOSAKI Motohiro1-30/+30
Currently, do_migrate_pages() have very long comment and this is not indent properly. I often misunderstand it is function starting commnents and confused it. this patch fixes it. note: this patch doesn't break 80 column rule. I guess original author intended this indentaion, but an accident corrupted it. Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06memory-hotplug: create /sys/firmware/memmap entry for new memory[email protected]3-24/+43
A memmap is a directory in sysfs which includes 3 text files: start, end and type. For example: start: 0x100000 end: 0x7e7b1cff type: System RAM Interface firmware_map_add was not called explicitly. Remove it and add function firmware_map_add_hotplug as hotplug interface of memmap. Each memory entry has a memmap in sysfs, When we hot-add new memory, sysfs does not export memmap entry for it. We add a call in function add_memory to function firmware_map_add_hotplug. Add a new function add_sysfs_fw_map_entry() to create memmap entry, it will be called when initialize memmap and hot-add memory. [[email protected]: un-kernedoc a no longer kerneldoc comment] Signed-off-by: Shaohui Zheng <[email protected]> Acked-by: Andi Kleen <[email protected]> Acked-by: Yasunori Goto <[email protected]> Reviewed-by: Wu Fengguang <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: fix mbind vma merge problemKOSAKI Motohiro1-13/+39
Strangely, current mbind() doesn't merge vma with neighbor vma although it's possible. Unfortunately, many vma can reduce performance... This patch fixes it. reproduced program ---------------------------------------------------------------- #include <numaif.h> #include <numa.h> #include <sys/mman.h> #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> static unsigned long pagesize; int main(int argc, char** argv) { void* addr; int ch; int node; struct bitmask *nmask = numa_allocate_nodemask(); int err; int node_set = 0; char buf[128]; while ((ch = getopt(argc, argv, "n:")) != -1){ switch (ch){ case 'n': node = strtol(optarg, NULL, 0); numa_bitmask_setbit(nmask, node); node_set = 1; break; default: ; } } argc -= optind; argv += optind; if (!node_set) numa_bitmask_setbit(nmask, 0); pagesize = getpagesize(); addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, 0, 0); if (addr == MAP_FAILED) perror("mmap "), exit(1); fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr); /* make page populate */ memset(addr, 0, pagesize*3); /* first mbind */ err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp, nmask->size, MPOL_MF_MOVE_ALL); if (err) error("mbind1 "); /* second mbind */ err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0); if (err) error("mbind2 "); sprintf(buf, "cat /proc/%d/maps", getpid()); system(buf); return 0; } ---------------------------------------------------------------- result without this patch addr = 0x7fe26ef09000 [snip] 7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0 7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0 7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0 7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0 => 0x7fe26ef09000-0x7fe26ef0c000 have three vmas. result with this patch addr = 0x7fc9ebc76000 [snip] 7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0 7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack] => 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma. [[email protected]: fix file offset passed to vma_merge()] Signed-off-by: KOSAKI Motohiro <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: restore zone->all_unreclaimable to independence wordKOSAKI Motohiro4-23/+14
commit e815af95 ("change all_unreclaimable zone member to flags") changed all_unreclaimable member to bit flag. But it had an undesireble side effect. free_one_page() is one of most hot path in linux kernel and increasing atomic ops in it can reduce kernel performance a bit. Thus, this patch revert such commit partially. at least all_unreclaimable shouldn't share memory word with other zone flags. [[email protected]: fix patch interaction] Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Huang Shijie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: remove free_hot_page()Li Hong3-9/+5
free_hot_page() is just a wrapper around free_hot_cold_page() with parameter 'cold = 0'. After adding a clear comment for free_hot_cold_page(), it is reasonable to remove a level of call. [[email protected]: fix build] Signed-off-by: Li Hong <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Li Ming Chun <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Americo Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/page_alloc.c: adjust a call site to trace_mm_page_free_directLi Hong1-1/+1
Move a call of trace_mm_page_free_direct() from free_hot_page() to free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(), as it is done in function __free_pages_ok(). Signed-off-by: Li Hong <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Li Ming Chun <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm/page_alloc.c: remove duplicate call to trace_mm_page_free_directLi Hong1-1/+1
trace_mm_page_free_direct() is called in function __free_pages(). But it is called again in free_hot_page() if order == 0 and produce duplicate records in trace file for mm_page_free_direct event. As below: K-PID CPU# TIMESTAMP FUNCTION gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 This patch removes the first call and adds a call to trace_mm_page_free_direct() in __free_pages_ok(). Signed-off-by: Li Hong <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Li Ming Chun <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>