2011-05-25um: include linux/prefetch.hRichard Weinberger1-0/+2
Fix build failures on UML. Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25um: print info about fatal segfaultsRichard Weinberger1-0/+24
Print a short info about fatal segfaults like other archs do. Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25um: add ucast ethernet transportNolan Leake8-311/+413
The ucast transport is similar to the mcast transport (and, in fact, shares most of its code), only it uses UDP unicast to move packets. Obviously this is only useful for point-to-point connections between virtual ethernet devices. Signed-off-by: Nolan Leake <[email protected]> Signed-off-by: Richard Weinberger <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25um: add earlyprintk supportRichard Weinberger5-0/+50
User Mode Linux can also benefit from earlyprintk. UML's earlyprintk writes kernel messages directly to stdout. Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25um: remove SIGHUP handlerRichard Weinberger1-1/+0
The UML kernel ignores SIGHUP anyway, so this handler is pointless. Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25um: fix UML_LIB_PATHRichard Weinberger3-2/+8
UML_LIB_PATH is hardcoded to /usr/lib/uml/; on 64-bit systems it needs to be /usr/lib64/uml/. Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25cris: convert old cpumask API into new oneKOSAKI Motohiro2-17/+20
Adapt to the new API. [[email protected]: coding-style fixes] Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: Thiago Farina <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mn10300: convert old cpumask API into new oneKOSAKI Motohiro4-63/+68
Adapt to the new API. We plan to remove old cpumask APIs later. Thus this patch converts them into the new one. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: David Howells <[email protected]> Cc: Koichi Yasutake <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Chris Metcalf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25alpha: hook up gpiolib supportMark Brown2-0/+59
Allow people to use gpiolib on Alpha if they want to, mostly for build coverage. The header is a straight copy of that for Microblaze, which in turn was taken from PowerPC. [[email protected]: define GENERIC_GPIO] Signed-off-by: Mark Brown <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Acked-by: Grant Likely <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25alpha: replace with new cpumask APIsKOSAKI Motohiro5-12/+14
We plan to remove cpu_xx() old APIs. Thus convert them. This patch has no functional change. Signed-off-by: KOSAKI Motohiro <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25nommu: add page alignment to mmapBob Liu1-9/+14
Currently on nommu arches mmap(), mremap() and munmap() don't do page alignment, which is inconsistent with MMU arches and causes some issues. First, some drivers' mmap() functions depend on vma->vm_end - vma->vm_start being page aligned, which is true on MMU arches but not on nommu, e.g. the uvc camera driver. Second, munmap() may return -EINVAL [split file] in cases where the end (passed in from userspace) is not page aligned but vma->vm_end is aligned due to a split or the driver's mmap() op. Add page alignment to fix those issues. [[email protected]: coding-style fixes] Signed-off-by: Bob Liu <[email protected]> Cc: David Howells <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
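For illustration, the alignment amounts to something like the following in the nommu mmap paths (a minimal sketch; the exact placement inside mm/nommu.c is an assumption):

	/* round the requested length up to whole pages, as the MMU mmap
	 * path already does, so that vma->vm_end - vma->vm_start is always
	 * page aligned and unaligned ends from userspace still match */
	len = PAGE_ALIGN(len);
	if (len == 0)
		return -EINVAL;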
2011-05-25mm: batch activate_page() to reduce lock contentionShaohua Li1-5/+40
The zone->lru_lock is heavily contended in workloads where activate_page() is frequently used. We can batch activate_page() calls to reduce the lock contention. The batched pages are added to the zone list when the pool is full or when page reclaim tries to drain them.

For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes that share a mapping to the file. Each process reads the whole file and then exits. The process exit does unmap_vmas() and causes a lot of activate_page() calls. In such a workload, we saw about a 58% total time reduction with the patch below. Other workloads with a lot of activate_page() calls benefit a lot too.

Andrew Morton suggested activate_page() and putback_lru_pages() should follow the same path to activate pages, but this is hard to implement (see commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce lock contention")). On the other hand, do we really need putback_lru_pages() to follow the same path? I tested several FIO/FFSB benchmarks (about 20 scripts for each benchmark) on 3 machines here, from 2 sockets to 4 sockets. My tests don't show anything significant with/without the patch (there is a slight difference, but mostly noise which we found even without the patch before). The patch basically returns to the same as my first post.

I tested some microbenchmarks:
	case-anon-cow-rand-mt		 0.58%
	case-anon-cow-rand		-3.30%
	case-anon-cow-seq-mt		-0.51%
	case-anon-cow-seq		-5.68%
	case-anon-r-rand-mt		 0.23%
	case-anon-r-rand		 0.81%
	case-anon-r-seq-mt		-0.71%
	case-anon-r-seq			-1.99%
	case-anon-rx-rand-mt		 2.11%
	case-anon-rx-seq-mt		 3.46%
	case-anon-w-rand-mt		-0.03%
	case-anon-w-rand		-0.50%
	case-anon-w-seq-mt		-1.08%
	case-anon-w-seq			-0.12%
	case-anon-wx-rand-mt		-5.02%
	case-anon-wx-seq-mt		-1.43%
	case-fork			 1.65%
	case-fork-sleep			-0.07%
	case-fork-withmem		 1.39%
	case-hugetlb			-0.59%
	case-lru-file-mmap-read-mt	-0.54%
	case-lru-file-mmap-read		 0.61%
	case-lru-file-mmap-read-rand	-2.24%
	case-lru-file-readonce		-0.64%
	case-lru-file-readtwice		-11.69%
	case-lru-memcg			-1.35%
	case-mmap-pread-rand-mt		 1.88%
	case-mmap-pread-rand		-15.26%
	case-mmap-pread-seq-mt		 0.89%
	case-mmap-pread-seq		-69.72%
	case-mmap-xread-rand-mt		 0.71%
	case-mmap-xread-seq-mt		 0.38%

The most significant are:
	case-lru-file-readtwice		-11.69%
	case-mmap-pread-rand		-15.26%
	case-mmap-pread-seq		-69.72%

which use activate_page() a lot. The others are basically noise; each run differs slightly.

In the UP case, 'size mm/swap.o' before the two patches:
	text	data	bss	dec	hex	filename
	6466	896	4	7366	1cc6	mm/swap.o
after the two patches:
	text	data	bss	dec	hex	filename
	6343	896	4	7243	1c4b	mm/swap.o

Signed-off-by: Shaohua Li <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Hiroyuki Kamezawa <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
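A sketch of the batching scheme, modelled on the existing per-CPU lru_add_pvecs pattern (the exact identifiers are assumptions):

	static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);

	void activate_page(struct page *page)
	{
		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
			struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

			page_cache_get(page);
			/* once the pagevec is full, move the whole batch to
			 * the active list under a single zone->lru_lock
			 * acquisition instead of taking the lock per page */
			if (!pagevec_add(pvec, page))
				pagevec_lru_move_fn(pvec, __activate_page, NULL);
			put_cpu_var(activate_page_pvecs);
		}
	}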
2011-05-25asm-generic/cacheflush.h: flush icache when copying to user pagesMike Frysinger1-1/+4
The copy_to_user_page() function is supposed to flush the icache on the memory that was written, but the current asm-generic version lacks that logic. While normally this isn't a big deal, since the asm-generic icache flush is a stub, it does matter for ports that want to use the asm-generic version as a baseline and then overlay their own specific parts (like icache flushing). Signed-off-by: Mike Frysinger <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
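With the fix, the asm-generic default looks roughly like this (simplified):

	#define copy_to_user_page(vma, page, vaddr, dst, src, len)	\
		do {							\
			memcpy(dst, src, len);				\
			/* the written range may be executed later */	\
			flush_icache_user_range(vma, page, vaddr, len);	\
		} while (0)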
2011-05-25mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath()Andrew Barry1-1/+1
I believe I found a problem in __alloc_pages_slowpath which allows a process to get stuck endlessly looping, even when lots of memory is available. Running an I/O- and memory-intensive stress-test, I see a 0-order page allocation with __GFP_IO and __GFP_WAIT, running on a system with very little free memory. Right about the same time that the stress-test gets killed by the OOM-killer, the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even though most of the system's memory was freed by the oom-kill of the stress-test. The utility ends up looping from the rebalance label down through wait_iff_congested() continuously. Because order=0, __alloc_pages_direct_compact skips the call to get_page_from_freelist. Because all of the reclaimable memory on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with __alloc_pages_may_oom is skipped. The loop hits wait_iff_congested, then jumps back to rebalance without ever trying get_page_from_freelist. This loop repeats infinitely. The test case is pretty pathological: running a mix of I/O stress-tests that do a lot of fork() and consume all of the system memory, I can pretty reliably hit this on 600 nodes in about 12 hours, with 32GB/node. Signed-off-by: Andrew Barry <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: add SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macroDaniel Kiper1-0/+3
Add the SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros, which align a given pfn to the upper and lower section boundary, respectively. Required for the latest memory hotplug support for the Xen balloon driver. Signed-off-by: Daniel Kiper <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
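The new macros look roughly like this, built from the existing PAGES_PER_SECTION and PAGE_SECTION_MASK definitions in mmzone.h:

	/* round pfn up to the next section boundary */
	#define SECTION_ALIGN_UP(pfn)	(((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
	/* round pfn down to the previous section boundary */
	#define SECTION_ALIGN_DOWN(pfn)	((pfn) & PAGE_SECTION_MASK)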
2011-05-25memsw: remove noswapaccount kernel parameterMichal Hocko2-13/+3
The noswapaccount parameter has been deprecated since 2.6.38 without any complaints from users, so we can remove it. swapaccount=0|1 can be used instead. As we are removing the parameter we can also clean up swapaccount: it no longer has to accept an empty string (to match noswapaccount), so we can push '=' into the __setup macro rather than checking for "=1" or "=0" strings. Signed-off-by: Michal Hocko <[email protected]> Cc: Hiroyuki Kamezawa <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
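A sketch of the simplified parser after pushing '=' into __setup() (the variable name is an assumption):

	static int __init enable_swap_account(char *s)
	{
		/* with "swapaccount=" in the __setup string, s is just "0" or "1" */
		if (!strcmp(s, "1"))
			really_do_swap_account = 1;
		else if (!strcmp(s, "0"))
			really_do_swap_account = 0;
		return 1;
	}
	__setup("swapaccount=", enable_swap_account);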
2011-05-25proc: allocate storage for numa_maps statistics onceStephen Wilson1-9/+27
In show_numa_map() we collect statistics into a numa_maps structure. Since the number of NUMA nodes can be very large, this structure is not a candidate for stack allocation. Instead of going through a kmalloc()+kfree() cycle each time show_numa_map() is invoked, perform the allocation just once when /proc/pid/numa_maps is opened. Performing the allocation when numa_maps is opened, and thus before a reference to the target task's mm is taken, eliminates a potential stalemate condition in the oom-killer as originally described by Hugh Dickins: ... imagine what happens if the system is out of memory, and the mm we're looking at is selected for killing by the OOM killer: while we wait in __get_free_page for more memory, no memory is freed from the selected mm because it cannot reach exit_mmap while we hold that reference. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
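A sketch of the allocate-once pattern (struct layout and names are assumptions, error unwinding simplified):

	struct numa_maps_private {
		struct proc_maps_private proc_maps;
		struct numa_maps md;	/* statistics storage, reused by every show */
	};

	static int numa_maps_open(struct inode *inode, struct file *file)
	{
		struct numa_maps_private *priv;
		int ret;

		/* one allocation per open(), before any reference to the
		 * target task's mm is taken, instead of one per show call */
		priv = kzalloc(sizeof(*priv), GFP_KERNEL);
		if (!priv)
			return -ENOMEM;
		ret = seq_open(file, &proc_pid_numa_maps_op);
		if (!ret)
			((struct seq_file *)file->private_data)->private = priv;
		else
			kfree(priv);
		return ret;
	}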
2011-05-25proc: make struct proc_maps_private truly privateStephen Wilson2-8/+8
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps there is no need to export struct proc_maps_private to the world. Move it to fs/proc/internal.h instead. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: proc: move show_numa_map() to fs/proc/task_mmu.cStephen Wilson2-185/+182
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several issues. - Having the show() operation "miles away" from the corresponding seq_file iteration operations is a maintenance burden. - The need to export ad hoc info like struct proc_maps_private is eliminated. - The implementation of show_numa_map() can be improved in a simple manner by cooperating with the other seq_file operations (start, stop, etc) -- something that would be messy to do without this change. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: declare mpol_to_str() when CONFIG_TMPFS=nStephen Wilson1-2/+2
When CONFIG_TMPFS=n mpol_to_str() is not declared in mempolicy.h. However, in the NUMA case, the definition is always compiled. Since it is not strictly true that tmpfs is the only client, and since the symbol was always lurking around anyways, export mpol_to_str() unconditionally. Furthermore, this will allow us to move show_numa_map() out of mempolicy.c and into the procfs subsystem. Signed-off-by: Stephen Wilson <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: remove check_huge_range()Stephen Wilson1-35/+0
This function has been superseded by gather_hugetbl_stats() and is no longer needed. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: make gather_stats() type-safe and remove forward declarationStephen Wilson1-4/+4
Improve the prototype of gather_stats() to take a struct numa_maps as argument instead of a generic void *. Update all callers to make the required type explicit. Since gather_stats() is not needed before its definition and is scheduled to be moved out of mempolicy.c the declaration is removed as well. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: remove MPOL_MF_STATSStephen Wilson1-4/+1
Mapping statistics in a NUMA environment is now computed using the generic walk_page_range() logic. Remove the old/equivalent functionality. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: use walk_page_range() instead of custom page table walking codeStephen Wilson1-7/+68
Converting show_numa_map() to use the generic routine decouples the function from mempolicy.c, allowing it to be moved out of the mm subsystem and into fs/proc. Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk logic implemented by check_pte_range() failed to account for such pages as they were not applicable to the page migration case. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: export get_vma_policy()Stephen Wilson2-1/+4
In commit 48fce3429d ("mempolicies: unexport get_vma_policy()") get_vma_policy() was marked static as all clients were local to mempolicy.c. However, the decision to generate /proc/pid/numa_maps in the numa memory policy code and outside the procfs subsystem introduces an artificial interdependency between the two systems. Exporting get_vma_policy() once again is the first step to clean up this interdependency. Signed-off-by: Stephen Wilson <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: David Rientjes <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: remove last trace of shmem_get_unmapped_areaHugh Dickins1-8/+0
Remove noMMU declaration of shmem_get_unmapped_area() from mm.h: it fell out of use in 2.6.21 and ceased to exist in 2.6.29. Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25tmpfs: implement generic xattr supportEric Paris3-56/+290
Implement generic xattrs for tmpfs filesystems. The Fedora project, while trying to replace suid apps with file capabilities, realized that tmpfs, which is used on the build systems, does not support file capabilities and thus cannot be used to build packages which use file capabilities. Xattrs are also needed for overlayfs. The xattr interface is a bit odd. If a filesystem does not implement any {get,set,list}xattr functions, the VFS will call into some random LSM hooks and the running LSM can then implement some method for handling xattrs. SELinux for example provides a method to support security.selinux but no other security.* xattrs. As it stands today, when one enables CONFIG_TMPFS_POSIX_ACL, tmpfs will have xattr handler routines specifically to handle acls. Because of this, tmpfs would lose the VFS/LSM helpers to support the running LSM. To make up for that, tmpfs had stub functions that did nothing but call into the LSM hooks which implement the helpers. This new patch does not use the LSM fallback functions and instead just implements a native get/set/list xattr feature for the full security.* and trusted.* namespaces like a normal filesystem. This means that tmpfs can now support both security.selinux and security.capability, which was not previously possible. The basic implementation is that I attach a:

	struct shmem_xattr {
		struct list_head list;	/* anchored by shmem_inode_info->xattr_list */
		char *name;
		size_t size;
		char value[0];
	};

into the struct shmem_inode_info for each xattr that is set. This implementation could easily support the user.* namespace as well, except some care needs to be taken to prevent large amounts of unswappable memory being allocated for unprivileged users. [[email protected]: new config option, support trusted.*, support symlinks] Signed-off-by: Eric Paris <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Acked-by: Serge Hallyn <[email protected]> Tested-by: Serge Hallyn <[email protected]> Cc: Kyle McMartin <[email protected]> Acked-by: Hugh Dickins <[email protected]> Tested-by: Jordi Pujol <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high()Yinghai Lu1-23/+0
The bootmem wrapper with memblock supports top-down now, so we no longer need this trick. Signed-off-by: Yinghai LU <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Olaf Hering <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Lucas De Marchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25memblock/nobootmem: allow alloc_bootmem() to take 0 as low limitYinghai Lu1-9/+16
The bootmem wrapper with memblock supports top-down now, so we do not need to set the low limit to __pa(MAX_DMA_ADDRESS). The logic should be: prefer to allocate above __pa(MAX_DMA_ADDRESS), but it is OK if we cannot find memory above 16MB on systems that have only a small amount of RAM. Signed-off-by: Yinghai LU <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Olaf Hering <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Lucas De Marchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
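A sketch of the resulting retry logic in the nobootmem wrapper (control flow simplified; the exact helper signature is an assumption):

	static void * __init ___alloc_bootmem_nopanic(unsigned long size,
				unsigned long align, unsigned long goal,
				unsigned long limit)
	{
		void *ptr;

	restart:
		ptr = __alloc_memory_core_early(NUMA_NO_NODE, size, align,
						goal, limit);
		if (ptr)
			return ptr;
		/* prefer memory above __pa(MAX_DMA_ADDRESS), but retry over
		 * the whole range on machines with only low memory */
		if (goal) {
			goal = 0;
			goto restart;
		}
		return NULL;
	}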
2011-05-25mm: delete non-atomic mm counter implementationMatt Fleming2-43/+10
The problem with having two different types of counters is that developers adding new code need to keep in mind whether it's safe to use both the atomic and non-atomic implementations. For example, when adding new callers of the *_mm_counter() functions, a developer needs to ensure that those paths are always executed with page_table_lock held, in case we're using the non-atomic implementation of mm counters. Hugh Dickins introduced the atomic mm counters in commit f412ac08c986 ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the non-atomic counters around, he said,

	| The only reason was to avoid adding costly atomic operations into a
	| configuration that had no need for them there: the page_table_lock
	| sufficed.
	|
	| Certainly it would be simpler just to delete the non-atomic variant.
	|
	| And I think it's fair to say that any configuration on which we're
	| measuring performance to that degree (rather than "does it boot fast?"
	| type measurements), would already be going the split ptlocks route.

Removing the non-atomic counters eases the maintenance burden because developers no longer have to be mindful of the two implementations when using *_mm_counter(). Note that all architectures provide a means of atomically updating atomic_long_t variables, even if they have to revert to the generic spinlock implementation because they don't support 64-bit atomic instructions (see lib/atomic64.c). Signed-off-by: Matt Fleming <[email protected]> Acked-by: Dave Hansen <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: fail GFP_DMA allocations when ZONE_DMA is not configuredDavid Rientjes1-0/+4
The page allocator will improperly return a page from ZONE_NORMAL even when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller expects DMA memory, perhaps for ISA devices with 16-bit address registers, and may get higher memory resulting in undefined behavior. This patch causes the page allocator to return NULL in such circumstances with a warning emitted to the kernel log on the first occurrence. Signed-off-by: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
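The guard amounts to something like this near the top of the allocator's entry point (exact placement is an assumption):

	#ifndef CONFIG_ZONE_DMA
		/* warn once, then refuse: returning ZONE_NORMAL memory to a
		 * __GFP_DMA caller would be undefined behavior */
		if (WARN_ON_ONCE(gfp_mask & __GFP_DMA))
			return NULL;
	#endif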
2011-05-25mm: do not define PFN_SECTION_SHIFT if !CONFIG_SPARSEMEMDaniel Kiper1-4/+0
Do not define PFN_SECTION_SHIFT if !CONFIG_SPARSEMEM. Signed-off-by: Daniel Kiper <[email protected]> Acked-by: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: pfn_to_section_nr()/section_nr_to_pfn() is valid only in CONFIG_SPARSEMEM contextDaniel Kiper1-3/+3
pfn_to_section_nr()/section_nr_to_pfn() are valid only in a CONFIG_SPARSEMEM context. Move them to the proper place. Signed-off-by: Daniel Kiper <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: enable set_page_section() only if CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAPDaniel Kiper1-6/+8
set_page_section() is meaningful only in a CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP context. Move it to the proper place and amend the functions which use it accordingly. Signed-off-by: Daniel Kiper <[email protected]> Acked-by: Dave Hansen <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: remove dependency on CONFIG_FLATMEM from online_page()Daniel Kiper1-4/+0
online_pages() is only compiled for CONFIG_MEMORY_HOTPLUG_SPARSE, so there is no need to support CONFIG_FLATMEM code within it. This patch removes code that is never used. Signed-off-by: Daniel Kiper <[email protected]> Acked-by: Dave Hansen <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: filter unevictable page out in deactivate_page()Minchan Kim1-0/+7
It's pointless for deactivate_page() to operate on unevictable pages. This patch removes that unnecessary overhead, which could be a problem when there are many unevictable pages in the system (e.g. an mprotect workload). [[email protected]: tidy up comment] Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25readahead: trigger mmap sequential readahead on PG_readaheadWu Fengguang1-4/+2
Previously the mmap sequential readahead is triggered by updating ra->prev_pos on each page fault and comparing it with the current page offset. This costs dirtying the cache line on each _minor_ page fault. In the mosbench exim benchmark, which does multi-threaded page faults on a shared struct file, the ra->mmap_miss and ra->prev_pos updates were found to cause excessive cache line bouncing on tmpfs, which actually disables readahead totally (shmem_backing_dev_info.ra_pages == 0). So remove the ra->prev_pos recording, and instead tag PG_readahead to trigger the possible sequential readahead. It's not only simpler, but also works more reliably and reduces cache line bouncing on concurrent page faults on a shared struct file. Signed-off-by: Wu Fengguang <[email protected]> Tested-by: Tim Chen <[email protected]> Reported-by: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25readahead: reduce unnecessary mmap_miss increasesAndi Kleen1-1/+2
The original INT_MAX is too large; reduce it to:

 - avoid unnecessarily dirtying/bouncing the cache line
 - restore mmap read-around faster on a changed access pattern

Background: in the mosbench exim benchmark, which does multi-threaded page faults on a shared struct file, the ra->mmap_miss updates were found to cause excessive cache line bouncing on tmpfs. The ra state updates are needless for tmpfs because it disables readahead totally (shmem_backing_dev_info.ra_pages == 0). Tested-by: Tim Chen <[email protected]> Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25readahead: return early when readahead is disabledWu Fengguang1-6/+6
Reduce readahead overheads by returning early in do_sync_mmap_readahead(). tmpfs has ra_pages=0 and it can page fault really fast (not constraint by IO if not swapping). Signed-off-by: Wu Fengguang <[email protected]> Tested-by: Tim Chen <[email protected]> Reported-by: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
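The early return, roughly (function body abbreviated):

	static void do_sync_mmap_readahead(struct vm_area_struct *vma,
					   struct file_ra_state *ra,
					   struct file *file, pgoff_t offset)
	{
		/* no readahead at all, e.g. tmpfs sets ra_pages == 0 */
		if (!ra->ra_pages)
			return;

		/* ... readahead heuristics follow ... */
	}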
2011-05-25vmscan: change shrinker API by passing shrink_control structYing Han21-61/+95
Change each shrinker's API by consolidating the existing parameters into a shrink_control struct. This will simplify adding further features without having to touch every shrinker file. [[email protected]: fix build] [[email protected]: fix warning] [[email protected]: fix up new shrinker API] [[email protected]: fix xfs warning] [[email protected]: update gfs2] Signed-off-by: Ying Han <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Steven Whitehouse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
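The consolidated argument block and the resulting callback signature look roughly like this (field set as described above; bookkeeping fields elided):

	struct shrink_control {
		gfp_t gfp_mask;
		/* how many slab objects the shrinker should scan and
		 * try to reclaim on this call */
		unsigned long nr_to_scan;
	};

	struct shrinker {
		/* one struct argument instead of loose parameters, so new
		 * fields can be added without touching every shrinker */
		int (*shrink)(struct shrinker *, struct shrink_control *sc);
		int seeks;	/* seeks to recreate an object */
	};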
2011-05-25vmscan: change shrink_slab() interfaces by passing shrink_controlYing Han4-17/+55
Consolidate the existing parameters to shrink_slab() into a new shrink_control struct. This is needed later to pass the same struct to shrinkers. Signed-off-by: Ying Han <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25readahead: readahead page allocations are OK to failWu Fengguang2-1/+7
Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations. readahead page allocations are completely optional. They are OK to fail and in particular shall not trigger OOM on themselves. Reported-by: Dave Young <[email protected]> Reviewed-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Wu Fengguang <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
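This is done via a dedicated allocation helper, roughly:

	static inline struct page *page_cache_alloc_readahead(struct address_space *x)
	{
		/* readahead pages are optional: don't retry hard, don't
		 * warn, and never trigger the OOM killer on their behalf */
		return __page_cache_alloc(mapping_gfp_mask(x) |
					  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
	}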
2011-05-25mm: check if any page in a pageblock is reserved before marking it MIGRATE_RESERVEArve Hjønnevåg1-2/+17
This fixes a problem where the first pageblock got marked MIGRATE_RESERVE even though it only had a few free pages. For example, on the current ARM port, the kernel starts at offset 0x8000 to leave room for boot parameters, and that memory is freed later. This in turn caused no contiguous memory to be reserved and frequent kswapd wakeups that emptied the caches to get more contiguous memory. Unfortunately, ARM needs order-2 allocations for pgds (see arm/mm/pgd.c#pgd_alloc()), so the issue is neither minor nor easily avoidable. [[email protected]: added some explanation] [[email protected]: add !pfn_valid_within() to check] [[email protected]: check end_pfn in pageblock_is_reserved] Signed-off-by: John Stultz <[email protected]> Signed-off-by: Arve Hjønnevåg <[email protected]> Signed-off-by: KOSAKI Motohiro <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
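A sketch of the new check, reconstructed from the description above (simplified):

	static bool pageblock_is_reserved(unsigned long start_pfn,
					  unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			/* treat holes as reserved too: a pageblock with an
			 * invalid pfn cannot safely serve MIGRATE_RESERVE */
			if (!pfn_valid_within(pfn) ||
			    PageReserved(pfn_to_page(pfn)))
				return true;
		}
		return false;
	}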
2011-05-25m32r, mm: set all online nodes in N_NORMAL_MEMORYDavid Rientjes1-0/+1
For m32r, N_NORMAL_MEMORY represents all nodes that have present memory since it does not support HIGHMEM. This patch sets the bit at the time the node is initialized. If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it uses this nodemask to setup per-cache kmem_cache_node data structures. Signed-off-by: David Rientjes <[email protected]> Cc: Hirokazu Takata <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25alpha, mm: set all online nodes in N_NORMAL_MEMORYDavid Rientjes1-0/+1
For alpha, N_NORMAL_MEMORY represents all nodes that have present memory since it does not support HIGHMEM. This patch sets the bit at the time the node is initialized. If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it uses this nodemask to setup per-cache kmem_cache_node data structures. Signed-off-by: David Rientjes <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
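On both m32r and alpha the fix is essentially a one-liner in the arch's memory init path (the exact location is per-arch):

	/* every online node has normal memory on !HIGHMEM architectures,
	 * so record that for slub's kmem_cache_node setup */
	node_set_state(nid, N_NORMAL_MEMORY);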
2011-05-25mm: strictly require elevated page refcount in isolate_lru_page()Konstantin Khlebnikov1-1/+4
isolate_lru_page() must be called only with a stable reference to the page; this is what is written in the comment above it, and it is reasonable. Current isolate_lru_page() users and their extra page reference sources:

 mm/huge_memory.c:
  __collapse_huge_page_isolate()	- reference from pte

 mm/memcontrol.c:
  mem_cgroup_move_parent()		- get_page_unless_zero()
  mem_cgroup_move_charge_pte_range()	- reference from pte

 mm/memory-failure.c:
  soft_offline_page()			- fixed, reference from get_any_page()
  delete_from_lru_cache()		- reference from caller or get_page_unless_zero()
	[this looks like a bug, because __memory_failure() can call page_action() for hugepage tails, but it is OK for isolate_lru_page(): the tail has a reference taken and is not on the LRU]

 mm/memory_hotplug.c:
  do_migrate_range()			- fixed, get_page_unless_zero()

 mm/mempolicy.c:
  migrate_page_add()			- reference from pte

 mm/migrate.c:
  do_move_page_to_node_array()		- reference from follow_page()

 mm/mlock.c:				- various external references

 mm/vmscan.c:
  putback_lru_page()			- reference from isolate_lru_page()

It seems that all isolate_lru_page() users are now ready for this restriction. So, let's replace the redundant get_page_unless_zero() with get_page() and add an initial page reference count check with VM_BUG_ON(). Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Andi Kleen <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
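A sketch of the tightened entry check (surrounding isolation logic elided):

	int isolate_lru_page(struct page *page)
	{
		int ret = -EBUSY;

		/* the caller must already hold a reference; a zero refcount
		 * here is now a bug rather than a tolerated race */
		VM_BUG_ON(!page_count(page));

		if (PageLRU(page)) {
			/* was get_page_unless_zero(page); the stable-reference
			 * contract makes plain get_page() sufficient */
			get_page(page);
			/* ... lru_lock handling and list removal elided ... */
			ret = 0;
		}
		return ret;
	}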
2011-05-25mem-hwpoison: fix page refcount around isolate_lru_page()Konstantin Khlebnikov1-5/+6
Drop first page reference only after calling isolate_lru_page() to keep page stable reference while isolating. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Andi Kleen <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mem-hotplug: call isolate_lru_page with elevated refcountKonstantin Khlebnikov1-1/+3
isolate_lru_page() must be called only with stable reference to page. So, let's grab normal page reference. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Andi Kleen <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: print vmalloc() state after allocation failuresDave Hansen1-2/+7
I was tracking down a page allocation failure that ended up in vmalloc(). Since vmalloc() uses 0-order pages, if somebody asks for an insane amount of memory we'll still get a warning with "order:0" in it. That's not very useful. During recovery, vmalloc() also nicely frees all of the memory that it got up to the point of the failure. That is wonderful, but it also quickly hides any issues. We have a much different situation if vmalloc() repeatedly fails 10GB in to a vmalloc(100 * 1<<30) versus repeatedly failing 4096 bytes in to a vmalloc(8192). This patch will print out messages that look like this:

 [ 68.123503] vmalloc: allocation failure, allocated 6680576 of 13426688 bytes
 [ 68.124218] bash: page allocation failure: order:0, mode:0xd2
 [ 68.124811] Pid: 3770, comm: bash Not tainted 2.6.39-rc3-00082-g85f2e68-dirty #333
 [ 68.125579] Call Trace:
 [ 68.125853] [<ffffffff810f6da6>] warn_alloc_failed+0x146/0x170
 [ 68.126464] [<ffffffff8107e05c>] ? printk+0x6c/0x70
 [ 68.126791] [<ffffffff8112b5d4>] ? alloc_pages_current+0x94/0xe0
 [ 68.127661] [<ffffffff8111ed37>] __vmalloc_node_range+0x237/0x290
 ...

The 'order' variable is added for clarity when calling warn_alloc_failed(), to avoid having an unexplained '0' as an argument. The 'tmp_mask' exists because adding an open-coded '| __GFP_NOWARN' would take us over 80 columns for the alloc_pages_node() call; if we are going to add a line, it might as well be one that makes the sucker easier to read. As a side issue, I also noticed that ctl_ioctl() does vmalloc() based solely on an unverified value passed in from userspace. Granted, it's under CAP_SYS_ADMIN, but it still frightens me a bit. Signed-off-by: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Nazarewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2011-05-25mm: break out page allocation warning codeDave Hansen2-21/+43
This originally started as a simple patch to give vmalloc() some more verbose output on failure on top of the plain page allocator messages. Johannes suggested that it might be nicer to lead with the vmalloc() info _before_ the page allocator messages. But, I do think there's a lot of value in what __alloc_pages_slowpath() does with its filtering and so forth. This patch creates a new function which other allocators can call instead of relying on the internal page allocator warnings. It also gives this function private rate-limiting which separates it from other printk_ratelimit() users. Signed-off-by: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Nazarewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>