|
Andrea Arcangeli pointed out to me that a check in __memory_failure()
which was intended to prevent THP tail pages from being checked for the
absence of the PG_lru flag (something that is always the case), was also
preventing THP head pages from being checked.
A THP head page could actually benefit from the call to shake_page() by
ending up being put back to a LRU, provided it had been waiting in a
pagevec array.
Andrea suggested that the "!PageTransCompound(p)" in the if-statement
should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
to be checked and possibly shaken.
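The effect of the changed condition can be modelled in user space. This is a toy sketch, not kernel code: struct toy_page and the toy_* helpers are hypothetical stand-ins for struct page and for the PageTransCompound()/PageTransTail() predicates.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel layout. */
enum toy_kind { TOY_BASE, TOY_THP_HEAD, TOY_THP_TAIL };

struct toy_page {
	enum toy_kind kind;
};

/* Models PageTransCompound(): true for both THP head and tail pages. */
static bool toy_trans_compound(const struct toy_page *p)
{
	return p->kind != TOY_BASE;
}

/* Models PageTransTail(): true only for THP tail pages. */
static bool toy_trans_tail(const struct toy_page *p)
{
	return p->kind == TOY_THP_TAIL;
}

/* Old condition: THP head pages never reach the PG_lru check. */
static bool toy_old_checks_lru(const struct toy_page *p)
{
	return !toy_trans_compound(p);
}

/* New condition: only tail pages are exempt from the PG_lru check. */
static bool toy_new_checks_lru(const struct toy_page *p)
{
	return !toy_trans_tail(p);
}
```

Only the head page case differs between the two conditions; base pages and tail pages are treated the same as before.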
Signed-off-by: Dean Nelson <[email protected]>
Cc: Jin Dongming <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Hidetoshi Seto <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Adds to the generic xattr support introduced in Linux 3.0 by implementing
the initxattrs callback. This enables consulting security attributes from
LSM and EVM when an inode is created.
[[email protected]: moved under CONFIG_TMPFS_XATTR, with memcpy in shmem_xattr_alloc]
Signed-off-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: James Morris <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The oom killer chooses not to kill a thread if:
- an eligible thread has already been oom killed and has yet to exit,
and
- an eligible thread is exiting but has yet to free all its memory and
is not the thread attempting to currently allocate memory.
SysRq+F manually invokes the global oom killer to kill a memory-hogging
task. This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.
For both uses, we always want to kill a thread and never defer. This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.
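The deferral rules and the forced path can be sketched as a toy selection loop. Everything here is an assumption made for illustration: struct toy_task, its fields and toy_select_victim() are hypothetical, and victim scoring is reduced to "largest rss wins".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record; rss is used as a crude badness score. */
struct toy_task {
	int pid;
	unsigned long rss;
	bool already_killed;	/* oom killed, has yet to exit */
	bool exiting;		/* exiting, memory not yet freed */
};

/*
 * Returns the task to kill, or NULL when the kill is deferred.
 * With force_kill set (the SysRq+F case) the deferral checks are skipped,
 * so a victim is returned whenever at least one task exists.
 */
static const struct toy_task *
toy_select_victim(const struct toy_task *tasks, size_t n,
		  const struct toy_task *current_task, bool force_kill)
{
	const struct toy_task *victim = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct toy_task *t = &tasks[i];

		if (!force_kill) {
			if (t->already_killed)
				return NULL;
			if (t->exiting && t != current_task)
				return NULL;
		}
		if (!victim || t->rss > victim->rss)
			victim = t;
	}
	return victim;
}
```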
Signed-off-by: David Rientjes <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Stack for a new thread is mapped by userspace code and passed via
sys_clone. This memory is currently seen as anonymous in
/proc/<pid>/maps, which makes it difficult to ascertain which mappings
are being used for thread stacks. This patch uses the individual task
stack pointers to determine which vmas are actually thread stacks.
For a multithreaded program like the following:
#include <pthread.h>

void *thread_main(void *foo)
{
	while (1)
		;
}

int main()
{
	pthread_t t;

	pthread_create(&t, NULL, thread_main, NULL);
	pthread_join(t, NULL);
}
/proc/PID/maps looks like the following:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
the vma just before it (7f8a44491000-7f8a44492000) has no permissions, but
that is not always a reliable way to find out which vma is a thread
stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps have the same
content.
With this patch in place, /proc/PID/task/TID/maps is treated as 'maps
as the task would see it' and hence, only the vma that the task uses as
its stack is marked as [stack]. All other 'stack' vmas are marked as
anonymous memory. /proc/PID/maps acts as a thread group level view,
where all thread stack vmas are marked as [stack:TID], where TID is the
thread ID of the task that uses that vma as its stack, while the process
stack is marked as [stack].
So /proc/PID/maps will look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
This marks all vmas that are used as stacks by the threads in the
thread group, along with the process stack. The task level maps will,
however, look like this:
00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out
00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out
019ef000-01a10000 rw-p 00000000 00:00 0 [heap]
7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so
7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so
7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so
7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
where only the vma that is being used as a stack by *that* task is
marked as [stack].
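A user-space consumer could pick out the new annotations with a small helper like the one below. It is only a sketch: maps_line_stack_tid() is a hypothetical name, and it assumes the "[stack]" and "[stack:TID]" spellings shown in the listings above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Classify one line of /proc/PID/maps.
 * Returns 1 and stores the TID for a "[stack:TID]" line, returns 1 with
 * *tid set to 0 for the plain "[stack]" marker, and returns 0 otherwise.
 */
static int maps_line_stack_tid(const char *line, int *tid)
{
	const char *tag = strstr(line, "[stack");

	if (!tag)
		return 0;
	if (strncmp(tag, "[stack]", 7) == 0) {
		*tid = 0;
		return 1;
	}
	if (sscanf(tag, "[stack:%d]", tid) == 1)
		return 1;
	return 0;
}
```

Feeding it each line read from /proc/PID/maps gives the thread group view; the same helper applied to /proc/PID/task/TID/maps only ever sees the plain "[stack]" form.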
Analogous changes have been made to /proc/PID/smaps,
/proc/PID/numa_maps, /proc/PID/task/TID/smaps and
/proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
numa_maps:
[siddhesh@localhost ~ ]$ pgrep a.out
1441
[siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442]
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack]
[siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
7fff6273a000 default stack anon=3 dirty=3 N0=3
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
7f8a44492000 default stack anon=2 dirty=2 N0=2
[siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
7fff6273a000 default stack anon=3 dirty=3 N0=3
[[email protected]: checkpatch fixes]
[[email protected]: fix build]
Signed-off-by: Siddhesh Poyarekar <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jamie Lokier <[email protected]>
Cc: Mike Frysinger <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When unmapping a given VM range, we could bail out if a reference page is
supplied and is unmapped, which is a minor optimization.
Signed-off-by: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The behavior of THP can be toggled either through sysfs at runtime or
with the kernel cmdline parameter 'transparent_hugepage='. Document the
latter in kernel-parameters.txt.
Signed-off-by: Jiri Kosina <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When shrinking the inactive lru list, isolated pages are queued on a local
private list, so the lock-hold time can be reduced if pages are counted
without lock protection.
To achieve that, firstly, updating the reclaim stat is delayed until the
putback stage, after reacquiring the lru lock.
Secondly, operations related to vm and zone stats are now protected with
preemption disabled as they are per-cpu operations.
Signed-off-by: Hillf Danton <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This declaration is not used anymore, remove it.
Signed-off-by: Xiao Guangrong <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reduce code duplication by calling anon_vma_chain_link() from
anon_vma_prepare().
Also move anon_vma_chain_link() to a more suitable location in the file.
Signed-off-by: Kautuk Consul <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When gathering surplus pages, the number of needed pages is recomputed
after reacquiring hugetlb lock to catch changes in resv_huge_pages and
free_huge_pages. It is also recomputed taking the number of newly allocated
pages into account.
Thus freeing pages can be deferred a bit to see if the final page request
is satisfied, even though fewer pages than needed may have been allocated.
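The recomputation can be written as a one-line formula. The sketch below assumes it has the form (reserved + requested) - (free + newly allocated); struct toy_hstate and toy_pages_needed() are hypothetical names that only borrow the two counters mentioned above.

```c
/* Hypothetical snapshot of the two counters named in the text. */
struct toy_hstate {
	long resv_huge_pages;
	long free_huge_pages;
};

/*
 * Pages still missing for a request of 'delta' pages, given that
 * 'allocated' new pages are already held on a private list.
 * A result <= 0 means the request is satisfied.
 */
static long toy_pages_needed(const struct toy_hstate *h, long delta,
			     long allocated)
{
	return (h->resv_huge_pages + delta) -
	       (h->free_huge_pages + allocated);
}
```

Calling it again after the lock is retaken picks up any change in the counters, which is why freeing the surplus can wait until the final value is known.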
Signed-off-by: Hillf Danton <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Stuart Foster reported on bugzilla that copying large amounts of data
from NTFS caused an OOM kill on 32-bit X86 with 16G of memory. Andrew
Morton correctly identified that the problem was that NTFS was using
512-byte blocks, meaning each page had 8 buffer_heads in low memory pinning it.
In the past, direct reclaim used to scan highmem even if the allocating
process did not specify __GFP_HIGHMEM, but it no longer does. kswapd will
no longer reclaim from zones that are above the high watermark. The intention
in both cases was to minimise unnecessary reclaim. The downside is that, on
machines with large amounts of highmem, lowmem can be fully consumed
by buffer_heads with nothing trying to free them.
The following patch is based on a suggestion by Andrew Morton to extend
the buffer_heads_over_limit case to force kswapd and direct reclaim to
scan the highmem zone regardless of the allocation request or watermarks.
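The arithmetic and the resulting policy can be sketched as follows. Both helpers are hypothetical, and the example assumes 4096-byte pages with 512-byte filesystem blocks, one buffer_head per block.

```c
#include <stdbool.h>

/* One buffer_head tracks one filesystem block within a page. */
static unsigned int toy_buffer_heads_per_page(unsigned int page_size,
					      unsigned int block_size)
{
	return page_size / block_size;
}

/*
 * Toy decision: a highmem zone is scanned if the allocation allows
 * highmem, or if too many buffer_heads are pinning lowmem.
 */
static bool toy_scan_highmem(bool gfp_allows_highmem,
			     bool buffer_heads_over_limit)
{
	return gfp_allows_highmem || buffer_heads_over_limit;
}
```

With these numbers every cached highmem page costs eight lowmem buffer_heads, so the over-limit case has to override both the allocation request and the watermark check.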
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578
[[email protected]: move buffer_heads_over_limit check up]
[[email protected]: buffer_heads_over_limit is unlikely]
Reported-by: Stuart Foster <[email protected]>
Tested-by: Stuart Foster <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
for checking config options in C expressions.
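For readers unfamiliar with the idea, here is a user-space sketch of one way such macros can be built with token pasting. It is not claimed to be the implementation from that commit; all TOY_* and CONFIG_TOY_* names are hypothetical. An option counts as built in when it is defined to 1, and as modular when its _MODULE variant is defined to 1.

```c
/* If 'val' is 1 this pastes to TOY_PLACEHOLDER_1, which expands to "0,". */
#define TOY_PLACEHOLDER_1 0,
#define TOY_SECOND(ignored, val, ...) val
#define TOY_IS_DEFINED(x) TOY_IS_DEFINED_(x)
#define TOY_IS_DEFINED_(val) TOY_IS_DEFINED__(TOY_PLACEHOLDER_##val)
/* Defined:   TOY_SECOND(0, 1, 0, 0)      -> 1
 * Undefined: TOY_SECOND(junk 1, 0, 0)    -> 0 */
#define TOY_IS_DEFINED__(arg1_or_junk) TOY_SECOND(arg1_or_junk 1, 0, 0)

#define TOY_IS_BUILTIN(option) TOY_IS_DEFINED(option)
#define TOY_IS_MODULE(option) TOY_IS_DEFINED(option##_MODULE)
#define TOY_IS_ENABLED(option) \
	(TOY_IS_BUILTIN(option) || TOY_IS_MODULE(option))

/* Example configuration: one built-in option, one modular option. */
#define CONFIG_TOY_BUILTIN 1
#define CONFIG_TOY_AS_MODULE_MODULE 1
/* CONFIG_TOY_DISABLED is intentionally left undefined. */

/* Plain C expression instead of #ifdef blocks; dead branches fold away. */
static int toy_feature_mask(void)
{
	int mask = 0;

	if (TOY_IS_ENABLED(CONFIG_TOY_BUILTIN))
		mask |= 1;
	if (TOY_IS_ENABLED(CONFIG_TOY_AS_MODULE))
		mask |= 2;
	if (TOY_IS_ENABLED(CONFIG_TOY_DISABLED))
		mask |= 4;
	return mask;
}
```

Because the option name appears literally in the expression, a grep for it finds every use, which is the property the commit message highlights.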
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the local variable holding a pagemap entry in pagemap_pte_range()
is named pfn and typed as u64, but that is not correct (a pfn should be
unsigned long).
This patch introduces a special type for pagemap entries and converts the
code to use it.
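The idea of a dedicated entry type can be sketched in user space. The toy_* names are hypothetical; the bit positions assume the documented pagemap format in which bit 63 means "present" and bits 0-54 hold the pfn of a present page.

```c
#include <stdint.h>

/* Wrapper type so a raw pagemap entry cannot be mistaken for a pfn. */
typedef struct {
	uint64_t pme;
} toy_pagemap_entry_t;

#define TOY_PM_PRESENT	(UINT64_C(1) << 63)
#define TOY_PM_PFN_MASK	((UINT64_C(1) << 55) - 1)

/* Build an entry from its raw 64-bit encoding. */
static toy_pagemap_entry_t toy_make_pme(uint64_t val)
{
	return (toy_pagemap_entry_t){ .pme = val };
}

/* Non-zero if the entry describes a page present in RAM. */
static int toy_pme_present(toy_pagemap_entry_t e)
{
	return (e.pme & TOY_PM_PRESENT) != 0;
}

/* Extract the pfn field; only meaningful for present entries. */
static uint64_t toy_pme_pfn(toy_pagemap_entry_t e)
{
	return e.pme & TOY_PM_PFN_MASK;
}
```

Wrapping the value in a struct makes the compiler reject accidental assignments between an entry and a plain pfn.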
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
page-types, which is a common user of pagemap, becomes aware of thp with
this patch. This helps system admins and kernel hackers understand how thp
works. Here is a sample output of page-types over a thp:
$ page-types -p <pid> --raw --list
voffset offset len flags
...
7f9d40200 3f8400 1 ___U_lA____Ma_bH______t____________
7f9d40201 3f8401 1ff ________________T_____t____________
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000410000 511 1 ________________T_____t____________ compound_tail,thp
0x000000000040d868 1 0 ___U_lA____Ma_bH______t____________ uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp
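The two flag words in the sample output can be decoded with a few bit tests. This is a sketch with hypothetical toy_* names; the bit numbers (15 for compound_head, 16 for compound_tail, 22 for thp) are read off the sample values above.

```c
#include <stdint.h>

#define TOY_KPF_COMPOUND_HEAD	15
#define TOY_KPF_COMPOUND_TAIL	16
#define TOY_KPF_THP		22

/* Test a single bit of a page flags word. */
static int toy_flag_set(uint64_t flags, int bit)
{
	return (int)((flags >> bit) & 1);
}

/* 'H' for a THP head page, 'T' for a THP tail page, '-' otherwise. */
static char toy_thp_role(uint64_t flags)
{
	if (!toy_flag_set(flags, TOY_KPF_THP))
		return '-';
	if (toy_flag_set(flags, TOY_KPF_COMPOUND_HEAD))
		return 'H';
	if (toy_flag_set(flags, TOY_KPF_COMPOUND_TAIL))
		return 'T';
	return '-';
}
```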
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Wu Fengguang <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing the physical address of a thp.
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, when we check whether we can handle a thp as it is or need to
split it into regular sized pages, we take the page table lock before
checking whether a given pmd is mapping a thp or not. Because of this, when
it's not a "huge pmd" we suffer from unnecessary lock/unlock overhead. To
remove it, this patch introduces an optimized check function and replaces
several similar pieces of logic with it.
[[email protected]: checkpatch fixes]
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Jiri Slaby <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Thp split is not necessary if we explicitly check whether pmds are mapping
thps or not. This patch introduces this check and adds code to generate
pagemap entries for pmds mapping thps, which reduces the performance
impact of pagemap on thp.
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the required size is bigger than cached_hole_size it is better to
search from free_area_cache - it is easier to get a free region,
specifically for the 64 bit process whose address space is large enough.
Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c
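The rule for choosing where a bottom-up search begins can be sketched as below. struct toy_mm and toy_search_start() are hypothetical; only the two hint names are taken from the text.

```c
/* Toy model of the two per-mm search hints named in the text. */
struct toy_mm {
	unsigned long free_area_cache;	/* where the last search ended */
	unsigned long cached_hole_size;	/* largest hole known below that */
};

/*
 * Pick the start address for a request of 'len' bytes. If no known hole
 * below free_area_cache can hold the request, restarting from 'base'
 * would only rescan mappings that are already known not to fit.
 */
static unsigned long toy_search_start(const struct toy_mm *mm,
				      unsigned long len, unsigned long base)
{
	if (len > mm->cached_hole_size)
		return mm->free_area_cache;
	return base;
}
```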
Signed-off-by: Xiao Guangrong <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the current code, cached_hole_size is set to the maximum value if the
unmapped vma is less than free_area_cache so the next search will search
from the base address.
Actually, we can keep cached_hole_size so that if the next required size
is more than cached_hole_size, it can search from free_area_cache.
Signed-off-by: Xiao Guangrong <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Search again only if some holes may be skipped in the first pass.
[[email protected]: clean up crazy compound definition]
Signed-off-by: Xiao Guangrong <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use/update cached_hole_size and free_area_cache properly to speedup
finding of a free region.
Signed-off-by: Xiao Guangrong <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
"order" is -1 when compacting via /proc/sys/vm/compact_memory. Making
it unsigned causes a bug in __compact_pgdat() when we test:
	if (cc->order < 0 || !compaction_deferred(zone, cc->order))
		compact_zone(zone, cc);
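The underlying C behaviour is easy to reproduce in isolation. In this sketch the struct and function names are hypothetical; it only shows that -1 stored in an unsigned field reads back as a large positive value, so a "< 0" test on it can never succeed.

```c
/* Two hypothetical variants of the control structure's order field. */
struct toy_cc_unsigned {
	unsigned int order;
};

struct toy_cc_signed {
	int order;
};

/* Widen without changing the value, then test the sign. */
static int toy_unsigned_order_is_negative(const struct toy_cc_unsigned *cc)
{
	long long widened = cc->order;	/* -1 was stored as UINT_MAX */

	return widened < 0;
}

/* With a signed field the "-1 means compact everything" test works. */
static int toy_signed_order_is_negative(const struct toy_cc_signed *cc)
{
	return cc->order < 0;
}
```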
[[email protected]: make __compact_pgdat()'s comparison match other code sites]
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I get this lockdep warning from swapping load on linux-next, due to
"vmscan: kswapd carefully call compaction".
=================================
[ INFO: inconsistent lock state ]
3.3.0-rc2-next-20120201 #5 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
(pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff81099b75>] mark_held_locks+0xd7/0x103
[<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
[<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
[<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
[<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
[<ffffffff810d679f>] pcpu_alloc+0x182/0x325
[<ffffffff810d694d>] __alloc_percpu+0xb/0xd
[<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
[<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
[<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
[<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
[<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
[<ffffffff8185d044>] inet_init+0x1ad/0x252
[<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
[<ffffffff81832bc5>] kernel_init+0x9d/0x11e
[<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
irq event stamp: 656613
hardirqs last enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
softirqs last enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(pcpu_alloc_mutex);
<Interrupt>
lock(pcpu_alloc_mutex);
*** DEADLOCK ***
no locks held by kswapd0/28.
stack backtrace:
Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
Call Trace:
[<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
[<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
[<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
[<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
[<ffffffff81098684>] mark_lock+0x251/0x331
[<ffffffff81098893>] mark_irqflags+0x12f/0x141
[<ffffffff81098e32>] __lock_acquire+0x58d/0x753
[<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
[<ffffffff81099433>] lock_acquire+0x54/0x6a
[<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
[<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
[<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
[<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
[<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
[<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
[<ffffffff810d6684>] pcpu_alloc+0x67/0x325
[<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
[<ffffffff810d694d>] __alloc_percpu+0xb/0xd
[<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
[<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
[<ffffffff810f126f>] __compact_pgdat+0x20/0x182
[<ffffffff810f15c2>] compact_pgdat+0x27/0x29
[<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
[<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
[<ffffffff810ce0ed>] kswapd+0x15f/0x178
[<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
[<ffffffff8106fd11>] kthread+0x84/0x8c
[<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
[<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
[<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
[<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
[<ffffffff814e51e0>] ? gs_change+0xb/0xb
The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
Nick hacked into lockdep a while back: I think we're intended to read that
"<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
I'm hazy, I have not reached any conclusion as to whether it's right to
complain or not; but I believe it's uneasy about kswapd now doing the
mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails. Nor have
I reached any conclusion as to whether it's important for kswapd to do
that draining or not.
But so as not to get blocked on this, with lockdep disabled from giving
further reports, here's a patch which removes the lru_add_drain_all() from
kswapd's callpath (and calls it only once from compact_nodes(), instead of
once per node).
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently a failed order-9 (transparent hugepage) compaction can lead to
memory compaction being temporarily disabled for a memory zone, even if
we only need compaction for an order 2 allocation, e.g. for jumbo frames
networking.
The fix is relatively straightforward: keep track of the highest order at
which compaction is succeeding, and only defer compaction for orders at
which compaction is failing.
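A per-order deferral check might look like the sketch below. It is a simplified model under stated assumptions: struct toy_zone and its fields are hypothetical, and the bookkeeping is reduced to remembering the lowest order at which compaction has failed.

```c
#include <stdbool.h>

struct toy_zone {
	int compact_order_failed;	/* lowest order that has failed */
	bool deferral_active;		/* some deferral is in effect */
};

/* Record a failed compaction attempt at the given order. */
static void toy_defer_compaction(struct toy_zone *z, int order)
{
	z->deferral_active = true;
	if (order < z->compact_order_failed)
		z->compact_order_failed = order;
}

/* Orders below the failing order are never deferred. */
static bool toy_compaction_deferred(const struct toy_zone *z, int order)
{
	if (order < z->compact_order_failed)
		return false;
	return z->deferral_active;
}
```

After a failed order-9 attempt, an order-2 request is still allowed to compact, which is the case the commit message is concerned with.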
Signed-off-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
free pages, even when it is woken for a higher order request.
This could be bad for e.g. jumbo frame network allocations, which are done
from interrupt context and cannot compact memory themselves. Higher
allocation failure rates than before have been observed in the network
receive path in kernels with compaction enabled.
Teach kswapd to defragment the memory zones in a node, but only if
required and compaction is not deferred in a zone.
[[email protected]: reduce scope of zones_need_compaction]
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When built with CONFIG_COMPACTION, kswapd should not try to free
contiguous pages, because it is not trying hard enough to have a real
chance at being successful, but still disrupts the LRU enough to break
other things.
Do not do higher order page isolation unless we really are in lumpy
reclaim mode.
Stop reclaiming pages once we have enough free pages that compaction can
deal with things, and we hit the normal order 0 watermarks used by kswapd.
Also remove a line of code that increments balanced right before exiting
the function.
Signed-off-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Ever since abandoning the virtual scan of processes, for scalability
reasons, swap space has been a little more fragmented than before. This
can lead to the situation where a large memory user is killed, swap space
ends up full of "holes" and swapin readahead is totally ineffective.
On my home system, after killing a leaky firefox it took over an hour to
page just under 2GB of memory back in, slowing the virtual machines down
to a crawl.
This patch makes swapin readahead simply skip over holes, instead of
stopping at them. This allows the system to swap things back in at rates
of several MB/second, instead of a few hundred kB/second.
The checks done in valid_swaphandles are already done in
read_swap_cache_async as well, allowing us to remove a fair amount of
code.
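The difference between stopping at holes and skipping them can be shown with a toy swap map. Everything in this sketch is hypothetical: swap_map[i] == 0 marks an unused slot, slot 0 is assumed to hold the swap header, and the window is the aligned block of 2^page_cluster slots around the faulting offset.

```c
#include <stddef.h>

/*
 * Count the slots a readahead around 'offset' would read.
 * Assumes offset < nslots and page_cluster smaller than the bit width
 * of size_t. With skip_holes == 0 the scan extends from 'offset' in both
 * directions and stops at the first hole; otherwise holes are skipped.
 */
static size_t toy_readahead_count(const unsigned char *swap_map,
				  size_t nslots, size_t offset,
				  unsigned int page_cluster, int skip_holes)
{
	size_t mask = ((size_t)1 << page_cluster) - 1;
	size_t start = offset & ~mask;
	size_t end = offset | mask;
	size_t count = 0;

	if (start == 0)
		start = 1;		/* never read the header slot */
	if (end >= nslots)
		end = nslots - 1;

	if (skip_holes) {
		for (size_t i = start; i <= end; i++)
			if (swap_map[i])
				count++;
		return count;
	}

	for (size_t i = offset; i <= end && swap_map[i]; i++)
		count++;
	for (size_t i = offset; i > start && swap_map[i - 1]; i--)
		count++;
	return count;
}
```

On a fragmented map the hole-skipping variant still fills most of the window, while the older behaviour degenerates to reading one or two pages per fault.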
[[email protected]: fix it for page_cluster >= 32]
Signed-off-by: Rik van Riel <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Adrian Drzewiecki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The value of nr_reclaimed is the number of pages reclaimed in the current
round of the loop, whereas nr_to_reclaim should be compared with the
number of pages reclaimed in all rounds.
In each round of the loop, reclaimed pages are cut off from the reclaim
goal, and the loop stops once the goal is achieved.
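The corrected loop shape can be sketched as follows. The function and its per_round input are hypothetical; the point is that the remaining goal shrinks each round, so a per-round count is always compared against what is still outstanding.

```c
/*
 * Reclaim towards 'goal' pages in rounds, where per_round[i] is what
 * round i manages to reclaim. Returns the number of rounds executed.
 */
static unsigned int toy_reclaim_rounds(const unsigned long *per_round,
				       unsigned int max_rounds,
				       unsigned long goal)
{
	unsigned long nr_to_reclaim = goal;
	unsigned int rounds = 0;

	while (rounds < max_rounds) {
		unsigned long nr_reclaimed = per_round[rounds++];

		if (nr_reclaimed >= nr_to_reclaim)
			break;			/* goal met over all rounds */
		nr_to_reclaim -= nr_reclaimed;	/* cut it off the goal */
	}
	return rounds;
}
```

With 10 pages reclaimed per round and a goal of 25, the loop stops after three rounds; comparing each round's 10 against the full 25 would never terminate it early.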
Signed-off-by: Hillf Danton <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make get_mm_counter() always static inline; it is simple enough for that.
Also remove the unused set_mm_counter().
bloat-o-meter:
add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
function old new delta
try_to_unmap_one 886 952 +66
sys_remap_file_pages 1214 1230 +16
dup_mm 1684 1700 +16
do_exit 2277 2278 +1
zap_page_range 208 205 -3
unmap_region 304 296 -8
static.oom_kill_process 554 546 -8
try_to_unmap_file 1716 1700 -16
getrusage 925 909 -16
flush_old_exec 1704 1688 -16
static.dump_header 416 390 -26
acct_update_integrals 218 187 -31
do_task_stat 2986 2954 -32
get_mm_counter 34 - -34
xacct_add_tsk 371 334 -37
task_statm 172 118 -54
task_mem 383 323 -60
try_to_unmap_one() grows because update_hiwater_rss() is now completely inlined.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With tons of reclaim_mode (defined as one field of struct scan_control)
already in the file, it is clearer to rename the local reclaim_mode when
setting up the isolation mode.
Signed-off-by: Hillf Danton <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Warn about non-zero rss counters at final mmdrop.
This check will prevent recurrences of bugs such as that fixed in "mm:
fix rss count leakage during migration".
I didn't hide this check under CONFIG_VM_DEBUG because it is rather small
and rss counters cover the whole of page-table management, so this is a
good invariant.
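A rough user-space model of such a check is sketched below. The struct, the counter count and the message text are assumptions chosen for illustration; they are not taken from the patch itself.

```c
#include <stdio.h>

#define NR_COUNTERS 3	/* assumed number of counters, illustration only */

/* Hypothetical stand-in for the per-mm rss counters. */
struct mm_model {
    long rss_stat[NR_COUNTERS];
};

/*
 * Called when the last reference to the model mm goes away.
 * Every counter should be back at zero by then; warn about each one
 * that is not and return how many were found non-zero.
 */
int check_counters(const struct mm_model *mm)
{
    int i, bad = 0;

    for (i = 0; i < NR_COUNTERS; i++) {
        if (mm->rss_stat[i] != 0) {
            fprintf(stderr, "warning: counter %d is %ld at final drop\n",
                    i, mm->rss_stat[i]);
            bad++;
        }
    }
    return bad;
}
```

The loop is a handful of compares on a path that runs once per address space, which is in line with the argument that the check is cheap enough to leave enabled unconditionally.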
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
printk_ratelimit() uses the global ratelimit state for all printks. The
oom killer should not be subjected to this state just because another
subsystem or driver may be flooding the kernel log.
This patch introduces printk ratelimiting specifically for the oom killer.
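The difference between one shared ratelimit state and a dedicated one can be modelled as below. This is a sketch under stated assumptions: struct rl_state and rl_allow() are hypothetical names, time is passed in explicitly as an abstract tick count, and nothing here is the kernel's ratelimit implementation.

```c
/* Hypothetical ratelimit state; one instance per message source. */
struct rl_state {
    long interval;	/* window length, in arbitrary ticks */
    int burst;		/* messages allowed per window */
    long begin;		/* start of the current window */
    int printed;	/* messages emitted in the current window */
};

/* Returns 1 if a message may be printed at time 'now', 0 if suppressed. */
int rl_allow(struct rl_state *rs, long now)
{
    if (now - rs->begin >= rs->interval) {
        /* A new window starts: reset the budget. */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    return 0;
}
```

With a single shared rl_state, a flooding driver exhausts the budget for everyone. Giving the oom killer its own rl_state means its messages are limited only by how often the oom killer itself prints.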
Signed-off-by: David Rientjes <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If a thread is chosen for oom kill and is already PF_EXITING, then the oom
killer simply sets TIF_MEMDIE and returns. This allows the thread to have
access to memory reserves so that it may quickly exit. This logic is
preceded by a comment saying there's no need to alarm the sysadmin.
This patch adds truth to that statement.
There's no need to emit any warning about the oom condition if the thread
is already exiting since it will not be killed. In this condition, just
silently return from the oom killer since it's only giving access to
memory reserves and is otherwise a no-op.
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
oom_kill_task() has a single caller, so fold it into its parent function,
oom_kill_process(). Slightly reduces the number of lines in the oom
killer.
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
oom_kill_task() returns non-zero iff the chosen process does not have any
threads with an attached ->mm.
In such a case, it's better to just return to the page allocator and retry
the allocation because memory could have been freed in the interim and the
oom condition may no longer exist. It's unnecessary to loop in the oom
killer and find another thread to kill.
This allows both oom_kill_task() and oom_kill_process() to be converted to
void functions. If the oom condition persists, the oom killer will be
recalled.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures. In the past some architectures got this code wrong, so
using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: "David S. Miller" <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As described in commit e6fa16ab9c1e ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
setup_frame() needs to return an indication of whether it succeeded or
failed in setting up the signal stack frame. If setup_frame() fails then
we must not modify current->blocked.
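The rule can be illustrated with a small user-space model. All names below (task_model, setup_frame_model, handle_signal_model) are hypothetical, and the frame_ok flag merely simulates whether writing the signal frame succeeded.

```c
/* Hypothetical stand-in for the task's blocked signal mask. */
struct task_model {
    unsigned long blocked;
};

/* Simulated frame setup: returns 0 on success, -1 on failure. */
int setup_frame_model(int frame_ok)
{
    return frame_ok ? 0 : -1;
}

/*
 * Only add the handler's mask to the blocked set once the frame has
 * been set up successfully; on failure the mask is left untouched.
 */
int handle_signal_model(struct task_model *t, unsigned long sa_mask,
                        int frame_ok)
{
    int ret = setup_frame_model(frame_ok);

    if (ret)
        return ret;	/* failed: do not modify t->blocked */

    t->blocked |= sa_mask;
    return 0;
}
```

If the return value were ignored, a task whose frame setup failed would still end up with extra signals blocked even though its handler never ran.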
Acked-by: Oleg Nesterov <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
get_signal_to_deliver() already resets the signal handler if SA_ONESHOT
is set in ka->sa.sa_flags, there's no need to do it again in
handle_signal().
Furthermore, because we were modifying ka->sa.sa_handler (which is a
copy of sighand->action[]) instead of sighand->action[] the original
code actually had no effect on signal delivery.
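Why writing to the copy is a no-op can be seen in a minimal user-space model. The types and function names here are hypothetical and only mimic the copy-versus-table relationship described above.

```c
#include <stddef.h>

typedef void (*handler_t)(int);

/* Hypothetical stand-in for one entry of the action table. */
struct action_model {
    handler_t handler;
};

void dummy_handler(int sig)
{
    (void)sig;
}

/*
 * Mirrors the old behaviour: the reset is applied to a local copy.
 * Returns 1 if the table entry ended up reset, 0 otherwise.
 */
int reset_on_copy(struct action_model *table, int sig)
{
    struct action_model ka = table[sig];	/* local copy */

    ka.handler = NULL;	/* only the copy changes */
    (void)ka;
    return table[sig].handler == NULL;
}

/* Resets the stored entry itself. */
void reset_in_place(struct action_model *table, int sig)
{
    table[sig].handler = NULL;
}
```

reset_on_copy() always leaves the table entry as it was, which is the sense in which the removed code had no effect on later signal delivery.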
Acked-by: Oleg Nesterov <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of open coding the sequence from force_sigsegv() just call it.
This also fixes a bug because we were modifying ka->sa.sa_handler (which
is a copy of sighand->action[]), whereas the intention of the code was to
modify sighand->action[] directly.
As the original code was working with a copy it had no effect on signal
delivery.
Acked-by: Oleg Nesterov <[email protected]>
Cc: Chris Zankel <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The following program illustrates the problem:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
	char buf[8192];
	int fd = open("/proc/self/maps", O_RDONLY);
	ssize_t n = pread(fd, buf, sizeof(buf), 0);
	printf("%zd\n", n);
	/* lseek(fd, 0, SEEK_CUR); */ /* Uncomment to work around */
	n = pread(fd, buf, sizeof(buf), 0);
	printf("%zd\n", n);
	return 0;
}
The second printf() prints zero, but uncommenting the lseek() corrects its
behaviour.
To fix, make seq_read() mirror seq_lseek() when processing changes in
*ppos. Restore m->version first, then if required traverse and update
read_pos on success.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=11856
Signed-off-by: Earl Chew <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix code indentation in the function intel_idle_cpu_init that looks
confusing.
Suggested-by: Srivatsa S. Bhat <[email protected]>
Reviewed-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Cc: "Brown, Len" <[email protected]>
Cc: Len Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
i386 allnoconfig:
fs/namei.c: In function 'has_zero':
fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
fs/namei.c:1617: warning: integer constant is too large for 'unsigned long' type
fs/namei.c: In function 'hash_name':
fs/namei.c:1635: warning: integer constant is too large for 'unsigned long' type
There must be a tidier way of doing this.
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem held in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables from going away from under them; during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd becomes a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
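The snapshot idea can be sketched in user space as follows. This is an illustration under assumptions, not the patch: the 64-bit entry, the MODEL_HUGE_BIT flag and the function name are made up, and the barrier macro uses GCC/Clang inline asm syntax.

```c
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax); emits no CPU instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

#define MODEL_HUGE_BIT 0x80u	/* assumed flag bit, illustration only */

enum pmd_kind { PMD_NONE, PMD_HUGE, PMD_TABLE };

/*
 * Read the entry once into a local, then make every decision on that
 * local value. The barrier keeps the compiler from re-reading *pmd in
 * place of 'val', so the "none" and "huge" tests always see the same
 * value even if another thread changes the entry concurrently.
 */
enum pmd_kind classify_snapshot(const uint64_t *pmd)
{
    uint64_t val = *pmd;

    compiler_barrier();

    if (val == 0)
        return PMD_NONE;
    if (val & MODEL_HUGE_BIT)
        return PMD_HUGE;
    return PMD_TABLE;
}
```

The point of the design is that the two checks can no longer disagree with each other; whether the real entry changes afterwards does not matter for the decision already taken.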
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude that different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(&current->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[[email protected]: checkpatch fixes]
Reported-by: Ulrich Obergfell <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Jones <[email protected]>
Acked-by: Larry Woodman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: <[email protected]> [2.6.38+]
Cc: Mark Salter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some BIOSes don't set up power management correctly (what else is
new) and don't allow use of PCI Express power control. Add a special
exception module parameter to allow working around this issue.
Based on slightly different patch by Knut Petersen.
Reported-by: Arkadiusz Miskiewicz <[email protected]>
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates for 3.4 from David Teigland:
"This set includes one trivial fix, and one simple recovery speed up.
Directory recovery can use the standard hash table to find resources
rather than always searching the linear recovery list."
* tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: last element of dlm_local_addr[] never used
dlm: fix slow rsb search in dir recovery
|
|
napi->skb is allocated in napi_get_frags() using
netdev_alloc_skb_ip_align(), with a reserve of NET_SKB_PAD +
NET_IP_ALIGN bytes.
However, when such skb is recycled in napi_reuse_skb(), it ends with a
reserve of NET_IP_ALIGN which is suboptimal.
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Herbert Xu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
|
|
Pull munmap/truncate race fixes from Al Viro:
"Fixes for racy use of unmap_vmas() on truncate-related codepaths"
* 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VM: make zap_page_range() callers that act on a single VMA use separate helper
VM: make unmap_vmas() return void
VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap()
VM: make zap_page_range() return void
VM: can't go through the inner loop in unmap_vmas() more than once...
VM: unmap_page_range() can return void
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates for 3.4 from James Morris:
"The main addition here is the new Yama security module from Kees Cook,
which was discussed at the Linux Security Summit last year. Its
purpose is to collect miscellaneous DAC security enhancements in one
place. This also marks a departure in policy for LSM modules, which
were previously limited to being standalone access control systems.
Chromium OS is using Yama, and I believe there are plans for Ubuntu,
at least.
This patchset also includes maintenance updates for AppArmor, TOMOYO
and others."
Fix trivial conflict in <net/sock.h> due to the jump_label->static_key
rename.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
AppArmor: Fix location of const qualifier on generated string tables
TOMOYO: Return error if fails to delete a domain
AppArmor: add const qualifiers to string arrays
AppArmor: Add ability to load extended policy
TOMOYO: Return appropriate value to poll().
AppArmor: Move path failure information into aa_get_name and rename
AppArmor: Update dfa matching routines.
AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
AppArmor: Add const qualifiers to generated string tables
AppArmor: Fix oops in policy unpack auditing
AppArmor: Fix error returned when a path lookup is disconnected
KEYS: testing wrong bit for KEY_FLAG_REVOKED
TOMOYO: Fix mount flags checking order.
security: fix ima kconfig warning
AppArmor: Fix the error case for chroot relative path name lookup
AppArmor: fix mapping of META_READ to audit and quiet flags
AppArmor: Fix underflow in xindex calculation
AppArmor: Fix dropping of allowed operations that are force audited
AppArmor: Add mising end of structure test to caps unpacking
...
|