path: root/mm/hugetlb.c
Age         Commit message  (Author, files changed, lines -/+)
2014-06-04  hugetlb: rename hugepage_migration_support() to ..._supported()  (Naoya Horiguchi, 1 file, -1/+1)
We already have a function named hugepages_supported(), and the similar name hugepage_migration_support() is a bit uncomfortable, so let's rename it hugepage_migration_supported().
Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  mm, hugetlb: move the error handle logic out of normal code path  (Jianyu Zhan, 1 file, -13/+13)
alloc_huge_page() currently mixes the normal code path with error handling logic. This patch moves the error handling logic out, to make the normal code path cleaner and reduce code duplication.
Signed-off-by: Jianyu Zhan <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  hugetlb: add support for gigantic page allocation at runtime  (Luiz Capitulino, 1 file, -11/+155)
HugeTLB is limited to allocating hugepages whose size is less than MAX_ORDER order. This is because HugeTLB allocates hugepages via the buddy allocator. Gigantic pages (that is, pages whose size is greater than MAX_ORDER order) have to be allocated at boottime. However, boottime allocation has at least two serious problems. First, it doesn't support NUMA and second, gigantic pages allocated at boottime can't be freed.

This commit solves both issues by adding support for allocating gigantic pages during runtime. It works just like regular sized hugepages, meaning that the interface in sysfs is the same, it supports NUMA, and gigantic pages can be freed.

For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G gigantic pages on node 1, one can do:

  # echo 2 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And to free them all:

  # echo 0 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

The one problem with gigantic page allocation at runtime is that it can't be serviced by the buddy allocator. To overcome that problem, this commit scans all zones from a node looking for a large enough contiguous region. When one is found, it's allocated by using CMA, that is, we call alloc_contig_range() to do the actual allocation. For example, on x86_64 we scan all zones looking for a 1GB contiguous region. When one is found, it's allocated by alloc_contig_range().

One expected issue with that approach is that such gigantic contiguous regions tend to vanish as runtime goes by. The best way to avoid this for now is to make gigantic page allocations very early during system boot, say from an init script. Other possible optimizations include using compaction, which is supported by CMA but is not explicitly used by this commit.

It's also important to note the following:

1. Gigantic pages allocated at boottime by the hugepages= command-line option can be freed at runtime just fine.
2. This commit adds support for gigantic pages only to x86_64. The reason is that I don't have access to nor experience with other archs. The code is arch independent though, so it should be simple to add support to different archs.
3. I didn't add support for hugepage overcommit, that is, allocating a gigantic page on demand when /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't think it's reasonable to do the hard and long work required for allocating a gigantic page at fault time. But it should be simple to add this if wanted.

[[email protected]: coding-style fixes]
Signed-off-by: Luiz Capitulino <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  hugetlb: move helpers up in the file  (Luiz Capitulino, 1 file, -73/+73)
The next commit will add new code that wants to call the for_each_node_mask_to_alloc() macro. Move it, its buddy for_each_node_mask_to_free(), and their dependencies up in the file so the new code can use them. This is just code movement, no logic change.
Signed-off-by: Luiz Capitulino <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  hugetlb: update_and_free_page(): don't clear PG_reserved bit  (Luiz Capitulino, 1 file, -2/+2)
Huge pages never get the PG_reserved bit set, so don't clear it. However, note that if the bit gets mistakenly set, free_pages_check() will catch it.
Signed-off-by: Luiz Capitulino <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  hugetlb: add hstate_is_gigantic()  (Luiz Capitulino, 1 file, -14/+14)
Signed-off-by: Luiz Capitulino <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: David Rientjes <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04  hugetlb: prep_compound_gigantic_page(): drop __init marker  (Luiz Capitulino, 1 file, -2/+1)
The HugeTLB subsystem uses the buddy allocator to allocate hugepages during runtime. This means that hugepage allocation during runtime is limited to MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes greater than MAX_ORDER), this in turn means that those pages can't be allocated at runtime.

HugeTLB supports gigantic page allocation during boottime, via the boot allocator. To this end the kernel provides the command-line options hugepagesz= and hugepages=, which can be used to instruct the kernel to allocate N gigantic pages during boot.

For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can be allocated and freed at runtime. If one wants to allocate 1G gigantic pages, this has to be done at boot via the hugepagesz= and hugepages= command-line options.

Now, gigantic page allocation at boottime has two serious problems:

1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel evenly distributes boottime allocated hugepages among nodes. For example, suppose you have a four-node NUMA machine and want to allocate four 1G gigantic pages at boottime. The kernel will allocate one gigantic page per node. On the other hand, we do have users who want to be able to specify which NUMA node gigantic pages should be allocated from, so that they can place virtual machines on a specific NUMA node.

2. Gigantic pages allocated at boottime can't be freed.

At this point it's important to observe that regular hugepages allocated at runtime don't have those problems. This is because the HugeTLB interface for runtime allocation in sysfs supports NUMA, and runtime allocated pages can be freed just fine via the buddy allocator.

This series adds support for allocating gigantic pages at runtime. It does so by allocating gigantic pages via CMA instead of the buddy allocator. Releasing gigantic pages is also supported via CMA. As this series builds on top of the existing HugeTLB interface, it makes gigantic page allocation and releasing work just like regular sized hugepages. This also means that NUMA support just works.

For example, to allocate two 1G gigantic pages on node 1, one can do:

  # echo 2 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And, to release all gigantic pages on the same node:

  # echo 0 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Please refer to patch 5/5 for full technical details.

Finally, please note that this series is a follow-up to a previous series that tried to extend the command-line option set to be NUMA aware: http://marc.info/?l=linux-mm&m=139593335312191&w=2 During the discussion of that series it was agreed that having runtime allocation support for gigantic pages was a better solution.

This patch (of 5): This function is going to be used by non-init code in a future commit.

Signed-off-by: Luiz Capitulino <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-05-06  hugetlb: ensure hugepage access is denied if hugepages are not supported  (Nishanth Aravamudan, 1 file, -5/+14)
Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is not correctly setting itself up in this state:

  Unable to handle kernel paging request for data at address 0x00000031
  Faulting instruction address: 0xc000000000245710
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  ....

In KVM guests on Power, in a guest not backed by hugepages, we see the following:

  AnonHugePages:         0 kB
  HugePages_Total:       0
  HugePages_Free:        0
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:         64 kB

HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot time, but this is only checked in hugetlb_init(). Extract the check into a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime.

[[email protected]: use pr_info(), per Mel]
[[email protected]: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
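A minimal sketch of the kind of helper described above, assuming the check is simply "was a huge page size configured at boot" (HPAGE_SHIFT != 0); the exact form and the message text in the kernel may differ.

    /* Sketch: hugepages are usable only if a huge page size was set up at boot. */
    static inline bool hugepages_supported(void)
    {
        return HPAGE_SHIFT != 0;
    }

    /* Callers such as hugetlb_init() and the hugetlbfs registration path
     * can then bail out early (message text illustrative): */
    if (!hugepages_supported()) {
        pr_info("hugetlb: disabling because there are no supported hugepage sizes\n");
        return 0;
    }
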
2014-04-18  mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()  (Mizuma, Masayoshi, 1 file, -0/+1)
The soft lockup in freeing gigantic hugepages fixed in commit 55f67141a892 ("mm: hugetlb: fix softlockup when a large number of hugepages are freed.") can also happen in return_unused_surplus_pages(), so let's fix it there as well.
Signed-off-by: Masayoshi Mizuma <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07  mm: hugetlb: fix softlockup when a large number of hugepages are freed.  (Mizuma, Masayoshi, 1 file, -0/+1)
When I decrease the value of nr_hugepage in procfs a lot, a softlockup happens. It is because there is no chance of a context switch during this process. On the other hand, when I allocate a large number of hugepages, there is some chance of a context switch, hence softlockup doesn't happen during that process. So it's necessary to add a context switch in the freeing process, the same as in the allocating process, to avoid softlockup.

When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing process occupied a CPU for over 150 seconds and the following softlockup message appeared twice or more.

  $ echo 6000000 > /proc/sys/vm/nr_hugepages
  $ cat /proc/sys/vm/nr_hugepages
  6000000
  $ grep ^Huge /proc/meminfo
  HugePages_Total:   6000000
  HugePages_Free:    6000000
  HugePages_Rsvd:    0
  HugePages_Surp:    0
  Hugepagesize:      2048 kB
  $ echo 0 > /proc/sys/vm/nr_hugepages

  BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883]
  ...
  Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
  Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

I have not confirmed this problem with upstream kernels because I am not able to prepare a machine equipped with 12 TB of memory now. However, I confirmed that the amount of decreasing hugepages was directly proportional to the amount of required time.

I measured the required times on a smaller machine. It showed 130-145 hugepages decreased in a millisecond.

  Amount of decreasing    Required time    Decreasing rate
  hugepages               (msec)           (pages/msec)
  ------------------------------------------------------
  10,000 pages == 20GB    70 - 74          135-142
  30,000 pages == 60GB    208 - 229        131-144

It means a decrement of 6 TB of hugepages will trigger softlockup with the default threshold of 20 sec, at this decreasing rate.

Signed-off-by: Masayoshi Mizuma <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
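The fix follows a common pattern: when tearing down a very large number of pages under hugetlb_lock, periodically give up the spinlock and allow a reschedule. A hedged sketch of the freeing loop with the added resched point (helper names as in mm/hugetlb.c of that era, loop simplified):

    /* cond_resched_lock() drops and re-takes the spinlock around a
     * reschedule when one is due, so the loop can no longer pin a CPU
     * long enough to trip the soft-lockup watchdog. */
    spin_lock(&hugetlb_lock);
    while (min_count < persistent_huge_pages(h)) {
        if (!free_pool_huge_page(h, nodes_allowed, 0))
            break;
        cond_resched_lock(&hugetlb_lock);
    }
    spin_unlock(&hugetlb_lock);
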
2014-04-07  mm: fix 'ERROR: do not initialise globals to 0 or NULL' and coding style  (Choi Gi-yong, 1 file, -2/+1)
Signed-off-by: Choi Gi-yong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07  mm: use macros from compiler.h instead of __attribute__((...))  (Gideon Israel Dsouza, 1 file, -1/+2)
To increase compiler portability there is <linux/compiler.h> which provides convenience macros for various gcc constructs. Eg: __weak for __attribute__((weak)). I've replaced all instances of gcc attributes with the right macro in the memory management (/mm) subsystem. [[email protected]: while-we're-there consistency tweaks] Signed-off-by: Gideon Israel Dsouza <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07  mm: move mmu notifier call from change_protection to change_pmd_range  (Rik van Riel, 1 file, -0/+2)
The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wasted can be enough to trigger soft lockup warnings with very large KVM guests. This patch moves the mmu notifier call to the pmd level, which represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier code is only called from the address in the PMD where present mappings are first encountered. The hugetlbfs code is left alone for now; hugetlb mappings are not relocatable, and as such are left alone by the NUMA code, and should never trigger this problem to begin with. Signed-off-by: Rik van Riel <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Reported-by: Xing Gang <[email protected]> Tested-by: Chegu Vinod <[email protected]> Cc: Sasha Levin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-07  mm/hugetlb.c: add NULL check of return value of huge_pte_offset  (Naoya Horiguchi, 1 file, -2/+3)
huge_pte_offset() could return NULL, so we need a NULL check to avoid potential NULL pointer dereferences.
Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
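The pattern being hardened is simply the following (a sketch; the exact call sites are in the patch):

    pte_t *ptep = huge_pte_offset(mm, address);
    if (!ptep)
        return;    /* no huge page table entry is mapped here */
    /* ... only dereference ptep after the check ... */
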
2014-04-03  mm, hugetlb: mark some bootstrap functions as __init  (David Rientjes, 1 file, -2/+3)
Both prep_compound_huge_page() and prep_compound_gigantic_page() are only called at bootstrap and can be marked as __init. The __SetPageTail(page) in prep_compound_gigantic_page() happening before page->first_page is initialized is not concerning since this is bootstrap. Signed-off-by: David Rientjes <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: Joonsoo Kim <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03  mm, hugetlb: improve page-fault scalability  (Davidlohr Bueso, 1 file, -13/+71)
The kernel can currently only handle a single hugetlb page fault at a time. This is due to a single mutex that serializes the entire path. This lock protects from spurious OOM errors under conditions of low availability of free hugepages. This problem is specific to hugepages, because it is normal to want to use every single hugepage in the system; with normal pages we simply assume there will always be a few spare pages which can be used temporarily until the race is resolved.

Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized. The hash key is selected depending on the mapping type. For shared ones it consists of the address space and file offset being faulted; for private ones the mm and virtual address are used. The size of the table is selected based on a compromise between collisions and the memory footprint of a series of database workloads.

Large database workloads that make heavy use of hugepages can be particularly exposed to this issue, causing start-up times to be painfully slow. This patch reduces the startup time of a 10 GB Oracle DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads will naturally benefit even more.

NOTE: The only downside to this patch, detected by Joonsoo Kim, is that a small race is possible in private mappings: a child process (with its own mm, after cow) can instantiate a page that is already being handled by the parent in a cow fault. When low on pages, this can trigger spurious OOMs. I have not been able to think of an efficient way of handling this... but do we really care about such a tiny window? We already maintain another theoretical race with normal pages. If not, one possible way is to maintain the single hash for private mappings; any workloads that *really* suffer from this scaling problem should already use shared mappings.

[[email protected]: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: David Gibson <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
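A hedged sketch of the hashing scheme described above; the names (num_fault_mutexes, the key layout) are illustrative rather than the exact upstream code:

    /* Pick one mutex out of a small table.  Shared mappings hash on
     * (address_space, file offset); private mappings hash on
     * (mm, virtual address). */
    static u32 fault_mutex_hash_sketch(struct vm_area_struct *vma,
                                       struct address_space *mapping,
                                       pgoff_t idx, unsigned long address)
    {
        unsigned long key[2];

        if (vma->vm_flags & VM_SHARED) {
            key[0] = (unsigned long)mapping;
            key[1] = idx;
        } else {
            key[0] = (unsigned long)vma->vm_mm;
            key[1] = address;
        }
        return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) % num_fault_mutexes;
    }

The fault path then takes only the mutex selected by this hash, so faults on different hugepages can proceed in parallel while faults on the same page stay serialized.
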
2014-04-03  mm, hugetlb: use vma_resv_map() map types  (Joonsoo Kim, 1 file, -50/+45)
Until now, we get a resv_map in two different ways depending on the mapping type. This makes the code messy and unreadable. Unify it.
[[email protected]: code cleanups]
Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03  mm, hugetlb: remove resv_map_put  (Joonsoo Kim, 1 file, -12/+3)
This is a preparation patch to unify the use of vma_resv_map() regardless of the map type. This patch prepares it by removing resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not for all resv_maps. [[email protected]: update changelog] Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03  mm, hugetlb: fix race in region tracking  (Davidlohr Bueso, 1 file, -20/+38)
There is a race condition if we map the same file in different processes. Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex. When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only mmap_sem (exclusively). This doesn't prevent other tasks from modifying the region structure, so it can be modified by two processes concurrently.

To solve this, introduce a spinlock to resv_map and make the region manipulation functions grab it before they do the actual work.

[[email protected]: updated changelog]
Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Suggested-by: Joonsoo Kim <[email protected]> Acked-by: David Gibson <[email protected]> Cc: David Gibson <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
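Roughly, the change gives each resv_map its own lock for its regions list; a sketch of the resulting structure and locking pattern (field layout approximate, not necessarily the exact upstream definition):

    struct resv_map {
        struct kref      refs;
        spinlock_t       lock;     /* protects the regions list below */
        struct list_head regions;
    };

    /* Each region_add()/region_chg()/region_truncate()-style helper then
     * brackets its list walk with the per-resv_map lock: */
    spin_lock(&resv->lock);
    /* ... walk and modify resv->regions ... */
    spin_unlock(&resv->lock);
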
2014-04-03  mm, hugetlb: improve, cleanup resv_map parameters  (Joonsoo Kim, 1 file, -13/+17)
To change the protection method for region tracking to a fine-grained one, we pass the resv_map, instead of the list_head, to the region manipulation functions. This doesn't introduce any functional change; it is just preparation for the next step.
[[email protected]: update changelog]
Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03  mm, hugetlb: unify region structure handling  (Joonsoo Kim, 1 file, -16/+21)
Currently, to track reserved and allocated regions, we use two different ways depending on the mapping. For MAP_SHARED, we use the address mapping's private_list, while for MAP_PRIVATE, we use a resv_map. Now, we are preparing to change the coarse grained lock which protects the region structure to a fine grained lock, and this difference hinders it. So, before changing it, unify region structure handling, consistently using a resv_map regardless of the kind of mapping.
Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03  mm: optimize put_mems_allowed() usage  (Mel Gorman, 1 file, -2/+2)
Since put_mems_allowed() is strictly optional (it's a seqcount retry), we don't need to evaluate the function if the allocation was in fact successful, saving an smp_rmb and some loads and comparisons on some relatively fast paths. Since the naming get/put_mems_allowed() does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation.
Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
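The renamed interface reads like a standard seqcount retry loop; a hedged usage sketch (the allocation helper here is hypothetical):

    unsigned int cpuset_mems_cookie;
    struct page *page;

    do {
        cpuset_mems_cookie = read_mems_allowed_begin();
        page = allocate_from_allowed_nodes(gfp_mask, order);  /* hypothetical */
        /* Only a failed allocation pays for the retry check;
         * read_mems_allowed_retry() returns true when the cpuset's
         * mems_allowed changed underneath us. */
    } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
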
2014-01-23  mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE  (Sasha Levin, 1 file, -5/+5)
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. Based on recent requests to add a small piece of code that dumps the page at various VM_BUG_ON sites, I've noticed that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON.
[[email protected]: fix up includes]
Signed-off-by: Sasha Levin <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
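Conceptually the new macro is just VM_BUG_ON() with a page dump in front; a hedged approximation (the exact dump_page() signature and the non-debug stub have varied across kernel versions):

    #ifdef CONFIG_DEBUG_VM
    #define VM_BUG_ON_PAGE(cond, page)                                  \
        do {                                                            \
            if (unlikely(cond)) {                                       \
                dump_page(page);  /* flags, mapcount, mapping, index */ \
                BUG();                                                  \
            }                                                           \
        } while (0)
    #else
    #define VM_BUG_ON_PAGE(cond, page) do { } while (0)
    #endif
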
2014-01-21  mm/hugetlb.c: use memblock apis for early memory allocations  (Grygorii Strashko, 1 file, -5/+5)
Switch to memblock interfaces for the early memory allocator instead of the bootmem allocator. No functional change in behavior from the bootmem users' point of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers built on top of memblock. For the archs which still use bootmem, these new APIs just fall back to the existing bootmem APIs.
Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/hugetlb.c: call MMU notifiers when copying a hugetlb page range  (Andreas Sandberg, 1 file, -5/+16)
When copy_hugetlb_page_range() is called to copy a range of hugetlb mappings, the secondary MMUs are not notified if there is a protection downgrade, which breaks COW semantics in KVM. This patch adds the necessary MMU notifier calls. Signed-off-by: Andreas Sandberg <[email protected]> Acked-by: Steve Capper <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
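The shape of the fix is the usual notifier bracket around the copy loop; a hedged sketch of how copy_hugetlb_page_range() would wrap its existing work (range bounds and error handling simplified):

    int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                                struct vm_area_struct *vma)
    {
        unsigned long start = vma->vm_start;
        unsigned long end = vma->vm_end;
        int ret = 0;

        /* Tell secondary MMUs (e.g. KVM) that mappings in [start, end) are
         * about to change, so stale translations get invalidated before the
         * parent's PTEs are write-protected for COW. */
        mmu_notifier_invalidate_range_start(src, start, end);

        /* ... existing loop: copy huge PTEs, write-protect for private COW ... */

        mmu_notifier_invalidate_range_end(src, start, end);
        return ret;
    }
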
2014-01-21  mm/hugetlb.c: defer PageHeadHuge() symbol export  (Andrea Arcangeli, 1 file, -1/+0)
There is no actual need for it, so keep it internal.
Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/hugetlb.c: simplify PageHeadHuge() and PageHuge()  (Andrew Morton, 1 file, -10/+2)
Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: hugetlb: use get_page_foll() in follow_hugetlb_page()  (Andrea Arcangeli, 1 file, -1/+1)
get_page_foll() is more optimal and is always safe to use under the PT lock. More so for hugetlbfs as there's no risk of race conditions with split_huge_page regardless of the PT lock. Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-11-13  mmu_notifier: call mmu_notifier_invalidate_range() from VMM  (Joerg Roedel, 1 file, -1/+6)
Add calls to the new mmu_notifier_invalidate_range() function to all places in the VMM that need it. Signed-off-by: Joerg Roedel <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jay Cornwall <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Suravee Suthikulpanit <[email protected]> Cc: Jesse Barnes <[email protected]> Cc: David Woodhouse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Oded Gabbay <[email protected]>
2013-11-21  mm: hugetlbfs: fix hugetlbfs optimization  (Andrea Arcangeli, 1 file, -0/+17)
Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database caused by THP") can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field. Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases.

The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption.

A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too).

By re-establishing the _mapcount tail refcounting for all compound pages, this also fixes the below problem:

  echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

  BUG: Bad page state in process bash  pfn:59a01
  page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
  page flags: 0x1c00000000008000(tail)
  Modules linked in:
  CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

Signed-off-by: Khalid Aziz <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-21  mm: thp: give transparent hugepage code a separate copy_page  (Dave Hansen, 1 file, -34/+0)
Right now, the migration code in migrate_page_copy() uses copy_huge_page() for hugetlbfs and thp pages:

  if (PageHuge(page) || PageTransHuge(page))
      copy_huge_page(newpage, page);

So, yay for code reuse. But:

  void copy_huge_page(struct page *dst, struct page *src)
  {
      struct hstate *h = page_hstate(src);

and a non-hugetlbfs page has no page_hstate(). This works 99% of the time because page_hstate() determines the hstate from the page order alone. Since the page order of a THP page matches the default hugetlbfs page order, it works.

But, if you change the default huge page size on the boot command-line (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate so page_hstate() returns null and copy_huge_page() oopses pretty fast since copy_huge_page() dereferences the hstate:

  void copy_huge_page(struct page *dst, struct page *src)
  {
      struct hstate *h = page_hstate(src);

      if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
      ...

Mel noticed that the migration code is really the only user of these functions. This moves all the copy code over to migrate.c and makes copy_huge_page() work for THP by checking for it explicitly.

I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add THP migration for the NUMA working set scanning fault case")

[[email protected]: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Tested-by: Dave Jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-11-15  mm, hugetlb: convert hugetlbfs to use split pmd lock  (Kirill A. Shutemov, 1 file, -44/+66)
Hugetlb supports multiple page sizes. We use the split lock only for the PMD level, but not for the PUD level.
[[email protected]: coding-style fixes]
Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Tested-by: Alex Thorlton <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "Eric W . Biederman" <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Al Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jones <[email protected]> Cc: David Howells <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Robin Holt <[email protected]> Cc: Sedat Dilek <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
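The key helper in such a conversion picks the right lock for a given huge PTE; a hedged sketch of the logic (PMD-sized pages use the per-page-table split lock, anything larger falls back to mm->page_table_lock):

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm, pte_t *pte)
    {
        /* PMD-sized huge pages can use the split page-table lock ... */
        if (huge_page_size(h) == PMD_SIZE)
            return pmd_lockptr(mm, (pmd_t *)pte);
        /* ... while PUD-sized (and larger) ones keep the coarse per-mm lock. */
        VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
        return &mm->page_table_lock;
    }
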
2013-10-16  mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages  (Andrea Arcangeli, 1 file, -1/+15)
Commit 11feeb498086 ("kvm: optimize away THP checks in kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic compound pages. That commit depends on the assumption that PG_reserved is identical for all head and tail pages of a compound page. So that if get_user_pages returns a tail page, we don't need to check the head page in order to know if we deal with a reserved page that requires different refcounting. The assumption that PG_reserved is the same for head and tail pages is certainly correct for THP and regular hugepages, but gigantic hugepages allocated through bootmem don't clear the PG_reserved on the tail pages (the clearing of PG_reserved is done later only if the gigantic hugepage is freed). This patch corrects the gigantic compound page initialization so that we can retain the optimization in 11feeb498086. The cacheline was already modified in order to set PG_tail so this won't affect the boot time of large memory systems. [[email protected]: tweak comment layout and grammar] Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: andy123 <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-10-16  mm/hugetlb.c: correct missing private flag clearing  (Joonsoo Kim, 1 file, -0/+1)
We should clear the page's private flag when returning the page to the hugepage pool. Otherwise, a marked hugepage can be allocated to a user who tries to allocate a non-reserved hugepage. If this user fails to map this hugepage, he would try to return the page to the hugepage pool. Since this page has the private flag set, resv_huge_pages would mistakenly increase. This patch fixes this situation.
Signed-off-by: Joonsoo Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable  (Naoya Horiguchi, 1 file, -18/+14)
Now hugepage migration is enabled, although restricted to pmd-based hugepages for now (due to lack of testing). So we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes the GFP flags in hugepage allocation dependent on migration support, not only on the value of hugepages_treat_as_movable. It provides no change in behavior for architectures which do not support hugepage migration.
Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm: memory-hotplug: enable memory hotplug to handle hugepage  (Naoya Horiguchi, 1 file, -2/+69)
Until now we can't offline memory blocks which contain hugepages because a hugepage is considered an unmovable page. But now with this patch series, a hugepage has become movable, so by using hugepage migration we can offline such memory blocks.

What's different from other users of hugepage migration is that we need to decompose all the hugepages inside the target memory block into free buddy pages after hugepage migration, because otherwise free hugepages remaining in the memory block intervene with the memory offlining. For this reason we introduce the new functions dissolve_free_huge_page() and dissolve_free_huge_pages().

Other than that, what this patch does is straightforward: it adds hugepage migration code, that is, it adds hugepage handling to the functions which scan over pfns and collect hugepages to be migrated, and adds a hugepage allocation function to alloc_migrate_target().

As for larger hugepages (1GB for x86_64), it's not easy to do hotremove over them because they are larger than a memory block. So we now simply leave it to fail as it is.

[[email protected]: remove duplicated include]
Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm: mbind: add hugepage migration code to mbind()  (Naoya Horiguchi, 1 file, -0/+14)
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm: migrate: make core migration code aware of hugepage  (Naoya Horiguchi, 1 file, -1/+22)
Currently hugepage migration is available only for soft offlining, but it's also useful for some other users of page migration (clearly because users of hugepages can enjoy the benefits of mempolicy and memory hotplug). So this patchset tries to extend such users to support hugepage migration.

The target of this patchset is to enable hugepage migration for NUMA related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and memory hotplug.

This patchset does not add hugepage migration for memory compaction, because users of memory compaction mainly expect to construct thp by arranging raw pages, and there's little or no need to compact hugepages. CMA, another user of page migration, could benefit from hugepage migration, but is not enabled to support it for now (just because of lack of testing and expertise in CMA).

Hugepage migration of non pmd-based hugepages (for example 1GB hugepages on x86_64, or hugepages in architectures like ia64) is not enabled for now (again, because of lack of testing).

As for how these are achieved, I extended the API (migrate_pages()) to handle hugepages (with patches 1 and 2) and adjusted the code of each caller to check and collect movable hugepages (with patches 3-7). The remaining 2 patches are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is about making sure that we only migrate pmd-based hugepages. And patch 9 is about choosing an appropriate zone for hugepage allocation.

My test is mainly a functional one, simply kicking hugepage migration via each entry point and confirming that migration is done correctly. Test code is available here: git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

And I always run the libhugetlbfs test when changing hugetlbfs's code. With this patchset, no regression was found in the test.

This patch (of 9): Before enabling each user of page migration to support hugepage, this patch enables the list of pages for migration to link not only LRU pages, but also hugepages. As a result, putback_movable_pages() and migrate_pages() can handle both LRU pages and hugepages.

Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Andi Kleen <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: return a reserved page to a reserved pool if failed  (Joonsoo Kim, 1 file, -1/+12)
If we fail with a reserved page, just calling put_page() is not sufficient, because put_page() invokes free_huge_page() at the last step and it doesn't know whether a page comes from a reserved pool or not, so it doesn't do anything related to the reserve count. This makes the reserve count lower than needed, because the reserve count was already decreased in dequeue_huge_page_vma(). This patch fixes this situation.
Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: grab a page_table_lock after page_cache_release  (Joonsoo Kim, 1 file, -2/+3)
We don't need to hold the page_table_lock when we try to release a page, so defer grabbing the page_table_lock until after the release.
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Aneesh Kumar <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: remove useless check about mapping type  (Joonsoo Kim, 1 file, -2/+1)
is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that this mapping is private, so we don't need to check whether this mapping is shared or not. This patch is just a clean-up.
Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: fix subpool accounting handling  (Joonsoo Kim, 1 file, -4/+6)
If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved one. So, we should check the subpool counter when avoid_reserve is set. This patch implements it.
Signed-off-by: Joonsoo Kim <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: change variable name reservations to resv  (Joonsoo Kim, 1 file, -13/+13)
'reservations' is a rather long name for a variable, and we use 'resv_map' to represent 'struct resv_map' elsewhere. To reduce confusion and unreadability, change it.
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: protect reserved pages when soft offlining a hugepage  (Joonsoo Kim, 1 file, -2/+3)
Don't use the reserve pool when soft offlining a hugepage. Check that we have free pages outside the reserve pool before we dequeue the huge page. Otherwise, we can steal another task's reserved page.
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache  (Joonsoo Kim, 1 file, -8/+26)
If a vma with VM_NORESERVE allocates a new page for the page cache, we should check whether this area is reserved or not. If this address is already reserved by another process (in the case of chg == 0), we should decrement the reserve count, because this allocated page will go into the page cache and, currently, there is no way to know whether this page came from the reserved pool or not when releasing the inode. This may introduce an over-counting problem for the reserved count. With the following example code, you can easily reproduce this situation.

Assume 2MB, nr_hugepages = 100

  size = 20 * MB;
  flag = MAP_SHARED;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
      return -1;
  }

  flag = MAP_SHARED | MAP_NORESERVE;
  q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (q == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
  }
  q[0] = 'c';

After the program finishes, run 'cat /proc/meminfo'. You can see the below result.

  HugePages_Free:      100
  HugePages_Rsvd:        1

To fix this, we should check our mapping type and the tracked region. If our mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this implies that the currently allocated page will go into the page cache, which was already reserved when the mapping was created. In this case, we should decrease the reserve count. As implemented above, this patch solves the problem.

[[email protected]: fix spelling in comment]
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: remove decrement_hugepage_resv_vma()  (Joonsoo Kim, 1 file, -21/+10)
Now the condition checked by decrement_hugepage_resv_vma() is the same as the one in vma_has_reserves(), so we can clean up this function by using vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only one call site, so we can remove the function and embed it into dequeue_huge_page_vma() directly. This patch implements it.
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()  (Joonsoo Kim, 1 file, -0/+2)
If we map a region with MAP_NORESERVE and MAP_SHARED, we skip the reserve counting check and eventually we cannot be ensured to allocate a huge page at fault time. With the following example code, you can easily find this situation.

Assume 2MB, nr_hugepages = 100

  fd = hugetlbfs_unlinked_fd();
  if (fd < 0)
      return 1;

  size = 200 * MB;
  flag = MAP_SHARED;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
      return -1;
  }

  size = 2 * MB;
  flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
  }
  p[0] = '0';
  sleep(10);

During the sleep(10), run 'cat /proc/meminfo' from another process.

  HugePages_Free:       99
  HugePages_Rsvd:      100

The number of free pages should be higher than or equal to the number of reserved pages, but it isn't. This shows that a non-reserved shared mapping steals a reserved page. Non-reserved shared mappings should not eat into reserve space.

If we consider VM_NORESERVE in vma_has_reserves() and return 0, which means that we don't have reserved pages, then we check that we have enough free pages in dequeue_huge_page_vma(). This prevents stealing a reserved page.

With this change, the above test generates a SIGBUS, which is correct, because all free pages are reserved and a non-reserved shared mapping can't get a free page.

Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: do not use a page in page cache for cow optimization  (Joonsoo Kim, 1 file, -5/+2)
Currently, we use a page with a mapped count of 1 in the page cache for the cow optimization. If we find this condition, we don't allocate a new page and copy the contents. Instead, we map this page directly. This may introduce a problem: writing to a private mapping overwrites the hugetlb file directly. You can find this situation with the following code.

  size = 20 * MB;
  flag = MAP_SHARED;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
      return -1;
  }
  p[0] = 's';
  fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
  munmap(p, size);

  flag = MAP_PRIVATE;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
  }
  p[0] = 'c';
  munmap(p, size);

  flag = MAP_SHARED;
  p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
  if (p == MAP_FAILED) {
      fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
      return -1;
  }
  fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
  munmap(p, size);

We can see "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE WRITE: s". If we turn off this optimization for a page in the page cache, the problem disappears. So, change the trigger condition of the optimization: if the page is not an AnonPage, don't do the optimization. This turns the optimization off for page cache pages.

Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()  (Joonsoo Kim, 1 file, -5/+2)
If list is empty, list_for_each_entry_safe() doesn't do anything. So, this check is redundant. Remove it. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-09-11  mm, hugetlb: fix and clean-up node iteration code to alloc or free  (Joonsoo Kim, 1 file, -82/+61)
The current node iteration code has a minor problem: it does one extra node rotation if the allocation doesn't succeed. For example, if we start to allocate at node 0, we stop iterating at node 0, but then we start the next allocation at node 1. I introduce the new macros "for_each_node_mask_to_[alloc|free]" and fix and clean up the node iteration code for alloc and free. This makes the code more understandable.
Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
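The new macros hide the "start where we left off, visit each allowed node at most once, remember where the next pass should start" bookkeeping; a hedged sketch of how an allocation path uses one of them (the page-allocation helper name is from mm/hugetlb.c of that era and is assumed here):

    /* Round-robin over the allowed nodes, starting at the hstate's saved
     * position, trying each node at most once per allocation attempt. */
    int nr_nodes, node;
    struct page *page = NULL;

    for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
        page = alloc_fresh_huge_page_node(h, node);
        if (page)
            break;
    }
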