path: root/mm
Age  Commit message  (Author, files changed, lines -removed/+added)
2016-01-14  mm: memcontrol: generalize the socket accounting jump label  (Johannes Weiner, 1 file, -0/+3)
The unified hierarchy memory controller is going to use this jump label as well to control the networking callbacks. Move it to the memory controller code and give it a more generic name. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  net: tcp_memcontrol: simplify linkage between socket and page counter  (Johannes Weiner, 1 file, -35/+22)
There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  net: tcp_memcontrol: sanitize tcp memory accounting callbacks  (Johannes Weiner, 1 file, -0/+32)
There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole. Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchronously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off. As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label  (Johannes Weiner, 1 file, -31/+25)
Move the jump-label from sock_update_memcg() and sock_release_memcg() to the callsite, and so eliminate those function calls when socket accounting is not enabled. This also eliminates the need for dummy functions because the calls will be optimized away if the Kconfig options are not enabled. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: David S. Miller <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
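For illustration, the callsite pattern described above looks roughly like the sketch below; the key and helper names are placeholders rather than the exact symbols touched by the patch:

    /* Key name is illustrative; the real key lives in the memcg code. */
    DEFINE_STATIC_KEY_FALSE(socket_accounting_key);

    static void sk_account_memcg(struct sock *sk)
    {
            /* actual per-memcg accounting work goes here */
    }

    void example_accept_path(struct sock *sk)
    {
            /*
             * The static branch is a NOP until the key is enabled, and
             * when the Kconfig option is off the compiler drops the call
             * entirely, so no dummy stub functions are needed.
             */
            if (static_branch_unlikely(&socket_accounting_key))
                    sk_account_memcg(sk);
    }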
2016-01-14  mm: memcontrol: export root_mem_cgroup  (Johannes Weiner, 2 files, -4/+3)
A later patch will need this symbol in files other than memcontrol.c, so export it now and replace mem_cgroup_root_css at the same time. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David S. Miller <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/ksm.c: use list_for_each_entry_safe  (Geliang Tang, 1 file, -13/+7)
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
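The pattern is the usual one for this helper; the struct and member names below are illustrative, not the exact ones in mm/ksm.c:

    struct item {
            struct list_head link;
    };

    static void free_all(struct list_head *head)
    {
            struct item *it, *next;

            /*
             * Equivalent to list_for_each_safe() plus a list_entry() per
             * node, but without the extra position pointers.
             */
            list_for_each_entry_safe(it, next, head, link) {
                    list_del(&it->link);
                    kfree(it);
            }
    }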
2016-01-14  mm/readahead.c, mm/vmscan.c: use lru_to_page instead of list_to_page  (Geliang Tang, 3 files, -7/+5)
list_to_page() in readahead.c is the same as lru_to_page() in vmscan.c. So I move lru_to_page to internal.h and drop list_to_page(). Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
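Both helpers boil down to the same list_entry() expression on struct page, roughly:

    /* Shared definition, now kept in mm/internal.h (roughly): */
    #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))

    /* Typical use: take the page at the tail of an LRU/readahead list */
    struct page *page = lru_to_page(&page_list);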
2016-01-14  mm/compaction.c: __compact_pgdat() code cleanup  (Joonsoo Kim, 1 file, -6/+7)
This patch uses is_via_compact_memory() to distinguish compaction triggered via sysfs or sysctl from regular invocations. It also reduces the indentation around compaction_defer_reset() by filtering these cases first, before checking the watermark. There is no functional change. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Yaowei Bai <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/swapfile.c: use list_{next,first}_entry  (Geliang Tang, 1 file, -10/+4)
To make the intention clearer, use list_{next,first}_entry instead of list_entry(). Signed-off-by: Geliang Tang <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
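The conversion follows the usual pattern; the struct layout below is a simplified illustration, not the exact swapfile.c code:

    struct swap_extent {
            struct list_head list;
    };

    struct list_head *head;
    struct swap_extent *se, *next;

    /* Open-coded form */
    se   = list_entry(head->next, struct swap_extent, list);
    next = list_entry(se->list.next, struct swap_extent, list);

    /* Equivalent, but the intent is explicit */
    se   = list_first_entry(head, struct swap_extent, list);
    next = list_next_entry(se, list);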
2016-01-14  mm/memblock: introduce for_each_memblock_type()  (Alexander Kuleshov, 1 file, -16/+16)
We already have the for_each_memblock() macro in <linux/memblock.h> which provides the ability to iterate over memblock regions of a known type. However, for_each_memblock() does not let us pass a pointer to a struct memblock_type; instead we have to pass the name of the type. This patch introduces a new macro, for_each_memblock_type(), which lets us iterate over the regions of a memblock type when all we have is a pointer to it. Signed-off-by: Alexander Kuleshov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
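A rough illustration of the difference (not the verbatim macro definitions):

    struct memblock_region *reg;
    struct memblock_type *type = &memblock.memory;  /* any memblock_type */
    int idx;                                        /* used by the new macro */

    /* Old helper: the type must be named at the call site */
    for_each_memblock(memory, reg) {
            /* ... */
    }

    /* New helper: the type is passed as a pointer, so callers that only
     * hold a struct memblock_type *, e.g. memblock_add_range(), can use it */
    for_each_memblock_type(type, reg) {
            /* walks type->regions[0 .. type->cnt) */
    }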
2016-01-14  mm/memblock: remove rgnbase and rgnsize variables  (Alexander Kuleshov, 1 file, -6/+3)
Remove rgnbase and rgnsize variables from memblock_overlaps_region(). We use these variables only for passing to the memblock_addrs_overlap() function and that's all. Let's remove them. Signed-off-by: Alexander Kuleshov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, oom: give __GFP_NOFAIL allocations access to memory reserves  (Michal Hocko, 1 file, -1/+14)
__GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement and as such it also deserves a special treatment when the system is OOM. The primary problem here is that the allocation request might have come with some locks held and the oom victim might be blocked on the same locks. This is basically an OOM deadlock situation. This patch tries to reduce the risk of such deadlocks by giving __GFP_NOFAIL allocations a special treatment and letting them dive into memory reserves after oom killer invocation. This should help them to make progress and release resources they are holding. The OOM victim should compensate for the reserves consumption. Signed-off-by: Michal Hocko <[email protected]> Suggested-by: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_alloc.c: use list_for_each_entry in mark_free_pages()  (Geliang Tang, 1 file, -5/+5)
Use list_for_each_entry instead of list_for_each + list_entry to simplify the code. Signed-off-by: Geliang Tang <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_alloc.c: use list_{first,last}_entry instead of list_entry  (Geliang Tang, 1 file, -12/+11)
To make the intention clearer, use list_{first,last}_entry instead of list_entry. Signed-off-by: Geliang Tang <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_alloc.c: remove unnecessary parameter from __rmqueue  (Mel Gorman, 1 file, -3/+3)
Commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") added an unnecessary and unused parameter to __rmqueue. It was a parameter that was used in an earlier version of the patch and then left behind. This patch cleans it up. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm: page_alloc: generalize the dirty balance reserve  (Johannes Weiner, 2 files, -20/+15)
The dirty balance reserve that dirty throttling has to consider is merely memory not available to userspace allocations. There is nothing writeback-specific about it. Generalize the name so that it's reusable outside of that context. Signed-off-by: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm: allow GFP_{FS,IO} for page_cache_read page cache allocation  (Michal Hocko, 2 files, -5/+21)
page_cache_read has historically been using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent fs recursion issues. page_cache_read is called from the vm_operations_struct::fault() context during the page fault. This context doesn't need the reclaim protection normally. ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking the generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim. There is simply no reason to deliberately use a weaker allocation context when __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection might be even harmful. There is a push to fail GFP_NOFS allocations rather than loop within the allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and ends up in pagefault_out_of_memory. We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on the NOFS semantics of the stored gfp_mask. The mask is also a property of the inode, so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow the fs layer to override it should the callback need to be called with a different allocation context. Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because it does not hold only reclaim protection flags; it might also contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect those. Signed-off-by: Michal Hocko <[email protected]> Reported-by: Tetsuo Handa <[email protected]> Acked-by: Jan Kara <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
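A sketch of the idea (the helper name is illustrative): the fault path derives its gfp_mask from the mapping but re-allows __GFP_FS | __GFP_IO, and a filesystem can override the value stored in struct vm_fault from its fault handler if it really needs a different context:

    static gfp_t fault_gfp_mask(struct vm_area_struct *vma)
    {
            if (vma->vm_file)
                    /* keep zone/movability bits, re-allow fs/io reclaim */
                    return mapping_gfp_mask(vma->vm_file->f_mapping) |
                           __GFP_FS | __GFP_IO;
            return GFP_KERNEL;      /* anonymous mappings: no restrictions */
    }

    /* ...and the fault setup stores it for page_cache_read() and friends: */
    vmf.gfp_mask = fault_gfp_mask(vma);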
2016-01-14  mm/compaction: improve comment for compact_memory tunable knob handler  (Yaowei Bai, 1 file, -1/+4)
sysctl_compaction_handler() is the handler function for the compact_memory tunable knob under /proc/sys/vm; add the missing knob name to its comment to make the comment more accurate. No functional change. Signed-off-by: Yaowei Bai <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm: mmap: add new /proc tunable for mmap_base ASLR  (Daniel Cashman, 1 file, -0/+12)
Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds apiece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally 3-4 bits less than the number of bits in the user-space accessible virtual address space. When deciding whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off.
Signed-off-by: Daniel Cashman <[email protected]> Cc: Russell King <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Don Zickus <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Heinrich Schuchardt <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mark Salyzyn <[email protected]> Cc: Jeff Vander Stoep <[email protected]> Cc: Nick Kralevich <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Hector Marco-Gisbert <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
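Roughly how such a tunable feeds into the per-exec randomization; treat the identifiers below as illustrative rather than the exact arch code:

    /* mmap_rnd_bits is clamped between the per-arch Kconfig minimum and
     * maximum, and may be raised at runtime via the new sysctl. */
    unsigned long example_mmap_rnd(void)
    {
            unsigned long rnd;

            rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
            return rnd << PAGE_SHIFT;       /* offset applied to mmap_base */
    }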
2016-01-14  mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region  (Piotr Kwapulinski, 1 file, -3/+0)
The following flag comparison in mmap_region makes no sense: if (!(vm_flags & MAP_FIXED)) return -ENOMEM; The condition is always false and thus the above "return -ENOMEM" is never executed. The vm_flags must not be compared with MAP_FIXED flag. The vm_flags may only be compared with VM_* flags. MAP_FIXED has the same value as VM_MAYREAD. Hitting the rlimit is a slow path and find_vma_intersection should realize that there is no overlapping VMA for !MAP_FIXED case pretty quickly. Signed-off-by: Piotr Kwapulinski <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Chris Metcalf <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, vmscan: consider isolated pages in zone_reclaimable_pages  (Michal Hocko, 1 file, -2/+4)
zone_reclaimable_pages counts how many pages are reclaimable in the given zone. This currently includes all pages on file lrus and anon lrus if there is available swap storage. We do not consider NR_ISOLATED_{ANON,FILE} counters though, which is not correct because these counters reflect temporarily isolated pages which are still reclaimable: they either get back to their LRU or get freed by page reclaim or page migration. The number of these pages might be sufficiently high to confuse users of zone_reclaimable_pages (e.g. mbind can migrate large ranges of memory at once). Signed-off-by: Michal Hocko <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  memcg: do not allow to disable tcp accounting after limit is set  (Vladimir Davydov, 1 file, -1/+1)
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former is never cleared while the latter can be cleared by unsetting the limit. This makes it possible to disable tcp socket accounting for new sockets after it was enabled, by writing -1 to memory.kmem.tcp.limit_in_bytes, while still guaranteeing that the memcg_socket_limit_enabled static key will be decremented on memcg destruction. This functionality looks dubious, because it is not clear what a use case would be. By enabling tcp accounting a user accepts the price. If they then find the performance degradation unacceptable, they can always restart their workload with tcp accounting disabled. It does not seem there is any need to flip it while the workload is running. Besides, it contradicts how the kmem accounting API works: writing whatever to memory.kmem.limit_in_bytes enables kmem accounting for the cgroup in question, after which it cannot be disabled. Therefore one might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just enables socket accounting w/o limiting it, which might be useful by itself, but it isn't true. Since this API peculiarity is not documented anywhere, I propose to drop it. This will allow us to simplify the code by dropping cg_proto->flags. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  vmscan: do not force-scan file lru if its absolute size is small  (Vladimir Davydov, 1 file, -3/+9)
We assume there is enough inactive page cache if the size of inactive file lru is greater than the size of active file lru, in which case we force-scan file lru ignoring anonymous pages. While this logic works fine when there are plenty of page cache pages, it fails if the size of file lru is small (several MB): in this case (lru_size >> prio) will be 0 for normal scan priorities; as a result, if inactive file lru happens to be larger than active file lru, anonymous pages of a cgroup will never get evicted unless the system experiences severe memory pressure, even if there are gigabytes of unused anonymous memory there, which is unfair with respect to other cgroups, whose workloads might be page cache oriented. This patch attempts to fix this by elaborating the "enough inactive page cache" check: it makes it not only check that inactive lru size > active lru size, but also that we will scan something from the cgroup at the current scan priority. If these conditions do not hold, we proceed to SCAN_FRACT as usual. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
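A sketch of the tightened check in get_scan_count(), with approximate variable names (not the verbatim diff):

    /*
     * Force file-only scanning only if the inactive file list is both
     * larger than the active one and big enough to actually yield pages
     * at the current scan priority; otherwise fall through to the normal
     * SCAN_FRACT anon/file balancing.
     */
    if (inactive_file > active_file && (inactive_file >> sc->priority)) {
            scan_balance = SCAN_FILE;
            goto out;
    }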
2016-01-14  mm, vmalloc: remove VM_VPAGES  (David Rientjes, 1 file, -6/+2)
VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when reading /proc/vmallocinfo. [[email protected]: remove VM_VPAGES reference via kvfree()] Signed-off-by: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, thp: use list_first_entry_or_null()  (Geliang Tang, 1 file, -6/+3)
Simplify the code with list_first_entry_or_null(). Signed-off-by: Geliang Tang <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, shmem: add internal shmem resident memory accounting  (Jerome Marchand, 3 files, -31/+16)
Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [[email protected]: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings  (Vlastimil Babka, 1 file, -27/+38)
Following the previous patch, further reduction of /proc/pid/smaps cost is possible for private writable shmem mappings with unpopulated areas where the page walk invokes the .pte_hole function. We can use the radix tree iterator for each such area instead of calling find_get_entry() in a loop. This is possible at the extra maintenance cost of introducing another shmem function, shmem_partial_swap_usage(). To demonstrate the difference, I have measured this on a process that creates a private writable 2GB mapping of a partially swapped out /dev/shm/file (which cannot employ the optimizations from the previous patch) and doesn't populate it at all. I timed how long it takes to cat /proc/pid/smaps of this process 100 times. Before this patch: real 0m3.831s user 0m0.180s sys 0m3.212s. After this patch: real 0m1.176s user 0m0.180s sys 0m0.684s. The time is similar to the case where a radix tree iterator is employed on the whole mapping. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm, proc: reduce cost of /proc/pid/smaps for shmem mappings  (Vlastimil Babka, 1 file, -0/+70)
The previous patch has improved swap accounting for shmem mapping, which however made /proc/pid/smaps more expensive for shmem mappings, as we consult the radix tree for each pte_none entry, so the overall complexity is O(n*log(n)). We can reduce this significantly for mappings that cannot contain COWed pages, because then we can either use the statistics that the shmem object itself tracks (if the mapping contains the whole object, or the swap usage of the whole object is zero), or use the radix tree iterator, which is much more effective than repeated find_get_entry() calls. This patch therefore introduces a function shmem_swap_usage(vma) and makes /proc/pid/smaps use it when possible. Only for writable private mappings of shmem objects (i.e. tmpfs files) with the shmem object itself (partially) swapped out do we have to resort to the find_get_entry() approach. Hopefully such mappings are relatively uncommon. To demonstrate the difference, I have measured this on a process that creates a 2GB mapping and dirties single pages with a stride of 2MB, and timed how long it takes to cat /proc/pid/smaps of this process 100 times. Private writable mapping of a /dev/shm/file (the most complex case): real 0m3.831s user 0m0.180s sys 0m3.212s. Shared mapping of an almost full mapping of a partially swapped /dev/shm/file (which needs to employ the radix tree iterator): real 0m1.351s user 0m0.096s sys 0m0.768s. Same, but with /dev/shm/file not swapped (so no radix tree walk needed): real 0m0.935s user 0m0.128s sys 0m0.344s. Private anonymous mapping: real 0m0.949s user 0m0.116s sys 0m0.348s. The cost is now much closer to the private anonymous mapping case, unless the shmem mapping is private and writable. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
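A simplified sketch of the radix-tree walk behind this approach (locking/slot-retry details and the exact helper signature omitted; the function name below is made up for illustration): swapped-out shmem pages are stored as exceptional entries in the mapping's radix tree, so counting them needs one walk instead of a find_get_entry() per page.

    static unsigned long count_partial_swap(struct address_space *mapping,
                                            pgoff_t start, pgoff_t end)
    {
            struct radix_tree_iter iter;
            unsigned long swapped = 0;
            void **slot;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                    if (iter.index >= end)
                            break;
                    if (radix_tree_exceptional_entry(radix_tree_deref_slot(slot)))
                            swapped++;      /* a swap entry, not a present page */
            }
            rcu_read_unlock();

            return swapped << PAGE_SHIFT;   /* bytes of swap used in the range */
    }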
2016-01-14  mm/mmzone.c: memmap_valid_within() can be boolean  (Yaowei Bai, 1 file, -4/+4)
Make memmap_valid_within return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/vmalloc.c: use list_{next,first}_entry  (Geliang Tang, 1 file, -5/+4)
To make the intention clearer, use list_{next,first}_entry instead of list_entry. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_alloc.c: do not loop over ALLOC_NO_WATERMARKS without triggering reclaim  (Michal Hocko, 1 file, -14/+18)
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if __GFP_NOFAIL is requested. This is fragile because we are basically relying on somebody else to make the reclaim (be it the direct reclaim or OOM killer) for us. The caller might be holding resources (e.g. locks) which block other reclaimers from making any progress, for example. Remove the retry loop and rely on __alloc_pages_slowpath to invoke all allowed reclaim steps and retry logic. We have to be careful about __GFP_NOFAIL allocations from the PF_MEMALLOC context even though this is a very bad idea to begin with because no progress can be guaranteed at all. We shouldn't break the __GFP_NOFAIL semantics here though. It could be argued that this is essentially GFP_NOWAIT context which we do not support, but PF_MEMALLOC is much harder to check for in existing users because the allocation might happen deep down a code path, long after the flag was set, so we cannot really rule out that some kernel path triggers this combination. Signed-off-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_alloc.c: get rid of __alloc_pages_high_priority()  (Michal Hocko, 1 file, -27/+9)
__alloc_pages_high_priority doesn't do anything special other than it calls get_page_from_freelist and loops around GFP_NOFAIL allocation until it succeeds. It would be better if the first part was done in __alloc_pages_slowpath where we modify the zonelist because this would be easier to read and understand. Opencoding the function into its only caller allows us to simplify it a bit as well. This patch doesn't introduce any functional changes. [[email protected]: coding-style fixes] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/zonelist: enumerate zonelists array index  (Yaowei Bai, 1 file, -5/+4)
Hardcoding the index into the zonelists array in gfp_zonelist() is not a good idea; enumerate it to improve readability. No functional change. [[email protected]: coding-style fixes] [[email protected]: fix CONFIG_NUMA=n build] [[email protected]: fix warning in comparing enumerator] Signed-off-by: Yaowei Bai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
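Roughly the shape of the change (the enumerator names follow the patch description, but treat the snippet as illustrative):

    enum {
            ZONELIST_FALLBACK,              /* zonelist with fallback */
    #ifdef CONFIG_NUMA
            ZONELIST_NOFALLBACK,            /* __GFP_THISNODE: no fallback */
    #endif
            MAX_ZONELISTS
    };

    static inline int gfp_zonelist(gfp_t flags)
    {
    #ifdef CONFIG_NUMA
            if (unlikely(flags & __GFP_THISNODE))
                    return ZONELIST_NOFALLBACK;
    #endif
            return ZONELIST_FALLBACK;       /* was a hardcoded 0 */
    }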
2016-01-14  mm/memblock.c: memblock_is_memory()/reserved() can be boolean  (Yaowei Bai, 1 file, -2/+2)
Make memblock_is_memory() and memblock_is_reserved return bool to improve readability due to these particular functions only using either one or zero as their return value. No functional change. Signed-off-by: Yaowei Bai <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm: change mm_vmscan_lru_shrink_inactive() proto types  (yalin wang, 1 file, -5/+2)
Move node_id, zone_idx and the shrink flags into the trace function, so that we don't need to calculate these arguments if the trace is disabled, and so that this function has fewer arguments. Signed-off-by: yalin wang <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/cma: always check which page caused allocation failure  (Joonsoo Kim, 1 file, -3/+20)
Now, we have a tracepoint in test_pages_isolated() to report the pfn which cannot be isolated. But, in alloc_contig_range(), some error paths don't call test_pages_isolated(), so it's still hard to know the exact pfn that causes allocation failure. This patch changes that by calling test_pages_isolated() in almost every error path. In the allocation failure case, some overhead is added by this change, but allocation failure is a really rare event so it would not matter. In the fatal-signal-pending case, we don't call test_pages_isolated() because this failure is an intentional one. There was a bogus outer_start problem due to unchecked buddy order and this patch also fixes it. Before this patch, it didn't matter, because the end result was the same. But, after this patch, the tracepoint will report the failed pfn, so it should be accurate. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_isolation.c: add new tracepoint, test_pages_isolated  (Joonsoo Kim, 1 file, -0/+5)
CMA allocation should be guaranteed to succeed, but sometimes it can fail in the current implementation. To track down the problem, we need to know which page is problematic, and this new tracepoint will report it. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/page_isolation.c: return last tested pfn rather than failure indicator  (Joonsoo Kim, 1 file, -7/+6)
This is preparation step to report test failed pfn in new tracepoint to analyze cma allocation failure problem. There is no functional change in this patch. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/mempolicy.c: convert the shared_policy lock to a rwlock  (Nathan Zimmer, 1 file, -13/+17)
When running the SPECint_rate gcc on some very large boxes it was noticed that the system was spending lots of time in mpol_shared_policy_lookup(). The gamess benchmark can also show it and is what I mostly used to chase down the issue since the setup for that I found to be easier. To be clear the binaries were on tmpfs because of disk I/O requirements. We then used text replication to avoid icache misses and having all the copies banging on the memory where the instruction code resides. This results in us hitting a bottleneck in mpol_shared_policy_lookup() since lookup is serialised by the shared_policy lock. I have only reproduced this on very large (3k+ cores) boxes. The problem starts showing up at just a few hundred ranks, getting worse until it threatens to livelock once it gets large enough. For example on the gamess benchmark at 128 ranks this area consumes only ~1% of time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over 90%. To alleviate the contention in this area I converted the spinlock to an rwlock. This allows a large number of lookups to happen simultaneously. The results were quite good, reducing this consumption at max ranks to around 2%. [[email protected]: tidy up code comments] Signed-off-by: Nathan Zimmer <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Nadia Yvette Chambers <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
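A sketch of the conversion (struct layout abridged and the lookup body elided): readers can now run in parallel; only updates to the shared policy tree need the writer side.

    struct shared_policy {
            struct rb_root root;
            rwlock_t lock;                  /* was: spinlock_t lock */
    };

    struct mempolicy *example_shared_policy_lookup(struct shared_policy *sp,
                                                   unsigned long idx)
    {
            struct mempolicy *pol = NULL;

            if (!sp->root.rb_node)
                    return NULL;

            read_lock(&sp->lock);           /* was: spin_lock() */
            /* rb-tree walk; take a reference on the matching policy */
            read_unlock(&sp->lock);
            return pol;
    }

    /* insert/replace paths switch to write_lock(&sp->lock) instead */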
2016-01-14  mm/vmscan.c: change trace_mm_vmscan_writepage() proto type  (yalin wang, 1 file, -1/+1)
Move trace_reclaim_flags() into the trace function, so that we don't need to calculate these flags if the trace is disabled. Signed-off-by: yalin wang <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/mmap.c: remove redundant local variables for may_expand_vm()  (Chen Gang, 1 file, -8/+1)
Simplify may_expand_vm(). [[email protected]: further simplification, per Naoya Horiguchi] Signed-off-by: Chen Gang <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/mlock.c: drop unneeded initialization in munlock_vma_pages_range()  (Alexey Klimov, 1 file, -1/+1)
The page pointer is initialized to NULL but is then reinitialized by follow_page_mask() before use. Drop the useless initialization of the page pointer at the beginning of the loop. Signed-off-by: Alexey Klimov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  kmemcg: account certain kmem allocations to memcg  (Vladimir Davydov, 3 files, -4/+6)
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [[email protected]: coding-style fixes] Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  vmalloc: allow to account vmalloc to memcg  (Vladimir Davydov, 1 file, -3/+3)
Make vmalloc family functions allocate vmalloc area pages with alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted to memcg. This is needed, at least, to account alloc_fdmem allocations. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  slab: add SLAB_ACCOUNT flag  (Vladimir Davydov, 4 files, -4/+14)
Currently, if we want to account all objects of a particular kmem cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed to kmem_cache_create will force accounting for every allocation from this cache even if __GFP_ACCOUNT is not passed. This patch does not make any of the existing caches use this flag - it will be done later in the series. Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and hence cannot have different sets of SLAB_* flags. Thus using this flag will probably reduce the number of merged slabs even if kmem accounting is not used (only compiled in). Signed-off-by: Vladimir Davydov <[email protected]> Suggested-by: Tejun Heo <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
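For illustration, the two ways of getting an allocation accounted after this series; the cache name, object type and sizes below are made up:

    struct example_obj { long payload; };          /* made-up object type */
    static struct kmem_cache *obj_cache;

    static int __init example_init(void)
    {
            void *buf;

            /* Per-cache: every object from this cache is charged to the
             * allocating task's memcg; no per-call flag needed. */
            obj_cache = kmem_cache_create("example_objs",
                                          sizeof(struct example_obj), 0,
                                          SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
                                          NULL);

            /* Per-allocation: only this particular allocation is accounted. */
            buf = kmalloc(256, GFP_KERNEL | __GFP_ACCOUNT);
            kfree(buf);

            return obj_cache ? 0 : -ENOMEM;
    }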
2016-01-14  memcg: only account kmem allocations marked as __GFP_ACCOUNT  (Vladimir Davydov, 1 file, -1/+2)
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, falsely accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-list policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14Revert "gfp: add __GFP_NOACCOUNT"Vladimir Davydov1-2/+1
This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT"). Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, falsely accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So it was decided to switch to the white-list policy. This patch reverts bits introducing the black-list policy. The white-list policy will be introduced later in the series. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/slab.c: add a helper function get_first_slab  (Geliang Tang, 1 file, -18/+21)
Add a new helper function get_first_slab() that gets the first slab from a kmem_cache_node. Signed-off-by: Geliang Tang <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/slab.c: use list_for_each_entry in cache_flusharray  (Geliang Tang, 1 file, -7/+2)
Simplify the code with list_for_each_entry(). Signed-off-by: Geliang Tang <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-01-14  mm/slab.c: use list_first_entry_or_null()  (Geliang Tang, 1 file, -12/+12)
Simplify the code with list_first_entry_or_null(). Signed-off-by: Geliang Tang <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>