aboutsummaryrefslogtreecommitdiff
path: root/mm/memcontrol.c
AgeCommit message (Collapse)AuthorFilesLines
2015-02-11mm: memcontrol: default hierarchy interface for memoryJohannes Weiner1-11/+218
Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. This interface versioning allows us to address fundamental design issues in the existing memory cgroup interface, further explained below. The old interface will be maintained indefinitely, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually. The control files are thus: - memory.current shows the current consumption of the cgroup and its descendants, in bytes. - memory.low configures the lower end of the cgroup's expected memory consumption range. The kernel considers memory below that boundary to be a reserve - the minimum that the workload needs in order to make forward progress - and generally avoids reclaiming it, unless there is an imminent risk of entering an OOM situation. - memory.high configures the upper end of the cgroup's expected memory consumption range. A cgroup whose consumption grows beyond this threshold is forced into direct reclaim, to work off the excess and to throttle new allocations heavily, but is generally allowed to continue and the OOM killer is not invoked. - memory.max configures the hard maximum amount of memory that the cgroup is allowed to consume before the OOM killer is invoked. - memory.events shows event counters that indicate how often the cgroup was reclaimed while below memory.low, how often it was forced to reclaim excess beyond memory.high, how often it hit memory.max, and how often it entered OOM due to memory.max. This allows users to identify configuration problems when observing a degradation in workload performance. An overcommitted system will have an increased rate of low boundary breaches, whereas increased rates of high limit breaches, maximum hits, or even OOM situations will indicate internally overcommitted cgroups. For existing users of memory cgroups, the following deviations from the current interface are worth pointing out and explaining: - The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft limit reclaim pass is so aggressive that it not just introduces high allocation latencies into the system, but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it and all its ancestors are below their low boundaries, which makes delegation of subtrees possible. Secondly, new cgroups have no reserve per default and in the common case most cgroups are eligible for the preferred reclaim pass. This allows the new low boundary to be efficiently implemented with just a minor addition to the generic reclaim code, without the need for out-of-band data structures and reclaim passes. Because the generic reclaim code considers all cgroups except for the ones running low in the preferred first reclaim pass, overreclaim of individual groups is eliminated as well, resulting in much better overall workload performance. - The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources. The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found. In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it's mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications. - The original control file names are unwieldy and inconsistent in many different ways. For example, the upper boundary hit count is exported in the memory.failcnt file, but an OOM event count has to be manually counted by listening to memory.oom_control events, and lower boundary / soft limit events have to be counted by first setting a threshold for that value and then counting those events. Also, usage and limit files encode their units in the filename. That makes the filenames very long, even though this is not information that a user needs to be reminded of every time they type out those names. To address these naming issues, as well as to signal clearly that the new interface carries a new configuration model, the naming conventions in it necessarily differ from the old interface. - The original limit files indicate the state of an unset limit with a very high number, and a configured limit can be unset by echoing -1 into those files. But that very high number is implementation and architecture dependent and not very descriptive. And while -1 can be understood as an underflow into the highest possible value, -2 or -10M etc. do not work, so it's not inconsistent. memory.low, memory.high, and memory.max will use the string "infinity" to indicate and set the highest possible value. [[email protected]: use seq_puts() for basic strings] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: page_counter: pull "-1" handling out of page_counter_memparse()Johannes Weiner1-2/+2
The unified hierarchy interface for memory cgroups will no longer use "-1" to mean maximum possible resource value. In preparation for this, make the string an argument and let the caller supply it. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11memcg: add BUILD_BUG_ON() for string tablesGreg Thelen1-0/+4
Use BUILD_BUG_ON() to compile assert that memcg string tables are in sync with corresponding enums. There aren't currently any issues with these tables. This is just defensive. Signed-off-by: Greg Thelen <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11vmscan: force scan offline memory cgroupsVladimir Davydov1-0/+14
Since commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups") pages charged to a memory cgroup are not reparented when the cgroup is removed. Instead, they are supposed to be reclaimed in a regular way, along with pages accounted to online memory cgroups. However, an lruvec of an offline memory cgroup will sooner or later get so small that it will be scanned only at low scan priorities (see get_scan_count()). Therefore, if there are enough reclaimable pages in big lruvecs, pages accounted to offline memory cgroups will never be scanned at all, wasting memory. Fix this by unconditionally forcing scanning dead lruvecs from kswapd. [[email protected]: fix build] Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-11mm: memcontrol: track move_lock state internallyJohannes Weiner1-29/+39
The complexity of memcg page stat synchronization is currently leaking into the callsites, forcing them to keep track of the move_lock state and the IRQ flags. Simplify the API by tracking it in the memcg. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-10memcg: zap memcg_slab_caches and memcg_slab_mutexVladimir Davydov1-141/+15
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to the given cgroup. Currently, it is only used on css free in order to destroy all caches corresponding to the memory cgroup being freed. The list is protected by memcg_slab_mutex. The mutex is also used to protect kmem_cache->memcg_params->memcg_caches arrays and synchronizes kmem_cache_destroy vs memcg_unregister_all_caches. However, we can perfectly get on without these two. To destroy all caches corresponding to a memory cgroup, we can walk over the global list of kmem caches, slab_caches, and we can do all the synchronization stuff using the slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid of the memcg_slab_caches and memcg_slab_mutex. Apart from this nice cleanup, it also: - assures that rcu_barrier() is called once at max when a root cache is destroyed or a memory cgroup is freed, no matter how many caches have SLAB_DESTROY_BY_RCU flag set; - fixes the race between kmem_cache_destroy and kmem_cache_create that exists, because memcg_cleanup_cache_params, which is called from kmem_cache_destroy after checking that kmem_cache->refcount=0, releases the slab_mutex, which gives kmem_cache_create a chance to make an alias to a cache doomed to be destroyed. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-10memcg: zap memcg_name argument of memcg_create_kmem_cacheVladimir Davydov1-4/+1
Instead of passing the name of the memory cgroup which the cache is created for in the memcg_name_argument, let's obtain it immediately in memcg_create_kmem_cache. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-10memcg: zap __memcg_{charge,uncharge}_slabVladimir Davydov1-18/+3
They are simple wrappers around memcg_{charge,uncharge}_kmem, so let's zap them and call these functions directly. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-10mm: remove rest usage of VM_NONLINEAR and pte_file()Kirill A. Shutemov1-5/+2
One bit in ->vm_flags is unused now! Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-05memcg, shmem: fix shmem migration to use lrucareMichal Hocko1-1/+1
It has been reported that 965GM might trigger VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage) in mem_cgroup_migrate when shmem wants to replace a swap cache page because of shmem_should_replace_page (the page is allocated from an inappropriate zone). shmem_replace_page expects that the oldpage is not on LRU list and calls mem_cgroup_migrate without lrucare. This is obviously incorrect because swapcache pages might be on the LRU list (e.g. swapin readahead page). Fix this by enabling lrucare for the migration in shmem_replace_page. Also clarify that lrucare should be used even if one of the pages might be on LRU list. The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even without that the migration code might leave the old page on an inappropriate memcg' LRU which is not that critical because the page would get removed with its last reference but it is still confusing. Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") Signed-off-by: Michal Hocko <[email protected]> Reported-by: Chris Wilson <[email protected]> Reported-by: Dave Airlie <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: <[email protected]> [3.17+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-01-26memcg: remove extra newlines from memcg oom kill logGreg Thelen1-2/+2
Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra newlines to memcg oom kill log messages. This makes dmesg hard to read and parse. The issue affects 3.15+. Example: Task in /t <<< extra #1 killed as a result of limit of /t <<< extra #2 memory: usage 102400kB, limit 102400kB, failcnt 274712 Remove the extra newlines from memcg oom kill messages, so the messages look like: Task in /t killed as a result of limit of /t memory: usage 102400kB, limit 102400kB, failcnt 240649 Fixes: e61734c55c24 ("cgroup: remove cgroup->name") Signed-off-by: Greg Thelen <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-01-08memcg: fix destination cgroup leak on task charges migrationVladimir Davydov1-12/+0
We are supposed to take one css reference per each memory page and per each swap entry accounted to a memory cgroup. However, during task charges migration we take a reference to the destination cgroup twice per each swap entry: first in mem_cgroup_do_precharge()->try_charge() and then in mem_cgroup_move_swap_account(), permanently leaking the destination cgroup. The hunk taking the second reference seems to be a leftover from the pre-00501b531c472 ("mm: memcontrol: rewrite charge API") era. Remove it to fix the leak. Fixes: e8ea14cc6ead (mm: memcontrol: take a css reference for each charged page) Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-01-08mm: memcontrol: switch soft limit default back to infinityJohannes Weiner1-1/+4
Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters") accidentally switched the soft limit default from infinity to zero, which turns all memcgs with even a single page into soft limit excessors and engages soft limit reclaim on all of them during global memory pressure. This makes global reclaim generally more aggressive, but also inverts the meaning of existing soft limit configurations where unset soft limits are usually more generous than set ones. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate()Rickard Strandqvist1-5/+2
Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON() to the beginning of memcg_stat_show(). This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: fix possible use-after-free in memcg_kmem_get_cache()Vladimir Davydov1-35/+16
Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed: CPU0 CPU1 ---- ---- [ current=@t @mc->memcg_params->nr_pages=0 ] kmem_cache_alloc(@c): call memcg_kmem_get_cache(@c); proceed to allocation from @mc: alloc a page for @mc: ... move @t from @memcg destroy @memcg: mem_cgroup_css_offline(@memcg): memcg_unregister_all_caches(@memcg): kmem_cache_destroy(@mc) add page to @mc We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13mm/memcontrol.c: fix defined but not used compiler warningMichele Curti1-1/+2
test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so move it into the compiler if statement [[email protected]: clean up layout] Signed-off-by: Michele Curti <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13oom: don't assume that a coredumping thread will exit soonOleg Nesterov1-1/+1
oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [[email protected]: add comment] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Cong Wang <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache()Zhang Zhen1-2/+1
The gfp was passed in but never used in this function. Signed-off-by: Zhang Zhen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: turn memcg_kmem_skip_account into a bit fieldVladimir Davydov1-33/+2
It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on the task_struct. Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to set/clear the flag inline. Regarding the overwhelming comment to the helpers, which is removed by this patch too, we already have a compact yet accurate explanation in memcg_schedule_cache_create, no need in yet another one. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cacheVladimir Davydov1-28/+0
__memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if the cgroup's kmem cache doesn't exist), because kmalloc may call __memcg_kmem_get_cache internally again. To avoid the recursion, we use the task_struct->memcg_kmem_skip_account flag. However, there's no need checking the flag in memcg_kmem_newpage_charge, because there's no way how this function could result in recursion, if called from memcg_kmem_get_cache. So let's remove the redundant code. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: zap kmem_account_flagsVladimir Davydov1-21/+10
The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of having a separate flags field. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: do not abuse memcg_kmem_skip_accountVladimir Davydov1-7/+0
task_struct->memcg_kmem_skip_account was initially introduced to avoid recursion during kmem cache creation: memcg_kmem_get_cache, which is called by kmem_cache_alloc to determine the per-memcg cache to account allocation to, may issue lazy cache creation if the needed cache doesn't exist, which means issuing yet another kmem_cache_alloc. We can't just pass a flag to the nested kmem_cache_alloc disabling kmem accounting, because there are hidden allocations, e.g. in INIT_WORK. So we introduced a flag on the task_struct, memcg_kmem_skip_account, making memcg_kmem_get_cache return immediately. By its nature, the flag may also be used to disable accounting for allocations shared among different cgroups, and currently it is used this way in memcg_activate_kmem. Using it like this looks like abusing it to me. If we want to disable accounting for some allocations (which we will definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG flag in order to blacklist/whitelist some allocations. For now, let's simply remove memcg_stop/resume_kmem_account from memcg_activate_kmem. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge}Vladimir Davydov1-2/+2
We already assured the current task has mm in memcg_kmem_should_charge, no need to double check. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-13memcg: __mem_cgroup_free: remove stale disarm_static_keys commentVladimir Davydov1-11/+0
cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10Merge branch 'akpm' (patchbomb from Andrew)Linus Torvalds1-1201/+505
Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug upadtes - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk upates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <[email protected]>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...
2014-12-10mm: move page->mem_cgroup bad page handling into generic codeJohannes Weiner1-15/+0
Now that the external page_cgroup data structure and its lookup is gone, let the generic bad_page() check for page->mem_cgroup sanity. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: page_cgroup: rename file to mm/swap_cgroup.cJohannes Weiner1-1/+1
Now that the external page_cgroup data structure and its lookup is gone, the only code remaining in there is swap slot accounting. Rename it and move the conditional compilation into mm/Makefile. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: embed the memcg pointer directly into struct pageJohannes Weiner1-89/+35
Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [[email protected]: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: David S. Miller <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Joonsoo Kim <[email protected]> Acked-by: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove stale page_cgroup_lock commentJohannes Weiner1-4/+0
There is no cgroup-specific page lock anymore. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm, memcg: fix potential undefined behaviour in page stat accountingMichal Hocko1-4/+4
Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback page accounting") mem_cgroup_end_page_stat consumes locked and flags variables directly rather than via pointers which might trigger C undefined behavior as those variables are initialized only in the slow path of mem_cgroup_begin_page_stat. Although mem_cgroup_end_page_stat handles parameters correctly and touches them only when they hold a sensible value it is caller which loads a potentially uninitialized value which then might allow compiler to do crazy things. I haven't seen any warning from gcc and it seems that the current version (4.9) doesn't exploit this type undefined behavior but Sasha has reported the following: UBSan: Undefined behaviour in mm/rmap.c:1084:2 load of value 255 is not a valid value for type '_Bool' CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427 Call Trace: dump_stack (lib/dump_stack.c:52) ubsan_epilogue (lib/ubsan.c:159) __ubsan_handle_load_invalid_value (lib/ubsan.c:482) page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096) unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303) unmap_single_vma (mm/memory.c:1348) unmap_vmas (mm/memory.c:1377 (discriminator 3)) exit_mmap (mm/mmap.c:2837) mmput (kernel/fork.c:659) do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747) do_group_exit (include/linux/sched.h:775 kernel/exit.c:873) SyS_exit_group (kernel/exit.c:901) tracesys_phase2 (arch/x86/kernel/entry_64.S:529) Fix this by using pointer parameters for both locked and flags and be more robust for future compiler changes even though the current code is implemented correctly. Signed-off-by: Michal Hocko <[email protected]> Reported-by: Sasha Levin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()Johannes Weiner1-43/+16
None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()Johannes Weiner1-1/+1
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It makes a lot more sense to check where it's looked up, rather than check for it in __mem_cgroup_same_or_subtree() where it's unexpected. No other callsite passes NULL to __mem_cgroup_same_or_subtree(). Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove bogus NULL check after mem_cgroup_from_task()Johannes Weiner1-3/+2
That function acts like a typecast - unless NULL is passed in, no NULL can come out. task_in_mem_cgroup() callers don't pass NULL tasks. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: shorten the page statistics update slowpathJohannes Weiner1-13/+8
While moving charges from one memcg to another, page stat updates must acquire the old memcg's move_lock to prevent double accounting. That situation is denoted by an increased memcg->move_accounting. However, the charge moving code declares this way too early for now, even before summing up the RSS and pre-allocating destination charges. Shorten this slowpath mode by increasing memcg->move_accounting only right before walking the task's address space with the intention of actually moving the pages. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10memcg: use generic slab iterators for showing slabinfoVladimir Davydov1-21/+4
Let's use generic slab_start/next/stop for showing memcg caches info. In contrast to the current implementation, this will work even if all memcg caches' info doesn't fit into a seq buffer (a page), plus it simply looks neater. Actually, the main reason I do this isn't mere cleanup. I'm going to zap the memcg_slab_caches list, because I find it useless provided we have the slab_caches list, and this patch is a step in this direction. It should be noted that before this patch an attempt to read memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted in -EIO, while after this patch it will silently show nothing except the header, but I don't think it will frustrate anyone. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10memcg: remove mem_cgroup_reclaimable check from soft reclaimVladimir Davydov1-43/+0
mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on *any* NUMA node. However, the only place where it's called is mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific* zone. So the way it is used is incorrect - it will return true even if the cgroup doesn't have pages on the zone we're scanning. I think we can get rid of this check completely, because mem_cgroup_shrink_node_zone(), which is called by mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is equivalent to shrink_lruvec(), which exits almost immediately if the lruvec passed to it is empty. So there's no need to optimize anything here. Besides, we don't have such a check in the general scan path (shrink_zone) either. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: fold mem_cgroup_start_move()/mem_cgroup_end_move()Johannes Weiner1-28/+12
Having these functions and their documentation split out and somewhere makes it harder, not easier, to follow what's going on. Inline them directly where charge moving is prepared and finished, and put an explanation right next to it. Signed-off-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: don't pass a NULL memcg to mem_cgroup_end_move()Johannes Weiner1-7/+3
mem_cgroup_end_move() checks if the passed memcg is NULL, along with a lengthy comment to explain why this seemingly non-sensical situation is even possible. Check in cancel_attach() itself whether can_attach() set up the move context or not, it's a lot more obvious from there. Then remove the check and comment in mem_cgroup_end_move(). Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: inline memcg->move_lock lockingJohannes Weiner1-22/+6
The wrappers around taking and dropping the memcg->move_lock spinlock add nothing of value. Inline the spinlock calls into the callsites. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flagJohannes Weiner1-66/+41
pc->mem_cgroup had to be left intact after uncharge for the final LRU removal, and !PCG_USED indicated whether the page was uncharged. But since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages are uncharged after the final LRU removal. Uncharge can simply clear the pointer and the PCG_USED/PageCgroupUsed sites can test that instead. Because this is the last page_cgroup flag, this patch reduces the memcg per-page overhead to a single pointer. [[email protected]: remove unneeded initialization of `memcg', per Michal] Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove unnecessary PCG_MEM memory charge flagJohannes Weiner1-3/+1
PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"), used to tell whether migration cleared a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the final version, mem_cgroup_migrate() directly uncharges the source page, rendering this distinction unnecessary. Remove it. Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flagJohannes Weiner1-22/+12
Now that mem_cgroup_swapout() fully uncharges the page, every page that is still in use when reaching mem_cgroup_uncharge() is known to carry both the memory and the memory+swap charge. Simplify the uncharge path and remove the PCG_MEMSW page flag accordingly. Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: uncharge pages on swapoutJohannes Weiner1-4/+14
This series gets rid of the remaining page_cgroup flags, thus cutting the memcg per-page overhead down to one pointer. This patch (of 4): mem_cgroup_swapout() is called with exclusive access to the page at the end of the page's lifetime. Instead of clearing the PCG_MEMSW flag and deferring the uncharge, just do it right away. This allows follow-up patches to simplify the uncharge code. Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: micro-optimize mem_cgroup_split_huge_fixup()Michal Hocko1-1/+3
Don't call lookup_page_cgroup() when memcg is disabled. Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10memcg: remove activate_kmem_mutexVladimir Davydov1-19/+5
The activate_kmem_mutex is used to serialize memcg.kmem.limit updates, but we already serialize them with memcg_limit_mutex so let's remove the former. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: clarify migration where old page is unchargedJohannes Weiner1-1/+6
Better explain re-entrant migration when compaction races with reclaim, and also mention swapcache readahead pages as possible uncharged migration sources. Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: update mem_cgroup_page_lruvec() documentationJohannes Weiner1-8/+8
Commit 7512102cf64d ("memcg: fix GPF when cgroup removal races with last exit") added a pc->mem_cgroup reset into mem_cgroup_page_lruvec() to prevent a crash where an anon page gets uncharged on unmap, the memcg is released, and then the final LRU isolation on free dereferences the stale pc->mem_cgroup pointer. But since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"), pages are only uncharged AFTER that final LRU isolation, which guarantees the memcg's lifetime until then. pc->mem_cgroup now only needs to be reset for swapcache readahead pages. Update the comment and callsite requirements accordingly. Signed-off-by: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10memcg: simplify unreclaimable groups handling in soft limit reclaimVladimir Davydov1-22/+4
If we fail to reclaim anything from a cgroup during a soft reclaim pass we want to get the next largest cgroup exceeding its soft limit. To achieve this, we should obviously remove the current group from the tree and then pick the largest group. Currently we have a weird loop instead. Let's simplify it. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: remove synchronous stock draining codeJohannes Weiner1-40/+6
With charge reparenting, the last synchronous stock drainer left. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10mm: memcontrol: continue cache reclaim from offlined groupsJohannes Weiner1-217/+1
On cgroup deletion, outstanding page cache charges are moved to the parent group so that they're not lost and can be reclaimed during pressure on/inside said parent. But this reparenting is fairly tricky and its synchroneous nature has led to several lock-ups in the past. Since c2931b70a32c ("cgroup: iterate cgroup_subsys_states directly") css iterators now also include offlined css, so memcg iterators can be changed to include offlined children during reclaim of a group, and leftover cache can just stay put. There is a slight change of behavior in that charges of deleted groups no longer show up as local charges in the parent. But they are still included in the parent's hierarchical statistics. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>