aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2014-01-21mm/sparse: use memblock apis for early memory allocationsSantosh Shilimkar2-14/+19
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21lib/cpumask.c: use memblock apis for early memory allocationsSantosh Shilimkar1-2/+2
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21lib/swiotlb.c: use memblock apis for early memory allocationsSantosh Shilimkar1-15/+20
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21kernel/power/snapshot.c: use memblock apis for early memory allocationsSantosh Shilimkar1-1/+1
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Acked-by: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/page_alloc.c: use memblock apis for early memory allocationsSantosh Shilimkar1-12/+15
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21kernel/printk/printk.c: use memblock apis for early memory allocationsSantosh Shilimkar1-7/+3
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21init/main.c: use memblock apis for early memory allocationsSantosh Shilimkar1-3/+5
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fall back to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: add memblock memory allocation apisSantosh Shilimkar3-4/+361
Introduce memblock memory allocation APIs which allow to support PAE or LPAE extension on 32 bits archs where the physical memory start address can be beyond 4GB. In such cases, existing bootmem APIs which operate on 32 bit addresses won't work and needs memblock layer which operates on 64 bit addresses. So we add equivalent APIs so that we can replace usage of bootmem with memblock interfaces. Architectures already converted to NO_BOOTMEM use these new memblock interfaces. The architectures which are still not converted to NO_BOOTMEM continue to function as is because we still maintain the fal lback option of bootmem back-end supporting these new interfaces. So no functional change as such. In long run, once all the architectures moves to NO_BOOTMEM, we can get rid of bootmem layer completely. This is one step to remove the core code dependency with bootmem and also gives path for architectures to move away from bootmem. The proposed interface will became active if both CONFIG_HAVE_MEMBLOCK and CONFIG_NO_BOOTMEM are specified by arch. In case !CONFIG_NO_BOOTMEM, the memblock() wrappers will fallback to the existing bootmem apis so that arch's not converted to NO_BOOTMEM continue to work as is. The meaning of MEMBLOCK_ALLOC_ACCESSIBLE and MEMBLOCK_ALLOC_ANYWHERE is kept same. [[email protected]: s/depricated/deprecated/] Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: switch to use NUMA_NO_NODE instead of MAX_NUMNODESGrygorii Strashko3-15/+25
It's recommended to use NUMA_NO_NODE everywhere to select "process any node" behavior or to indicate that "no node id specified". Hence, update __next_free_mem_range*() API's to accept both NUMA_NO_NODE and MAX_NUMNODES, but emit warning once on MAX_NUMNODES, and correct corresponding API's documentation to describe new behavior. Also, update other memblock/nobootmem APIs where MAX_NUMNODES is used dirrectly. The change was suggested by Tejun Heo. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: reorder parameters of memblock_find_in_range_nodeGrygorii Strashko3-11/+12
Reorder parameters of memblock_find_in_range_node to be consistent with other memblock APIs. The change was suggested by Tejun Heo <[email protected]>. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: drop WARN and use SMP_CACHE_BYTES as a default alignmentGrygorii Strashko1-2/+2
Don't produce warning and interpret 0 as "default align" equal to SMP_CACHE_BYTES in case if caller of memblock_alloc_base_nid() doesn't specify alignment for the block (align == 0). This is done in preparation of introducing common memblock alloc interface to make code behavior consistent. More details are in below thread : https://lkml.org/lkml/2013/10/13/117. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: remove unnecessary inclusions of bootmem.hGrygorii Strashko2-2/+0
Clean-up to remove depedency with bootmem headers. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Reviewed-by: Tejun Heo <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/bootmem: remove duplicated declaration of __free_pages_bootmem()Grygorii Strashko1-1/+0
The __free_pages_bootmem is used internally by MM core and already defined in internal.h. So, remove duplicated declaration. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Reviewed-by: Tejun Heo <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: debug: don't free reserved array if !ARCH_DISCARD_MEMBLOCKGrygorii Strashko1-0/+13
Now the Nobootmem allocator will always try to free memory allocated for reserved memory regions (free_low_memory_core_early()) without taking into to account current memblock debugging configuration (CONFIG_ARCH_DISCARD_MEMBLOCK and CONFIG_DEBUG_FS state). As result if: - CONFIG_DEBUG_FS defined - CONFIG_ARCH_DISCARD_MEMBLOCK not defined; - reserved memory regions array have been resized during boot then: - memory allocated for reserved memory regions array will be freed to buddy allocator; - debug_fs entry "sys/kernel/debug/memblock/reserved" will show garbage instead of state of memory reservations. like: 0: 0x98393bc0..0x9a393bbf 1: 0xff120000..0xff11ffff 2: 0x00000000..0xffffffff Hence, do not free memory allocated for reserved memory regions if defined(CONFIG_DEBUG_FS) && !defined(CONFIG_ARCH_DISCARD_MEMBLOCK). Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Reviewed-by: Tejun Heo <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21x86: memblock: set current limit to max low memory addressSantosh Shilimkar2-3/+3
The memblock current limit value is used to limit early boot memory allocations below max low memory address by default, as the kernel can access only to the low memory. Hence, set memblock current limit value to the max mapped low memory address instead of max mapped memory address. Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Grygorii Strashko <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21oom_kill: add rcu_read_lock() into find_lock_task_mm()Oleg Nesterov1-4/+8
find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <[email protected]> Cc: Sergey Dyasly <[email protected]> Cc: Sameer Nanda <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Mandeep Singh Baines <[email protected]> Cc: "Ma, Xindong" <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: "Tu, Xiaobing" <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()Oleg Nesterov1-8/+11
At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need "bool ret" and "break". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Sergey Dyasly <[email protected]> Tested-by: Sergey Dyasly <[email protected]> Reviewed-by: Sameer Nanda <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Mandeep Singh Baines <[email protected]> Cc: "Ma, Xindong" <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: "Tu, Xiaobing" <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21oom_kill: change oom_kill.c to use for_each_thread()Oleg Nesterov1-10/+10
Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Sergey Dyasly <[email protected]> Tested-by: Sergey Dyasly <[email protected]> Reviewed-by: Sameer Nanda <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Mandeep Singh Baines <[email protected]> Cc: "Ma, Xindong" <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: "Tu, Xiaobing" <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21introduce for_each_thread() to replace the buggy while_each_thread()Oleg Nesterov4-0/+22
while_each_thread() and next_thread() should die, almost every lockless usage is wrong. 1. Unless g == current, the lockless while_each_thread() is not safe. while_each_thread(g, t) can loop forever if g exits, next_thread() can't reach the unhashed thread in this case. Note that this can happen even if g is the group leader, it can exec. 2. Even if while_each_thread() itself was correct, people often use it wrongly. It was never safe to just take rcu_read_lock() and loop unless you verify that pid_alive(g) == T, even the first next_thread() can point to the already freed/reused memory. This patch adds signal_struct->thread_head and task->thread_node to create the normal rcu-safe list with the stable head. The new for_each_thread(g, t) helper is always safe under rcu_read_lock() as long as this task_struct can't go away. Note: of course it is ugly to have both task_struct->thread_node and the old task_struct->thread_group, we will kill it later, after we change the users of while_each_thread() to use for_each_thread(). Perhaps we can kill it even before we convert all users, we can reimplement next_thread(t) using the new thread_head/thread_node. But we can't do this right now because this will lead to subtle behavioural changes. For example, do/while_each_thread() always sees at least one task, while for_each_thread() can do nothing if the whole thread group has died. Or thread_group_empty(), currently its semantics is not clear unless thread_group_leader(p) and we need to audit the callers before we can change it. So this patch adds the new interface which has to coexist with the old one for some time, hopefully the next changes will be more or less straightforward and the old one will go away soon. Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Sergey Dyasly <[email protected]> Tested-by: Sergey Dyasly <[email protected]> Reviewed-by: Sameer Nanda <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Mandeep Singh Baines <[email protected]> Cc: "Ma, Xindong" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Tu, Xiaobing" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: use rmap_walk() in page_mkclean()Joonsoo Kim1-25/+26
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in page_mkclean(). In this patch, I change following things. 1. remove some variants of rmap traversing functions. cf> page_mkclean_file 2. mechanical change to use rmap_walk() in page_mkclean(). Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: use rmap_walk() in page_referenced()Joonsoo Kim4-194/+80
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in page_referenced(). In this patch, I change following things. 1. remove some variants of rmap traversing functions. cf> page_referenced_ksm, page_referenced_anon, page_referenced_file 2. introduce new struct page_referenced_arg and pass it to page_referenced_one(), main function of rmap_walk, in order to count reference, to store vm_flags and to check finish condition. 3. mechanical change to use rmap_walk() in page_referenced(). [[email protected]: fix BUG at rmap_walk] Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Cc: Sasha Levin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: use rmap_walk() in try_to_munlock()Joonsoo Kim3-168/+42
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in try_to_munlock(). In this patch, I change following things. 1. remove some variants of rmap traversing functions. cf> try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file 2. mechanical change to use rmap_walk() in try_to_munlock(). 3. copy and paste comments. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: use rmap_walk() in try_to_unmap()Joonsoo Kim3-18/+39
Now, we have an infrastructure in rmap_walk() to handle difference from variants of rmap traversing functions. So, just use it in try_to_unmap(). In this patch, I change following things. 1. enable rmap_walk() if !CONFIG_MIGRATION. 2. mechanical change to use rmap_walk() in try_to_unmap(). Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: extend rmap_walk_xxx() to cope with different casesJoonsoo Kim3-8/+51
There are a lot of common parts in traversing functions, but there are also a little of uncommon parts in it. By assigning proper function pointer on each rmap_walker_control, we can handle these difference correctly. Following are differences we should handle. 1. difference of lock function in anon mapping case 2. nonlinear handling in file mapping case 3. prechecked condition: checking memcg in page_referenced(), checking VM_SHARE in page_mkclean() checking temporary vma in try_to_unmap() 4. exit condition: checking page_mapped() in try_to_unmap() So, in this patch, I introduce 4 function pointers to handle above differences. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: make rmap_walk to get the rmap_walk_control argumentJoonsoo Kim5-21/+27
In each rmap traverse case, there is some difference so that we need function pointers and arguments to them in order to handle these For this purpose, struct rmap_walk_control is introduced in this patch, and will be extended in following patch. Introducing and extending are separate, because it clarify changes. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: factor lock function out of rmap_walk_anon()Joonsoo Kim1-8/+20
When we traverse anon_vma, we need to take a read-side anon_lock. But there is subtle difference in the situation so that we can't use same method to take a lock in each cases. Therefore, we need to make rmap_walk_anon() taking difference lock function. This patch is the first step, factoring lock function for anon_lock out of rmap_walk_anon(). It will be used in case of removing migration entry and in default of rmap_walk_anon(). Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: factor nonlinear handling out of try_to_unmap_file()Joonsoo Kim1-62/+74
To merge all kinds of rmap traverse functions, try_to_unmap(), try_to_munlock(), page_referenced() and page_mkclean(), we need to extract common parts and separate out non-common parts. Nonlinear handling is handled just in try_to_unmap_file() and other rmap traverse functions doesn't care of it. Therfore it is better to factor nonlinear handling out of try_to_unmap_file() in order to merge all kinds of rmap traverse functions easily. Signed-off-by: Joonsoo Kim <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/rmap: recompute pgoff for huge pageJoonsoo Kim1-5/+2
Rmap traversing is used in five different cases, try_to_unmap(), try_to_munlock(), page_referenced(), page_mkclean() and remove_migration_ptes(). Each one implements its own traversing functions for the cases, anon, file, ksm, respectively. These cause lots of duplications and cause maintenance overhead. They also make codes being hard to understand and error-prone. One example is hugepage handling. There is a code to compute hugepage offset correctly in try_to_unmap_file(), but, there isn't a code to compute hugepage offset in rmap_walk_file(). These are used pairwise in migration context, but we missed to modify pairwise. To overcome these drawbacks, we should unify these through one unified function. I decide rmap_walk() as main function since it has no unnecessity. And to control behavior of rmap_walk(), I introduce struct rmap_walk_control having some function pointers. These makes rmap_walk() working for their specific needs. This patchset remove a lot of duplicated code as you can see in below short-stat and kernel text size also decrease slightly. text data bss dec hex filename 10640 1 16 10657 29a1 mm/rmap.o 10047 1 16 10064 2750 mm/rmap.o 13823 705 8288 22816 5920 mm/ksm.o 13199 705 8288 22192 56b0 mm/ksm.o This patch (of 9): We have to recompute pgoff if the given page is huge, since result based on HPAGE_SIZE is not approapriate for scanning the vma interval tree, as shown by commit 36e4f20af833 ("hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach") and commit 369a713e ("rmap: recompute pgoff for unmapping huge page"). To handle both the cases, normal page for page cache and hugetlb page, by same way, we can use compound_page(). It returns 0 on non-compound page and it also returns proper value on compound page. Signed-off-by: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memcg: make memcg_update_cache_sizes() staticVladimir Davydov1-1/+1
This function is not used outside of memcontrol.c so make it static. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memcg: fix kmem_account_flags check in memcg_can_account_kmem()Vladimir Davydov1-1/+2
We should start kmem accounting for a memory cgroup only after both its kmem limit is set (KMEM_ACCOUNTED_ACTIVE) and related call sites are patched (KMEM_ACCOUNTED_ACTIVATED). Currently memcg_can_account_kmem() allows kmem accounting even if only one of the conditions is true. Fix it. This means that a page might get charged by memcg_kmem_newpage_charge which would see its static key patched already but memcg_kmem_commit_charge would still see it unpatched and so the charge won't be committed. The result would be charge inconsistency (page_cgroup not marked as PageCgroupUsed) and the charge would leak because __memcg_kmem_uncharge_pages would ignore it. [[email protected]: augment changelog] Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Balbir Singh <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Glauber Costa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21x86, numa, acpi, memory-hotplug: make movable_node have higher priorityTang Chen1-2/+26
If users specify the original movablecore=nn@ss boot option, the kernel will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar except it specifies ZONE_NORMAL ranges. Now, if users specify "movable_node" in kernel commandline, the kernel will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored. For those who don't want this, just specify nothing. The kernel will act as before. Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Zhang Yanfei <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memblock, mem_hotplug: make memblock skip hotpluggable regions if neededTang Chen3-0/+37
Linux kernel cannot migrate pages used by the kernel. As a result, hotpluggable memory used by the kernel won't be able to be hot-removed. To solve this problem, the basic idea is to prevent memblock from allocating hotpluggable memory for the kernel at early time, and arrange all hotpluggable memory in ACPI SRAT(System Resource Affinity Table) as ZONE_MOVABLE when initializing zones. In the previous patches, we have marked hotpluggable memory regions with MEMBLOCK_HOTPLUG flag in memblock.memory. In this patch, we make memblock skip these hotpluggable memory regions in the default top-down allocation function if movable_node boot option is specified. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggableTang Chen1-0/+44
At very early time, the kernel have to use some memory such as loading the kernel image. We cannot prevent this anyway. So any node the kernel resides in should be un-hotpluggable. Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21acpi, numa, mem_hotplug: mark hotpluggable memory in memblockTang Chen2-0/+7
When parsing SRAT, we know that which memory area is hotpluggable. So we invoke function memblock_mark_hotplug() introduced by previous patch to mark hotpluggable memory in memblock. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memblock: make memblock_set_node() support different memblock_typeTang Chen12-19/+28
[[email protected]: fix powerpc build] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable ↵Tang Chen2-0/+70
regions In find_hotpluggable_memory, once we find out a memory region which is hotpluggable, we want to mark them in memblock.memory. So that we could control memblock allocator not to allocte hotpluggable memory for the kernel later. To achieve this goal, we introduce MEMBLOCK_HOTPLUG flag to indicate the hotpluggable memory regions in memblock and a function memblock_mark_hotplug() to mark hotpluggable memory if we find one. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21memblock, numa: introduce flags field into memblockTang Chen2-15/+39
There is no flag in memblock to describe what type the memory is. Sometimes, we may use memblock to reserve some memory for special usage. And we want to know what kind of memory it is. So we need a way to In hotplug environment, we want to reserve hotpluggable memory so the kernel won't be able to use it. And when the system is up, we have to free these hotpluggable memory to buddy. So we need to mark these memory first. In order to do so, we need to mark out these special memory in memblock. In this patch, we introduce a new "flags" member into memblock_region: struct memblock_region { phys_addr_t base; phys_addr_t size; unsigned long flags; /* This is new. */ #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int nid; #endif }; This patch does the following things: 1) Add "flags" member to memblock_region. 2) Modify the following APIs' prototype: memblock_add_region() memblock_insert_region() 3) Add memblock_reserve_region() to support reserve memory with flags, and keep memblock_reserve()'s prototype unmodified. 4) Modify other APIs to support flags, but keep their prototype unmodified. The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>. Suggested-by: Wen Congyang <[email protected]> Suggested-by: Liu Jiang <[email protected]> Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/memblock: debug: correct displaying of upper memory boundaryGrygorii Strashko1-2/+2
Current memblock APIs don't work on 32 PAE or LPAE extension arches where the physical memory start address beyond 4GB. The problem was discussed here [3] where Tejun, Yinghai(thanks) proposed a way forward with memblock interfaces. Based on the proposal, this series adds necessary memblock interfaces and convert the core kernel code to use them. Architectures already converted to NO_BOOTMEM use these new interfaces and other which still uses bootmem, these new interfaces just fallback to exiting bootmem APIs. So no functional change in behavior. In long run, once all the architectures moves to NO_BOOTMEM, we can get rid of bootmem layer completely. This is one step to remove the core code dependency with bootmem and also gives path for architectures to move away from bootmem. Testing is done on ARM architecture with 32 bit ARM LAPE machines with normal as well sparse(faked) memory model. This patch (of 23): When debugging is enabled (cmdline has "memblock=debug") the memblock will display upper memory boundary per each allocated/freed memory range wrongly. For example: memblock_reserve: [0x0000009e7e8000-0x0000009e7ed000] _memblock_early_alloc_try_nid_nopanic+0xfc/0x12c The 0x0000009e7ed000 is displayed instead of 0x0000009e7ecfff Hence, correct this by changing formula used to calculate upper memory boundary to (u64)base + size - 1 instead of (u64)base + size everywhere in the debug messages. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Russell King <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/mlock: prepare params outside critical regionDavidlohr Bueso1-7/+11
All mlock related syscalls prepare lock limits, lengths and start parameters with the mmap_sem held. Move this logic outside of the critical region. For the case of mlock, continue incrementing the amount already locked by mm->locked_vm with the rwsem taken. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Michel Lespinasse <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/mmap.c: add mlock_future_check() helperDavidlohr Bueso1-22/+23
Both do_brk and do_mmap_pgoff verify that we are actually capable of locking future pages if the corresponding VM_LOCKED flags are used. Encapsulate this logic into a single mlock_future_check() helper function. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Michel Lespinasse <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm: add overcommit_kbytes sysctl variableJerome Marchand8-8/+70
Some applications that run on HPC clusters are designed around the availability of RAM and the overcommit ratio is fine tuned to get the maximum usage of memory without swapping. With growing memory, the 1%-of-all-RAM grain provided by overcommit_ratio has become too coarse for these workload (on a 2TB machine it represents no less than 20GB). This patch adds the new overcommit_kbytes sysctl variable that allow a much finer grain. [[email protected]: coding-style fixes] [[email protected]: fix nommu build] Signed-off-by: Jerome Marchand <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Alan Cox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNTMel Gorman9-190/+65
Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm: do not walk all of system memory during show_mem") avoided a PFN walk in the generic show_mem helper which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case. This patch removes PFN walkers from the arch-specific implementations that report on a per-node or per-zone granularity. ARM and unicore32 still do a PFN walk as they report memory usage on each bank which is a much finer granularity where the debugging information may still be of use. As the remaining arches doing PFN walks have relatively small amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT. [[email protected]: fix parisc] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tony Luck <[email protected]> Cc: Russell King <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}Jianyu Zhan1-10/+10
Currently we are implementing vmalloc_to_pfn() as a wrapper around vmalloc_to_page(), which is implemented as follow: 1. walks the page talbes to generates the corresponding pfn, 2. then converts the pfn to struct page, 3. returns it. And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn. This seems too circuitous, so this patch reverses the way: implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient. No functional change. Signed-off-by: Jianyu Zhan <[email protected]> Cc: Vladimir Murzin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm, mempolicy: remove unneeded functions for UMA configsDavid Rientjes1-32/+0
Mempolicies only exist for CONFIG_NUMA configurations. Therefore, a certain class of functions are unneeded in configurations where CONFIG_NUMA is disabled such as functions that duplicate existing mempolicies, lookup existing policies, set certain mempolicy traits, or test mempolicies for certain attributes. Remove the unneeded functions so that any future callers get a compile- time error and protect their code with CONFIG_NUMA as required. Signed-off-by: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm/hugetlb.c: call MMU notifiers when copying a hugetlb page rangeAndreas Sandberg1-5/+16
When copy_hugetlb_page_range() is called to copy a range of hugetlb mappings, the secondary MMUs are not notified if there is a protection downgrade, which breaks COW semantics in KVM. This patch adds the necessary MMU notifier calls. Signed-off-by: Andreas Sandberg <[email protected]> Acked-by: Steve Capper <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm, memory-failure: fix typo in me_pagecache_dirty()Zhi Yong Wu1-1/+1
[[email protected]: s/cache/pagecache/] Signed-off-by: Zhi Yong Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm: create a separate slab for page->ptl allocationKirill A. Shutemov3-3/+24
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64 is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab, so we loose 24 on each. An average system can easily allocate few tens thousands of page->ptl and overhead is significant. Let's create a separate slab for page->ptl allocation to solve this. To make sure that it really works this time, some numbers from my test machine (just booted, no load): Before: # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92 After: # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9 kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0 Note that the patch is useful not only for debug case, but also for PREEMPT_RT, where spinlock_t is always bloated. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserveYasuaki Ishimatsu2-0/+19
Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on 9TB memory machine since onlining memory sections is too slow. And we found out setup_zone_migrate_reserve spent >90% of the time. The problem is, setup_zone_migrate_reserve scans all pageblocks unconditionally, but it is only necessary if the number of reserved block was reduced (i.e. memory hot remove). Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means that the number of reserved pageblocks is almost always unchanged. This patch adds zone->nr_migrate_reserve_block to maintain the number of MIGRATE_RESERVE pageblocks and it reduces the overhead of setup_zone_migrate_reserve dramatically. The following table shows time of onlining a memory section. Amount of memory | 128GB | 192GB | 256GB| --------------------------------------------- linux-3.12 | 23.9 | 31.4 | 44.5 | This patch | 8.3 | 8.3 | 8.6 | Mel's proposal patch | 10.9 | 19.2 | 31.3 | --------------------------------------------- (millisecond) 128GB : 4 nodes and each node has 32GB of memory 192GB : 6 nodes and each node has 32GB of memory 256GB : 8 nodes and each node has 32GB of memory (*1) Mel proposed his idea by the following threads. https://lkml.org/lkml/2013/10/30/272 [[email protected]: tweak comment] Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reported-by: Yasuaki Ishimatsu <[email protected]> Tested-by: Yasuaki Ishimatsu <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21/proc/meminfo: provide estimated available memoryRik van Riel2-0/+46
Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today. It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files. Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo. However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place. Signed-off-by: Rik van Riel <[email protected]> Reported-by: Erik Mouw <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail()Oleg Nesterov1-4/+3
get_huge_page_tail()->compound_head() looks confusing. Every caller must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is simply wrong if this page is compound-trans-head. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dave Jones <[email protected]> Cc: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>