path: root/include/linux/nodemask.h
2015-02-13  bitmap, cpumask, nodemask: remove dedicated formatting functions  (Tejun Heo, 1 file, -26/+7)

Now that all bitmap formatting usages have been converted to '%*pb[l]', the separate formatting functions are unnecessary. The following functions are removed:

* bitmap_scn[list]printf()
* cpumask_scnprintf(), cpulist_scnprintf()
* [__]nodemask_scnprintf(), [__]nodelist_scnprintf()
* seq_bitmap[_list](), seq_cpumask[_list](), seq_nodemask[_list]()
* seq_buf_bitmask()

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
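A hedged sketch of what the conversion looks like at a former seq_nodemask_list() call site; the show routine and its mask are hypothetical, but nodemask_pr_args() and the '%*pbl' specifier are the mechanism the changelog names:

	/* Hypothetical seq_file show routine: prints a nodemask as a
	 * node list ("0,2-3") via '%*pbl' instead of the removed
	 * seq_nodemask_list() helper. */
	static int foo_nodes_show(struct seq_file *m, void *v)
	{
		nodemask_t mask = NODE_MASK_NONE;

		node_set(0, mask);	/* example content */
		seq_printf(m, "nodes: %*pbl\n", nodemask_pr_args(&mask));
		return 0;
	}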
2015-02-13  cpumask, nodemask: implement cpumask/nodemask_pr_args()  (Tejun Heo, 1 file, -0/+8)

The printf family of functions can now format bitmaps using '%*pb[l]', and all cpumask and nodemask formatting will be converted to use it. To ease printing these masks with '%*pb[l]', which requires two parameters - the number of bits and the actual bitmap - this patch implements cpumask_pr_args() and nodemask_pr_args(), which can be used to provide the arguments for '%*pb[l]'.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "John W. Linville" <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steffen Klassert <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
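A minimal sketch of the helper pair and a call site; the definitions are paraphrased from the changelog's description (a bit count plus the bitmap itself), not quoted from the patch:

	/* '%*pb[l]' consumes two varargs: field width = number of bits,
	 * then a pointer to the bitmap. The helpers expand to exactly
	 * that pair (paraphrased). */
	#define cpumask_pr_args(maskp)	nr_cpu_ids, cpumask_bits(maskp)
	#define nodemask_pr_args(maskp)	MAX_NUMNODES, (maskp)->bits

	static void dump_online_nodes(void)
	{
		pr_info("online: %*pb\n",  nodemask_pr_args(&node_states[N_ONLINE]));
		pr_info("online: %*pbl\n", nodemask_pr_args(&node_states[N_ONLINE]));
	}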
2015-02-12  linux/nodemask.h: update bitmap wrappers to take unsigned int  (Rasmus Villemoes, 1 file, -13/+13)

Since the various bitmap_* functions now take an unsigned int as the nbits parameter, it makes sense to also update the various wrappers.

Signed-off-by: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-08-06  mm, oom: ensure memoryless node zonelist always includes zones  (David Rientjes, 1 file, -1/+10)

With memoryless node support being worked on, it's possible that, as an optimization, a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL.

The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and the pagefault out-of-memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer.

[[email protected]: fix non-numa build]
Signed-off-by: David Rientjes <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
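A hedged sketch of the pattern this implies for callers - taking the zonelist from a node known to have memory; it assumes the first_memory_node helper (the first node set in node_states[N_MEMORY]) rather than quoting the actual patch:

	static struct zonelist *oom_safe_zonelist(void)
	{
		/* first_memory_node is the first node in node_states[N_MEMORY],
		 * so its zonelist has memory zones even when node 0 is
		 * memoryless. (Assumed helper, not quoted from the patch.) */
		return node_zonelist(first_memory_node, GFP_KERNEL);
	}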
2013-07-25  numa: Mark __node_set() as __always_inline  (Tom Rini, 1 file, -1/+10)

It is possible for some compilers to decide that __node_set() need not be turned into an inline function. When the compiler does this for an __init function that calls __node_set() on __initdata, we get a section mismatch warning. Use __always_inline to ensure that it will be inlined.

Reported-by: Paul Bolle <[email protected]>
Cc: Jianpeng Ma <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Tom Rini <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
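A minimal before/after sketch of the change the changelog describes; the one-line body matches this header's long-standing definition of __node_set():

	/* Before: a plain 'static inline' - a hint the compiler may
	 * ignore, leaving an out-of-line copy that modpost flags when
	 * __init code passes it __initdata. */
	static inline void __node_set(int node, volatile nodemask_t *dstp)
	{
		set_bit(node, dstp->bits);
	}

	/* After: inlining is forced, so no out-of-line copy can exist. */
	static __always_inline void __node_set(int node, volatile nodemask_t *dstp)
	{
		set_bit(node, dstp->bits);
	}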
2012-12-12  numa: add CONFIG_MOVABLE_NODE for movable-dedicated node  (Lai Jiangshan, 1 file, -0/+4)

We need a node which contains only movable memory. This feature is very important for node hotplug: if a node has normal or high memory, that memory may be used by the kernel and can't be offlined; if the node contains only movable memory, we can offline both the memory and the node.

Everything is now prepared, so we can actually introduce N_MEMORY. Add CONFIG_MOVABLE_NODE so that we can use it for movable-dedicated nodes.

[[email protected]: fix Kconfig text]
Signed-off-by: Lai Jiangshan <[email protected]>
Tested-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2012-12-12  mm: node_states: introduce N_MEMORY  (Lai Jiangshan, 1 file, -0/+1)

We have N_NORMAL_MEMORY standing for the nodes that have normal memory (zone_type <= ZONE_NORMAL), and we have N_HIGH_MEMORY standing for the nodes that have normal or high memory. But we don't have any state standing for the nodes that have *any* memory, and we have N_CPU but no N_MEMORY. Current code reuses N_HIGH_MEMORY for this purpose, because any node which has memory must currently have high or normal memory.

A) This reuse is bad for *readability*, because the name N_HIGH_MEMORY just stands for high or normal memory.

Example 1) mem_cgroup_nr_lru_pages():

	for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function count only high- or normal-memory nodes? does it count ZONE_MOVABLE's LRU pages?) until someone tells them that N_HIGH_MEMORY is reused to stand for nodes that have any memory. If we introduce N_MEMORY, we remove this confusion and make the code clearer.

Example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice. The first use is in page_cgroup_init():

	for_each_node_state(nid, N_HIGH_MEMORY) {

It means: if the node has memory, allocate a page_cgroup map for it. We should use N_MEMORY here to gain clarity. The second use is in alloc_page_cgroup():

	if (node_state(nid, N_HIGH_MEMORY))
		addr = vzalloc_node(size, nid);

It means: the node has high or normal memory that the kernel can allocate from. We should keep N_HIGH_MEMORY here, and it will be better still once the "any memory" semantic is removed from N_HIGH_MEMORY.

B) This reuse is outdated once we introduce MOVABLE-dedicated nodes. A MOVABLE-dedicated node should appear in neither node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY], because it has no high or normal memory. On x86_64, N_HIGH_MEMORY == N_NORMAL_MEMORY, so if a MOVABLE-dedicated node is in node_states[N_HIGH_MEMORY] it is also in node_states[N_NORMAL_MEMORY], which breaks SLUB: SLUB uses for_each_node_state(nid, N_NORMAL_MEMORY) and would create a kmem_cache_node for the MOVABLE-dedicated node, causing problems.

In one word: we need N_MEMORY. We introduce it here as an alias of N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later patches.

Signed-off-by: Lai Jiangshan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Cc: Lin Feng <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
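A hedged before/after sketch of the readability point above; for_each_node_state() is the iterator this header provides, while alloc_per_node_map() is a hypothetical stand-in for whatever per-node work a caller does:

	int nid;

	/* Before: the name suggests "high or normal memory" nodes only. */
	for_each_node_state(nid, N_HIGH_MEMORY)
		alloc_per_node_map(nid);	/* hypothetical helper */

	/* After: N_MEMORY says what is meant - every node with any memory. */
	for_each_node_state(nid, N_MEMORY)
		alloc_per_node_map(nid);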
2011-07-26  cpusets: randomize node rotor used in cpuset_mem_spread_node()  (Michal Hocko, 1 file, -0/+13)

[This patch was already accepted as commit 0ac0c0d0f837 but later reverted (commit 35926ff5fba8) because it introduced an arch-specific __node_random which was defined only for x86, breaking the other archs. This is a follow-up without any arch-specific code. Other than that there are no functional changes.]

Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset.

[[email protected]: fix layout]
[[email protected]: Define stub numa_random() for !NUMA configuration]
[[email protected]: Make it arch independent]
[[email protected]: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
Signed-off-by: Jack Steiner <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Robin Holt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
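A hedged sketch of the rotor initialization the changelog describes; node_random() is the arch-independent helper this version introduces, and the field name is assumed from the cpuset_mem_spread_node() context:

	/* Seed the spread rotor with a random node from the task's
	 * allowed set, instead of letting every new task start at
	 * node 0. (Field name assumed, not quoted from the patch.) */
	static void init_spread_rotor(struct task_struct *tsk)
	{
		tsk->cpuset_mem_spread_rotor = node_random(&tsk->mems_allowed);
	}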
2010-05-30  Revert "cpusets: randomize node rotor used in cpuset_mem_spread_node()"  (Linus Torvalds, 1 file, -8/+0)

This reverts commit 0ac0c0d0f837c499afd02a802f9cf52d3027fa3b, which caused cross-architecture build problems for all the wrong reasons. IA64 already added its own version of __node_random(), but the fact is, there is nothing architectural about the function, and the original commit was just badly done. Revert it, since no fix is forthcoming.

Requested-by: Stephen Rothwell <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-05-27  cpusets: randomize node rotor used in cpuset_mem_spread_node()  (Jack Steiner, 1 file, -0/+8)

Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset.

[[email protected]: fix layout]
[[email protected]: Define stub numa_random() for !NUMA configuration]
Signed-off-by: Jack Steiner <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Robin Holt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12  nodemask: fix the declaration of NODEMASK_ALLOC()  (Miao Xie, 1 file, -1/+1)

We can't declare two variables at the same scope with NODEMASK_ALLOC(). This patch fixes that.

Signed-off-by: Miao Xie <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  nodemask.h: remove macro any_online_node  (H Hartley Sweeten, 1 file, -11/+0)

The macro any_online_node() is prone to producing sparse warnings due to the local symbol 'node'. Since all the in-tree users are really requesting the first online node (the mask argument is either NODE_MASK_ALL or node_online_map), just use the first_online_node macro and remove any_online_node(), since no users remain.

Signed-off-by: H Hartley Sweeten <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Acked-by: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Milton Miller <[email protected]>
Cc: Nathan Fontenot <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Grant Likely <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Benny Halevy <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Ricardo Labiaga <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
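A hedged sketch of the substitution at a typical caller; since every in-tree mask argument was NODE_MASK_ALL or node_online_map, both forms yield the same node:

	int nid;

	/* Before (removed macro): scan a mask for the first online node. */
	nid = any_online_node(NODE_MASK_ALL);

	/* After: say so directly. */
	nid = first_online_node;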
2009-12-15  mm: add gfp flags for NODEMASK_ALLOC slab allocations  (David Rientjes, 1 file, -9/+12)

Objects passed to NODEMASK_ALLOC() are relatively small in size and are backed by slab caches that are not of large order, traditionally never greater than PAGE_ALLOC_COSTLY_ORDER.

Thus, using GFP_KERNEL for these allocations on large machines when CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in the allocation attempt, each time invoking both direct reclaim and the oom killer.

This is of particular interest when using NODEMASK_ALLOC() from a mempolicy context (either directly in mm/mempolicy.c or in mempolicy-constrained hugetlb allocations), since the oom killer always kills current when allocations are constrained by mempolicies. So for all present use cases in the kernel, current would end up being oom killed when direct reclaim fails. That would allow the NODEMASK_ALLOC() to succeed, but current would have sacrificed itself upon returning.

This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on CONFIG_NODES_SHIFT > 8; the parameter is a nop on other configurations. All current use cases, either directly from hugetlb code or indirectly via NODEMASK_SCRATCH(), OR in __GFP_NORETRY to avoid direct reclaim and the oom killer when the slab allocator needs to allocate additional pages.

The side effect of this change is that all current use cases of either NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling when the allocation fails (never for CONFIG_NODES_SHIFT <= 8). All current use cases were audited and do have appropriate error handling at this time.

Signed-off-by: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Adam Litke <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Eric Whitney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
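A hedged sketch of the resulting call pattern - callers now pass gfp flags and must handle failure; the surrounding function is illustrative:

	static int walk_allowed_nodes(void)
	{
		/* Declares 'nodes_allowed'. With CONFIG_NODES_SHIFT > 8 this
		 * kmallocs the mask and the gfp flags apply; on smaller
		 * configs it lives on the stack and the flags are a nop -
		 * but the NULL check is still required for portability. */
		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
			       GFP_KERNEL | __GFP_NORETRY);

		if (!nodes_allowed)
			return -ENOMEM;

		nodes_clear(*nodes_allowed);
		/* ... populate and use *nodes_allowed ... */

		NODEMASK_FREE(nodes_allowed);
		return 0;
	}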
2009-12-15  hugetlb: factor init_nodemask_of_node()  (Lee Schermerhorn, 1 file, -3/+8)

Factor init_nodemask_of_node() out of the nodemask_of_node() macro.

This will be used to populate the huge pages "nodes_allowed" nodemask for a single node when basing nodes_allowed on a preferred/local mempolicy or when a persistent huge page pool page count is modified via a per-node sysfs attribute.

Signed-off-by: Lee Schermerhorn <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Adam Litke <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Eric Whitney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
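A hedged sketch of the factored helper in use; restrict_to_node() is a hypothetical caller in the spirit of the hugetlb "nodes_allowed" use the changelog mentions:

	/* nodemask_of_node(node) still builds and returns a mask in
	 * place; the factored init_nodemask_of_node() can target a
	 * mask that lives elsewhere, e.g. one from NODEMASK_ALLOC(). */
	static void restrict_to_node(nodemask_t *nodes_allowed, int nid)
	{
		init_nodemask_of_node(nodes_allowed, nid);
	}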
2009-12-15  nodemask: make NODEMASK_ALLOC more general  (David Rientjes, 1 file, -7/+8)

This is a series of patches to provide control over the location of the allocation and freeing of persistent huge pages on a NUMA platform. Please consider for merging into mmotm.

This series uses two mechanisms to constrain the nodes from which persistent huge pages are allocated:

1) the task NUMA mempolicy of the task modifying a new sysctl "nr_hugepages_mempolicy", based on a suggestion by Mel Gorman; and

2) a subset of the hugepages hstate sysfs attributes have been added [in V4] to each node system device under:

	/sys/devices/node/node[0-9]*/hugepages

The per-node attributes allow direct assignment of a huge page count on a specific node, regardless of the task's mempolicy or cpuset constraints.

This patch:

NODEMASK_ALLOC(x, m) assumes x is a struct type, which is unnecessary. It's perfectly reasonable to use this macro to allocate a nodemask_t, which is anonymous, either dynamically or on the stack depending on NODES_SHIFT.

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Adam Litke <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Eric Whitney <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2009-08-07  mm: make set_mempolicy(MPOL_INTERLEAVE) N_HIGH_MEMORY aware  (KAMEZAWA Hiroyuki, 1 file, -0/+28)

At first, init_task's mems_allowed is initialized as:

	init_task->mems_allowed == node_state[N_POSSIBLE]

and cpuset's top_cpuset mask is initialized as:

	top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

Before 2.6.29, a policy's mems_allowed was initialized like this:

	1. update task->mems_allowed from its cpuset->mems_allowed.
	2. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

Because the task's mems_allowed was updated with reference to top_cpuset's, and a cpuset's mems_allowed is always aware of N_HIGH_MEMORY, the policy was too.

In 2.6.30, after commit 58568d2a8215cb6f55caf2332017d7bdff954e1c ("cpuset,mm: update tasks' mems_allowed in time"), a policy's mems_allowed is initialized as:

	1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

Here, if the task is in top_cpuset, task->mems_allowed is never updated from init's. Assume the user executes a command such as:

	# numactl --interleave=all ...

Then:

	policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)

and the policy's mems_allowed can include a possible node which has no pgdat. MPOL_INTERLEAVE just scans the nodemask of task->mems_allowed and accesses NODE_DATA(nid)->zonelist directly, even if NODE_DATA(nid) == NULL.

What we need, then, is to make policy->mems_allowed aware of N_HIGH_MEMORY. This patch does that, at the cost of an extra nodemask that would otherwise land on the stack; since cpumask already has a CPUMASK_ALLOC() interface for exactly this, I added the node equivalent.

This patch preserves the old behavior, but I feel the fix itself is just a band-aid. A fundamental fix has to take care of memory hotplug, and that takes time (task->mems_allowed should be N_HIGH_MEMORY, I think). mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's nodemask should include only online nodes. In the old behavior this was guaranteed by frequent references into cpuset code; now most of those are removed and mempolicy has to check by itself. That check needs a few nodemask_t values for calculating the mask, but nodemask_t can be big and is not good to allocate on the stack. cpumask_t has CPUMASK_ALLOC/FREE as easy scratch-area helpers; NODEMASK_ALLOC/FREE should exist too.

[[email protected]: cleanups & tweaks]
Tested-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Yasunori Goto <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2009-06-16  page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths  (Christoph Lameter, 1 file, -3/+16)

num_online_nodes() is called in a number of places, but most often by the page allocator when deciding whether the zonelist needs to be filtered based on cpusets or the zonelist cache. This is actually a heavy function that touches a number of cache lines.

This patch stores the number of online nodes at boot time and updates the value when nodes get onlined and offlined. The value is then used in a number of important paths in place of num_online_nodes().

[[email protected]: do not override definition of node_set_online() with macro]
Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
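A hedged sketch of the substitution in a page-allocator-style fast path; nr_online_nodes is the cached value the changelog describes, and use_filtered_zonelist() stands in for the real cpuset/zonelist-cache logic:

	/* Before: num_online_nodes() computes a bit-count over the whole
	 * node_states[N_ONLINE] mask on every call. */
	if (num_online_nodes() > 1)
		use_filtered_zonelist();	/* hypothetical helper */

	/* After: one load of a counter maintained when nodes are
	 * onlined or offlined. */
	if (nr_online_nodes > 1)
		use_filtered_zonelist();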
2008-04-28  mempolicy: add bitmap_onto() and bitmap_fold() operations  (Paul Jackson, 1 file, -1/+21)

The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(), with the usual cpumask and nodemask wrappers.

The bitmap_onto() operator computes one bitmap relative to another. If the n-th bit in the origin mask is set, then the m-th bit of the destination mask will be set, where m is the position of the n-th set bit in the relative mask.

The bitmap_fold() operator folds a bitmap into a second that has bit m set iff the input bitmap has some bit n set, where m == n mod sz, for the specified sz value.

There are two substantive changes between this patch and its predecessor bitmap_relative:

1) Renamed bitmap_relative() to be bitmap_onto().
2) Added bitmap_fold().

The essential motivation for bitmap_onto() is to provide a mechanism for converting a cpuset-relative CPU or Node mask to an absolute mask. Cpuset-relative masks are written as if the current task were in a cpuset whose CPUs or Nodes were just the consecutive ones numbered 0..N-1, for some N.

The bitmap_onto() operator is provided in anticipation of adding support for the first such cpuset-relative mask, by the mbind() and set_mempolicy() system calls, using a planned flag of MPOL_F_RELATIVE_NODES. These bitmap operators (and their nodemask wrappers, in particular) will be used in code that converts the user-specified cpuset-relative memory policy to a specific system-node-numbered policy, given the current mems_allowed of the task's cpuset.

Such cpuset-relative mempolicies will address two deficiencies of the existing interface between cpusets and mempolicies:

1) A task cannot at present reliably establish a cpuset-relative mempolicy because there is an essential race condition: the task's cpuset may be changed between the time the task can query its cpuset placement and the time the task can issue the applicable mbind or set_mempolicy system call.

2) A task cannot at present establish what cpuset-relative mempolicy it would like to have, if it is in a smaller cpuset than it might have mempolicy preferences for, because the existing interface only allows specifying mempolicies for nodes currently allowed by the cpuset.

Cpuset-relative mempolicies are useful for tasks that don't distinguish particularly between one CPU or Node and another, but only between how many of each are allowed, and the proper placement of threads and memory pages on the various CPUs and Nodes available.

The motivation for the added bitmap_fold() can be seen in the following example. Let's say an application has specified some mempolicies that presume 16 memory nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset-relative) nodes 12-15. Then let's say that application is crammed into a cpuset that only has 8 memory nodes, 0-7. If one just uses bitmap_onto(), this mempolicy, mapped to that cpuset, would ignore the requested relative nodes above 7, leaving it empty of nodes. That's not good; better to fold the higher nodes down, so that some nodes are included in the resulting mapped mempolicy. In this case, the mempolicy nodes 12-15 are taken modulo 8 (the weight of the mems_allowed of the confining cpuset), resulting in a mempolicy specifying nodes 4-7.

Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
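A hedged worked sketch of the two operators through the nodemask wrappers this commit adds (nodes_onto()/nodes_fold()), reusing the changelog's numbers:

	nodemask_t dst, orig, relmap;

	/* nodes_onto(): map cpuset-relative nodes to absolute nodes.
	 * With relmap = {8,9,10,11} (a 4-node cpuset) and orig = {1,2}
	 * (the cpuset-relative 2nd and 3rd nodes), bit n of orig picks
	 * the n-th set bit of relmap, so dst becomes {9,10}. */
	nodes_onto(dst, orig, relmap);

	/* nodes_fold(): fold a too-wide mask down to sz nodes.
	 * With orig = {12,13,14,15} and sz = 8, each n maps to n % 8,
	 * so dst becomes {4,5,6,7} instead of ending up empty. */
	nodes_fold(dst, orig, 8);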
2007-10-16  Memoryless nodes: Add N_CPU node state  (Christoph Lameter, 1 file, -0/+1)

We need the check for a node with cpus in zone reclaim: zone reclaim will not allow remote zone reclaim if a node has a cpu.

[[email protected]: Move setup of N_CPU node state mask]
Signed-off-by: Christoph Lameter <[email protected]>
Tested-by: Lee Schermerhorn <[email protected]>
Acked-by: Bob Picco <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2007-10-16  Memoryless nodes: introduce mask of nodes with memory  (Christoph Lameter, 1 file, -2/+8)

It is necessary to know if nodes have memory, since we have recently begun to add support for memoryless nodes. For that purpose we introduce two new node states: N_HIGH_MEMORY and N_NORMAL_MEMORY.

A node has its bit in N_HIGH_MEMORY set if it has any memory, regardless of the type of memory. If a node has memory, then it has at least one zone defined in its pgdat structure that is located in the pgdat itself.

A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than ZONE_HIGHMEM. This means it is possible to allocate memory that is not subject to kmap.

N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to ensure that we do the right thing when we encounter a memoryless node.

[[email protected]: build fix]
[[email protected]: update N_HIGH_MEMORY node state for memory hotadd]
[[email protected]: Fix memory hotplug + sparsemem build]
Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Nishanth Aravamudan <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Acked-by: Bob Picco <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Yasunori Goto <[email protected]>
Signed-off-by: Paul Mundt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
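A hedged sketch of the distinction between the two states at a call site; node_state() is the accessor from this header, while the helpers and allocation shown are illustrative:

	/* Any memory at all (highmem counts): fine for per-node data
	 * reached through kmap-aware interfaces. */
	if (node_state(nid, N_HIGH_MEMORY))
		setup_per_node_data(nid);	/* hypothetical helper */

	/* Memory below ZONE_HIGHMEM: safe for kernel allocations that
	 * must never require kmap. */
	if (node_state(nid, N_NORMAL_MEMORY))
		addr = vzalloc_node(size, nid);	/* illustrative */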
2007-10-16  Memoryless nodes: Generic management of nodemasks for various purposes  (Christoph Lameter, 1 file, -16/+71)

Why do we need to support memoryless nodes?

KAMEZAWA Hiroyuki <[email protected]> wrote:

> For Fujitsu, the problem is called the "empty" node.
>
> When ACPI's SRAT table includes "possible nodes", the ia64 bootstrap (acpi_numa_init) creates nodes which include no memory and no cpu.
>
> I tried to remove empty nodes in the past, but that was denied. It was because we can hot-add a cpu to the empty node. (node hotplug triggered by cpu is not implemented now, and it will be ugly.)
>
> For HP (Lee can comment on this later), they have memoryless nodes. As far as I hear, HP's machines can have the following configuration:
>
> (example)
> Node0: CPU0 memory AAA MB
> Node1: CPU1 memory AAA MB
> Node2: CPU2 memory AAA MB
> Node3: CPU3 memory AAA MB
> Node4: Memory XXX GB
>
> AAA is a very small value (below 16MB) and will be omitted by the ia64 bootstrap. After boot, only Node 4 has valid memory (but no cpu).
>
> Maybe this is memory interleave by firmware config.

Christoph Lameter <[email protected]> wrote:

> Future SGI platforms (actually also current ones can, but nothing like that is deployed to my knowledge) have nodes with only cpus. Current SGI platforms have nodes with just I/O that we so far cannot manage in the core. So the arch code maps them to the nearest memory node.

Lee Schermerhorn <[email protected]> wrote:

> For the HP platforms, we can configure each cell with from 0% to 100% "cell local memory". When we configure with <100% CLM, the "missing percentages" are interleaved by hardware on a cache-line granularity to improve bandwidth at the expense of latency for numa-challenged applications [and OSes, but not our problem ;-)]. When we boot Linux on such a config, all of the real nodes have no memory--it all resides in a single interleaved pseudo-node.
>
> When we boot Linux on a 100% CLM configuration [== NUMA], we still have the interleaved pseudo-node. It contains a few hundred MB stolen from the real nodes to contain the DMA zone. [Interleaved memory resides at phys addr 0]. The memoryless-nodes patches, along with the zoneorder patches, support this config as well.
>
> Also, when we boot a NUMA config with the "mem=" command line, specifying less memory than actually exists, Linux takes the excluded memory "off the top" rather than distributing it across the nodes. This can result in memoryless nodes, as well.

This patch: Preparation for the memoryless node patches. Provide a generic way to keep nodemasks describing various characteristics of NUMA nodes. Remove node_online_map and node_possible_map and realize the same functionality using two node states: N_POSSIBLE and N_ONLINE.

[[email protected]: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config]
Signed-off-by: Christoph Lameter <[email protected]>
Tested-by: Lee Schermerhorn <[email protected]>
Acked-by: Lee Schermerhorn <[email protected]>
Acked-by: Bob Picco <[email protected]>
Cc: Nishanth Aravamudan <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
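A hedged sketch of the generic mechanism this changelog describes - a single array of nodemasks indexed by a node-state enum, with the old maps re-expressed as accessors over it; this is a paraphrase of the shape, not the literal patch:

	/* One array of masks replaces node_online_map/node_possible_map. */
	enum node_states {
		N_POSSIBLE,	/* the node could become online at some point */
		N_ONLINE,	/* the node is online */
		NR_NODE_STATES
	};

	extern nodemask_t node_states[NR_NODE_STATES];

	/* The old interfaces become thin wrappers over node_states[]: */
	#define node_online(node)	node_state((node), N_ONLINE)
	#define num_online_nodes()	num_node_state(N_ONLINE)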
2007-02-20  [PATCH] Replace highest_possible_node_id() with nr_node_ids  (Christoph Lameter, 1 file, -2/+2)

highest_possible_node_id() is currently used to calculate the last possible node id so that the network subsystem can figure out how to size per-node arrays.

I think having the ability to determine the maximum number of nodes in a system at runtime is useful, but then we should name this entry correspondingly; it should return the number of node ids, and the value needs to be set up only once at bootup. The node_possible_map does not change after bootup.

This patch introduces nr_node_ids and replaces the use of highest_possible_node_id(). nr_node_ids is calculated at bootup when the page allocator's pagesets are initialized.

[[email protected]: fix oops]
Signed-off-by: Christoph Lameter <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Signed-off-by: Frederik Deweerdt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
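A hedged sketch of the sizing pattern nr_node_ids enables, in the spirit of the per-node arrays the changelog mentions; the array name is hypothetical:

	/* nr_node_ids is fixed at boot (node_possible_map doesn't change),
	 * so it can safely size a per-node pointer array once. */
	static void **per_node_data;	/* hypothetical array */

	static int alloc_per_node(void)
	{
		per_node_data = kcalloc(nr_node_ids, sizeof(*per_node_data),
					GFP_KERNEL);
		return per_node_data ? 0 : -ENOMEM;
	}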
2006-10-11  [PATCH] bitmap: parse input from kernel and user buffers  (Reinette Chatre, 1 file, -7/+7)

lib/bitmap.c:bitmap_parse() is a library function that received as input a user buffer. This seems to have originated from the way the write_proc function of the /proc filesystem operates.

It has been reworked to not use kmalloc and to eliminate a lot of get_user() overhead by performing one access_ok() before using __get_user(). We need to test whether we are in kernel or user space (is_user) and access the buffer differently. We cannot use __get_user() to access kernel addresses in all cases; for example, in architectures with separate address spaces for kernel and user.

This function will be useful for other uses as well; for example, taking input for /sysfs instead of /proc, so it was changed to accept kernel buffers. We have this use for the Linux UWB project, as part of the upcoming bandwidth allocator code.

Only a few routines used this function, and they were changed too.

Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Inaky Perez-Gonzalez <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Joe Korty <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
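A hedged sketch of the split after the rework - bitmap_parse() for kernel buffers, bitmap_parse_user() for user buffers - shown here filling a nodemask from a sysfs-style kernel string; the wrapper function is illustrative:

	static int parse_mask_from_sysfs(const char *kbuf, size_t len,
					 nodemask_t *mask)
	{
		/* Kernel buffer: no access_ok()/__get_user() dance needed. */
		return bitmap_parse(kbuf, len, mask->bits, MAX_NUMNODES);
	}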
2006-10-02  [PATCH] cpumask: add highest_possible_node_id  (Greg Banks, 1 file, -0/+2)

cpumask: add highest_possible_node_id(), analogous to highest_possible_processor_id().

[[email protected]: fix typo]
Signed-off-by: Greg Banks <[email protected]>
Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2006-03-27  [PATCH] define for_each_online_pgdat  (KAMEZAWA Hiroyuki, 1 file, -0/+4)

This patch defines for_each_online_pgdat() as a replacement for for_each_pgdat().

Now, online nodes are managed by node_online_map, but for_each_pgdat() uses pgdat_link to iterate over all nodes (pgdats). This means the management structure for online pgdats is duplicated. I think using node_online_map for for_each_pgdat() is simpler and saner than pgdat_link, so the new macro is named for_each_online_pgdat(). A following patch will fix the callers of for_each_pgdat().

The bootmem allocator uses for_each_pgdat() before pgdat initialization. I don't think that's sane; a following patch will fix it.

Signed-off-by: Yasunori Goto <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
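A hedged sketch of the replacement iterator in use - it walks the pgdats of online nodes via node_online_map rather than the pgdat_link chain; the summing function around it is illustrative:

	static unsigned long total_present_pages(void)
	{
		struct pglist_data *pgdat;
		unsigned long pages = 0;

		/* Iterates the pgdats of online nodes, in node-id order. */
		for_each_online_pgdat(pgdat)
			pages += pgdat->node_present_pages;

		return pages;
	}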
2006-02-07  [PATCH] remove bogus asm/bug.h includes.  (Al Viro, 1 file, -1/+0)

A bunch of asm/bug.h includes are both not needed (since it will get pulled in anyway) and bogus (since they are done too early). Removed.

Signed-off-by: Al Viro <[email protected]>
2005-10-30  [PATCH] cpusets: bitmap and mask remap operators  (Paul Jackson, 1 file, -0/+20)

In the forthcoming task migration support, a key calculation will be mapping cpu and node numbers from the old set to the new set while preserving cpuset-relative offset.

For example, if a task and its pages on nodes 8-11 are being migrated to nodes 24-27, then pages on node 9 (the 2nd node in the old set) should be moved to node 25 (the 2nd node in the new set).

As with other bitmap operations, the proper way to code this is to provide the underlying calculation in lib/bitmap.c, and then to provide the usual cpumask and nodemask wrappers. This patch provides that. These operations are termed 'remap' operations. Both remapping a single bit and a set of bits is supported.

Signed-off-by: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
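A hedged sketch of the nodemask wrapper applied to the changelog's example (nodes 8-11 migrating to 24-27); nodes_remap() and node_set() are the wrappers this header provides:

	nodemask_t src = NODE_MASK_NONE, dst = NODE_MASK_NONE;
	nodemask_t from = NODE_MASK_NONE, to = NODE_MASK_NONE;

	/* Old placement: nodes 8-11; new placement: nodes 24-27. */
	node_set(8, from); node_set(9, from); node_set(10, from); node_set(11, from);
	node_set(24, to);  node_set(25, to);  node_set(26, to);  node_set(27, to);

	node_set(9, src);		/* a page on node 9: 2nd node of old set */
	nodes_remap(dst, src, from, to);
	/* dst now contains node 25 - the 2nd node of the new set. */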
2005-04-16  Linux-2.6.12-rc2  (Linus Torvalds, 1 file, -0/+356)
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!