path: root/arch/x86/mm/numa_32.c
Age | Commit message | Author | Files | Lines
2024-04-25 | fix missing vmalloc.h includes | Kent Overstreet | 1 | -0/+1
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. 
Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/[email protected]/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [[email protected]: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arc build] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Tested-by: Kees Cook <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Gaynor <[email protected]> Cc: Alice Ryhl <[email protected]> Cc: Andreas Hindborg <[email protected]> Cc: Benno Lossin <[email protected]> Cc: "Björn Roy Baron" <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Gary Guo <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
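For arch/x86/mm/numa_32.c itself the fixup is the usual one-line include; a minimal sketch of the kind of change this series applies (the surrounding includes are illustrative, not the verbatim hunk):

  /* arch/x86/mm/numa_32.c -- sketch, not the exact patch context */
  #include <linux/memblock.h>
  #include <linux/init.h>
  #include <linux/vmalloc.h>  /* previously pulled in implicitly via another header */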
2024-04-04 | x86/numa/32: Include missing <asm/pgtable_areas.h> | Arnd Bergmann | 1 | -0/+1
The __vmalloc_start_set declaration is in a header that is not included in numa_32.c in current linux-next:

  arch/x86/mm/numa_32.c: In function 'initmem_init':
  arch/x86/mm/numa_32.c:57:9: error: '__vmalloc_start_set' undeclared (first use in this function)
     57 |         __vmalloc_start_set = true;
        |         ^~~~~~~~~~~~~~~~~~~
  arch/x86/mm/numa_32.c:57:9: note: each undeclared identifier is reported only once for each function it appears in

Add an explicit #include.

Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
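A minimal sketch of the fixed file (placement and surrounding code are illustrative):

  #include <linux/init.h>
  #include <asm/pgtable_areas.h>  /* declares __vmalloc_start_set */

  void __init initmem_init(void)
  {
          /* ... node setup elided ... */
          __vmalloc_start_set = true;  /* now compiles: the declaration is visible */
  }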
2020-05-28 | x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit | Mike Rapoport | 1 | -34/+0
The DISCONTIGMEM support was marked as deprecated in v5.2 and since there were no complaints about it for almost 5 releases it can be completely removed.

Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2018-10-31 | mm: remove include/linux/bootmem.h | Mike Rapoport | 1 | -1/+0
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below, followed by semi-automated removal of duplicated '#include <linux/memblock.h>' lines.

  @@
  @@
  - #include <linux/bootmem.h>
  + #include <linux/memblock.h>

[[email protected]: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Serge Semin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2018-03-27 | x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logic | David Rientjes | 1 | -11/+0

node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it.

Signed-off-by: David Rientjes <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: f03574f2d5b2 ("x86-32, mm: Rip out x86_32 NUMA remapping code")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-05-09 | x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init() | Laura Abbott | 1 | -0/+1
'__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc addresses with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash:

  [mm/usercopy] 517e1fbeb6: kernel BUG at arch/x86/mm/physaddr.c:78!

Set '__vmalloc_start_set' appropriately for that case as well.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Laura Abbott <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: dc16ecf7fd1f ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2016-07-14 | x86/mm: Audit and remove any unnecessary uses of module.h | Paul Gortmaker | 1 | -1/+1
Historically a lot of these existed because we did not have a distinction between what was modular code and what was providing support to modules via EXPORT_SYMBOL and friends. That changed when we forked out support for the latter into the export.h file. This means we should be able to reduce the usage of module.h in code that is obj-y in a Makefile or bool in Kconfig.

The advantage in doing so is that module.h itself sources about 15 other headers, adding significantly to what we feed cpp, and it can obscure what headers we are effectively using.

Since module.h was the source for init.h (for __init) and for export.h (for EXPORT_SYMBOL), we consider each obj-y/bool instance for the presence of either and replace accordingly where needed. Note that some bool/obj-y instances remain since module.h is the header for some exception table entry stuff, and for things like __init_or_module (code that is tossed when MODULES=n).

Signed-off-by: Paul Gortmaker <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
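The typical conversion looks like this (a sketch, assuming a file that only needed __init and EXPORT_SYMBOL):

  /* before: one include that drags in roughly 15 more headers */
  #include <linux/module.h>

  /* after: only what the translation unit actually uses */
  #include <linux/init.h>     /* __init */
  #include <linux/export.h>   /* EXPORT_SYMBOL */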
2014-02-01 | x86: Fix the initialization of physnode_map | Petr Tesarik | 1 | -0/+2
With DISCONTIGMEM, the mapping between a pfn and its owning node is initialized using data provided by the BIOS. However, the initialization may fail if the extents are not aligned to the section boundary (64M).

The symptom of this bug is an early boot failure in pfn_to_page(), as it tries to access NODE_DATA(__nid) using an index from an uninitialized element of the physnode_map[] array.

While the bug is always present, it is more likely to be hit in kdump kernels on large machines, because:

1. The memory map for a kdump kernel is specified as exactmap, and exactmap is more likely to be unaligned.
2. Large reservations are more likely to span across a 64M boundary.

[ hpa: fixed incorrect use of "pfn" instead of "start" ]

Signed-off-by: Petr Tesarik <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
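For reference, the mapping being initialized works roughly like this (a sketch; the real definitions live in the mmzone/numa headers and the exact names and qualifiers may differ):

  /* one nid entry per 64M section; -1 marks "never initialized" */
  s8 physnode_map[MAX_SECTIONS] = { [0 ... MAX_SECTIONS - 1] = -1 };

  static inline int pfn_to_nid(unsigned long pfn)
  {
          /* an extent not aligned to 64M can leave its section entry at -1;
             dereferencing NODE_DATA(-1) is the early boot failure above */
          return (int)physnode_map[pfn / PAGES_PER_SECTION];
  }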
2013-07-03 | mm/x86: prepare for removing num_physpages and simplify mem_init() | Jiang Liu | 1 | -2/+0
Prepare for removing num_physpages and simplify mem_init().

Signed-off-by: Jiang Liu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Jianguo Wu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2013-01-31 | x86-32, mm: Rip out x86_32 NUMA remapping code | Dave Hansen | 1 | -161/+0
This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen.

For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages: it would zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things.

We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376

I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys().

[ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ]

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
Cc: <[email protected]>
2011-07-14 | memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones | Tejun Heo | 1 | -3/+3
Other than sanity check and debug message, the x86 specific version of the memblock reserve/free functions are simple wrappers around the generic versions - memblock_reserve/free(). This patch adds debug messages with caller identification to the generic versions, replaces the x86 specific ones with them, and kills them. arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty after this change and removed.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Yinghai Lu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-07-13 | memblock: Kill MEMBLOCK_ERROR | Tejun Heo | 1 | -2/+2
25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made MEMBLOCK_ERROR 0 and there already is code which expects the error return to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its misery.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Yinghai Lu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-07-12 | x86, mm: s/PAGES_PER_ELEMENT/PAGES_PER_SECTION/ | Tejun Heo | 1 | -3/+3
DISCONTIGMEM on x86-32 implements pfn -> nid mapping similarly to SPARSEMEM; however, it calls each mapping unit ELEMENT instead of SECTION. This patch renames it to SECTION so that PAGES_PER_SECTION is valid for both DISCONTIGMEM and SPARSEMEM. This will be used by the next patch to implement a mapping granularity check.

This patch is a trivial constant rename.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Hans Rosenfeld <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-05-02 | x86, NUMA: Make 32bit use common NUMA init path | Tejun Heo | 1 | -229/+2
With both _numa_init() methods converted and the rest of the init code adjusted, numa_32.c now can switch from the 32bit only init code to the common one in numa.c.

* Shim get_memcfg_*()'s are dropped and initmem_init() calls x86_numa_init(), which is updated to handle NUMAQ.

* All boilerplate operations including node range limiting and pgdat alloc/init are handled by numa_init(). The 32bit only implementation is removed.

* 32bit numa_add_memblk(), numa_set_distance() and memory_add_physaddr_to_nid() removed and the common versions in numa.c enabled for 32bit.

This change causes the following behavior changes.

* NODE_DATA()->node_start_pfn/node_spanned_pages properly initialized for 32bit too.
* Much more sanity checks and configuration cleanups.
* Proper handling of node distances.
* The same NUMA init messages as 64bit.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86, NUMA: Initialize and use remap allocator from setup_node_bootmem() | Tejun Heo | 1 | -1/+1
setup_node_bootmem() is taken from 64bit and doesn't use the remap allocator. It's about to be shared with 32bit so add support for it. If NODE_DATA is remapped, it's noted in the debug message and the node locality check is skipped as the __pa() of the remapped address doesn't reflect the actual physical address.

On 64bit, the remap allocator becomes a noop and doesn't affect the behavior.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86-32, NUMA: Add @start and @end to init_alloc_remap() | Tejun Heo | 1 | -15/+14
Instead of dereferencing node_start/end_pfn[] directly, make init_alloc_remap() take @start and @end and let the caller be responsible for making sure the range is sane. This is to prepare for use from unified NUMA init code.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86, NUMA: Enable build of generic NUMA init code on 32bit | Tejun Heo | 1 | -3/+0
Generic NUMA init code was moved to numa.c from numa_64.c but is still guarded by CONFIG_X86_64. This patch removes the compile guard and enables compiling on 32bit.

* numa_add_memblk() and numa_set_distance() clash with the shim implementation in numa_32.c and are left out.

* memory_add_physaddr_to_nid() clashes with the 32bit implementation and is left out.

* MAX_DMA_PFN definition in dma.h moved out of !CONFIG_X86_32.

* node_data definition in numa_32.c removed in favor of the one in numa.c.

There are places where ulong is assumed to be 64bit. The next patch will fix them up. Note that although the code is compiled it isn't used yet and this patch doesn't cause any functional change.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86-32, NUMA: Update numaq to use new NUMA init protocol | Tejun Heo | 1 | -0/+23
Update numaq such that it calls numa_add_memblk() and sets numa_nodes_parsed instead of directly diddling with NUMA states. The original get_memcfg_numaq() is renamed to numaq_numa_init() and a new get_memcfg_numaq() is created in numa_32.c. The shim numa_add_memblk() implementation handles node_start/end_pfn[] and node_set_online() for nodes with memory.

The new get_memcfg_numaq() is exactly the same as get_memcfg_from_srat() other than calling the numaq init function. The things get_memcfg_numaq() does are not strictly necessary for numaq but are added for consistency and to help unifying NUMA init handling.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86-32, NUMA: Replace srat_32.c with srat.c | Tejun Heo | 1 | -0/+23
The SRAT support implementations in srat_32.c and srat.c are generally similar; however, there are some differences.

First of all, the 64bit implementation supports more types of SRAT entries: 64bit supports x2apic, affinity, memory and SLIT; 32bit only supports processor and memory.

Most other differences stem from the different initialization protocols employed by the 64bit and 32bit NUMA init paths.

On 64bit,

* Mappings among PXM, node and apicid are directly done in each SRAT entry callback.

* Memory affinity information is passed to numa_add_memblk() which takes care of all interfacing with NUMA init.

* Doesn't directly initialize NUMA configurations. All the information is recorded in numa_nodes_parsed and memblks.

On 32bit,

* Checks numa_off.

* Things go through one more level of indirection via private tables but eventually end up initializing the same mappings.

* node_start/end_pfn[] are initialized and memblock_x86_register_active_regions() is called for each memory chunk.

* node_set_online() is called for each online node.

* sort_node_map() is called.

There are also other minor differences in sanity checking and messages but taking the 64bit version should be good enough.

This patch drops the 32bit specific implementation and makes the 64bit implementation common for both 32 and 64bit. The init protocol differences are dealt with in two places - the numa_add_memblk() shim added in the previous patch and a new temporary numa_32.c:get_memcfg_from_srat() which wraps invocation of x86_acpi_numa_init().

The shim numa_add_memblk() handles the following.

* node_start/end_pfn[] initialization.
* node_set_online() for memory nodes.
* Invocation of memblock_x86_register_active_regions().

The shim get_memcfg_from_srat() handles the following.

* numa_off check.
* node_set_online() for CPU nodes.
* sort_node_map() invocation.
* Clearing of numa_nodes_parsed and active_ranges on failure.

The shims are temporary and will be removed as the generic NUMA init path in 32bit is replaced with the 64bit one.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86-32, NUMA: implement temporary NUMA init shims | Tejun Heo | 1 | -0/+34
To help the transition to common NUMA init, implement temporary 32bit shims for numa_add_memblk() and numa_set_distance(). numa_add_memblk() registers the memblk and adjusts node_start/end_pfn[]. numa_set_distance() is a noop.

These shims will allow using the 64bit NUMA init functions on 32bit and a gradual transition to the common NUMA init path. For a detailed description, please read the description of the commits which make use of the shim functions.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
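Based on the descriptions in this and the srat.c commit above, the shims look roughly like this (a hedged sketch, not the verbatim patch):

  /* register a memblk and keep the legacy node_start/end_pfn[] in sync */
  int __init numa_add_memblk(int nid, u64 start, u64 end)
  {
          unsigned long start_pfn = start >> PAGE_SHIFT;
          unsigned long end_pfn = end >> PAGE_SHIFT;

          memblock_x86_register_active_regions(nid, start_pfn, end_pfn);
          node_start_pfn[nid] = min(node_start_pfn[nid], start_pfn);
          node_end_pfn[nid] = max(node_end_pfn[nid], end_pfn);
          if (start < end)
                  node_set_online(nid);   /* memory node */
          return 0;
  }

  /* 32bit has no distance table yet, so the shim simply ignores distances */
  void __init numa_set_distance(int from, int to, int distance)
  {
  }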
2011-05-02 | x86-32, NUMA: Move get_memcfg_numa() into numa_32.c | Tejun Heo | 1 | -1/+10
There's no reason for get_memcfg_numa() to be implemented inline in mmzone_32.h. Move it to numa_32.c and also make get_memcfg_numa_flat() static.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
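After the move, the function is essentially the old inline body (a sketch following the pre-move mmzone_32.h logic; config guards are illustrative):

  static void __init get_memcfg_numa(void)
  {
  #ifdef CONFIG_X86_NUMAQ
          if (get_memcfg_numaq())
                  return;
  #elif defined(CONFIG_ACPI_SRAT)
          if (get_memcfg_from_srat())
                  return;
  #endif
          get_memcfg_numa_flat();     /* fallback: a single flat node */
  }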
2011-05-02 | x86-32, NUMA: use sparse_memory_present_with_active_regions() | Tejun Heo | 1 | -1/+0
Instead of calling memory_present() for each region from NUMA init, call sparse_memory_present_with_active_regions() from paging_init() similarly to x86-64.

For flat and numaq, this results in exactly the same memory_present() calls. For srat, if there are multiple memory chunks for a node, after this change, memory_present() will be called separately for each chunk instead of being called once to encompass the whole range, which doesn't cause any harm and actually is the better behavior.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
2011-05-02 | x86, NUMA: Unify 32/64bit numa_cpu_node() implementation | Tejun Heo | 1 | -5/+0
Currently, the only meaningful user of apic->x86_32_numa_cpu_node() is NUMAQ which returns a valid mapping only after the CPU is initialized during SMP bringup; thus, the previous patch to set apicid -> node in setup_local_APIC() makes __apicid_to_node[] always contain the correct mapping whether a custom apic->x86_32_numa_cpu_node() is used or not.

So, there is no reason to keep a separate 32bit implementation. We can always consult __apicid_to_node[]. Move the 64bit implementation from numa_64.c to numa.c and remove the 32bit implementation from numa_32.c.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
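The unified lookup amounts to something like this (a sketch of the common implementation described above):

  int numa_cpu_node(int cpu)
  {
          int apicid = early_per_cpu(x86_cpu_to_apicid, cpu);

          if (apicid != BAD_APICID)
                  return __apicid_to_node[apicid];
          return NUMA_NO_NODE;
  }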
2011-04-06 | x86-32, numa: Update remap allocator comments | Tejun Heo | 1 | -14/+42
Now that the remap allocator is cleaned up, update comments such that they are in docbook function description format and reflect the actual implementation.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Remove redundant node_remap_size[] | Tejun Heo | 1 | -6/+4
Remap area size can be determined from node_remap_start_vaddr[] and node_remap_end_vaddr[], making node_remap_size[] redundant. Remove it.

While at it, make resume_map_numa_kva() use @nr_pages for the number of pages instead of @size.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Remove now useless node_remap_offset[] | Tejun Heo | 1 | -11/+6
With the lowmem address reservation moved into init_alloc_remap(), node_remap_offset[] is no longer useful. Remove it and the related offset handling code.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Make pgdat allocation use alloc_remap() | Tejun Heo | 1 | -4/+3
pgdat allocation is handled differently from other remap allocations - it's reserved during initialization. There's no reason to handle this any differently. The remap allocator is initialized for every node, and if init failed the allocation will fail and pgdat allocation can fall back to generic code like anyone else.

Remove the special init-time pgdat reservation and make allocate_pgdat() use alloc_remap() like everyone else.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
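With the special reservation gone, allocate_pgdat() follows the same pattern as any other remap user (a sketch; the fallback details are illustrative, not the exact patch):

  static void __init allocate_pgdat(int nid)
  {
          NODE_DATA(nid) = alloc_remap(nid, ALIGN(sizeof(pg_data_t), PAGE_SIZE));
          if (!NODE_DATA(nid)) {
                  /* remap init failed: fall back to a plain lowmem memblock */
                  unsigned long pgdat_phys = memblock_find_in_range(
                                  min_low_pfn << PAGE_SHIFT,
                                  max_pfn_mapped << PAGE_SHIFT,
                                  sizeof(pg_data_t), PAGE_SIZE);
                  NODE_DATA(nid) = __va(pgdat_phys);
          }
  }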
2011-04-06 | x86-32, numa: Move remapping for remap allocator into init_alloc_remap() | Tejun Heo | 1 | -22/+7
There's no reason to perform the actual remapping separately. Collapse remap_numa_kva() into init_alloc_remap() and, while at it, make it less verbose.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Move lowmem address space reservation to init_alloc_remap() | Tejun Heo | 1 | -57/+25
Remap alloc init is done in the following stages.

1. init_alloc_remap() calculates how much memory is necessary for each node and reserves node local memory.

2. initmem_init() collects how much each node needs and reserves a single contiguous lowmem area which can contain all.

3. init_remap_allocator() initializes allocator parameters from the determined lowmem address and per-node offsets.

4. Actual remap happens.

There is no reason for the lowmem remap areas to be reserved as a single contiguous area at one go. They don't interact with each other and the memblock allocator will put them side-by-side anyway.

This patch breaks up the single lowmem address reservation, puts the per-node lowmem address reservation into init_alloc_remap(), and initializes the allocator parameters directly in that function, as all the addresses are determined there. This merges steps 2 and 3 into 1. While at it, remove the now largely irrelevant comments in init_alloc_remap().

This change causes the following behavior changes.

* Remap lowmem areas are allocated in smaller per-node chunks.
* A remap lowmem area reservation failure fails future remap allocations instead of panicking.
* Remap allocator initialization is less verbose.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Make init_alloc_remap() less panicky | Tejun Heo | 1 | -2/+5
Remap allocator failure isn't fatal. The callers are required to fall back to regular early memory allocation mechanisms on failure anyway, so there's no reason to panic on remap init failure. Whining and returning are enough.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
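Inside init_alloc_remap() the whine-and-return pattern looks roughly like this (a sketch under the memblock API of that era; message text and bounds are illustrative):

  node_pa = memblock_find_in_range(0, max_low_pfn << PAGE_SHIFT,
                                   size, LARGE_PAGE_BYTES);
  if (node_pa == MEMBLOCK_ERROR) {
          /* whine and return instead of panic(); callers fall back to
             regular early memory allocation */
          pr_warning("remap_alloc: failed to allocate %lu bytes for node %d\n",
                     size, nid);
          return;
  }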
2011-04-06 | x86-32, numa: Calculate remap size in common code | Tejun Heo | 1 | -6/+4
Only pgdat and memmap use the remap area and there isn't much benefit in allowing per-node override. In addition, the use of node_remap_size[] is confusing in that it contains the number of bytes before remap initialization and then the number of pages afterwards.

Move the remap size calculation for memmap from the specific NUMA config implementations to init_alloc_remap() and make node_remap_size[] static.

The only behavior difference is that, before this patch, numaq_32 didn't consider max_pfn when calculating the memmap size, but it's enforced after this patch, which is the right thing to do.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Make @size in init_alloc_remap() represent bytes | Tejun Heo | 1 | -11/+7
The @size variable in init_alloc_remap() is confusing in that it starts as a number of bytes, as its name implies, and then becomes a number of pages. Make it consistently represent bytes.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Rename @node_kva to @node_pa in init_alloc_remap() | Tejun Heo | 1 | -10/+9
init_alloc_remap() is about to do more, and using the _kva suffix for a physical address becomes confusing because the function will be handling both physical and virtual addresses. Rename @node_kva to @node_pa.

This is a trivial rename and doesn't cause any behavior difference.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Reorganize calculate_numa_remap_pages() | Tejun Heo | 1 | -63/+62
Separate the outer node walking loop and the per-node logic from calculate_numa_remap_pages(). The outer loop is collapsed into initmem_init() and the per-node logic is moved into a new function - init_alloc_remap().

The new function name is confusing given the existing init_remap_allocator(), and the behavior of the function isn't very clean either at this point, but this is to prepare for further cleanups and it will become prettier.

This change doesn't introduce any behavior difference.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Remove redundant top-down alloc code from remap initialization | Tejun Heo | 1 | -29/+14
memblock_find_in_range() now does top-down allocation by default, so there's no reason for its callers to explicitly implement it by gradually lowering the start address. Remove the redundant top-down allocation logic from initmem_init() and calculate_numa_remap_pages().

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2011-04-06 | x86-32, numa: Align pgdat size while initializing alloc_remap | Tejun Heo | 1 | -1/+2
When pgdat is reserved in init_remap_allocator(), a PAGE_SIZE aligned size will be used. Match the size alignment in initialization to avoid allocation failure down the road.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
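The alignment fix amounts to a one-liner in the size calculation (a sketch; the exact expression in the patch may differ):

  /* match init_remap_allocator(), which reserves a PAGE_SIZE-aligned pgdat,
     so the remap area cannot come up short by a sub-page amount */
  size = node_remap_size[nid] + ALIGN(sizeof(pg_data_t), PAGE_SIZE);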
2011-04-06 | x86-32, numa: Fix failure condition check in alloc_remap() | Tejun Heo | 1 | -1/+1
node_remap_{start|end}_vaddr[] describe [start, end) ranges; however, alloc_remap() incorrectly failed when the current allocation + size equaled the end, but it should fail only when it goes over. Fix it.

Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
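A sketch of the corrected check (the function shape follows the remap allocator described in these commits; details are illustrative):

  void *alloc_remap(int nid, unsigned long size)
  {
          void *allocation = node_remap_alloc_vaddr[nid];

          size = ALIGN(size, L1_CACHE_BYTES);

          /* '>' not '>=': for a [start, end) range an allocation ending
             exactly at end_vaddr is valid; the old check wasted that slot */
          if (!allocation || (allocation + size) > node_remap_end_vaddr[nid])
                  return NULL;

          node_remap_alloc_vaddr[nid] += size;
          memset(allocation, 0, size);
          return allocation;
  }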
2011-02-16 | x86, NUMA: Move *_numa_init() invocations into initmem_init() | Tejun Heo | 1 | -1/+1
There's no reason for these to live in setup_arch(). Move them inside initmem_init().

- v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds. Fixed. Found by Ankita.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ankita Garg <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Shaohui Zheng <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
2011-02-16 | x86, NUMA: Drop @start/last_pfn from initmem_init() | Tejun Heo | 1 | -2/+1
initmem_init() extensively accesses and modifies global data structures and the parameters aren't even followed depending on which path is being used. Drop @start/last_pfn and let it deal with @max_pfn directly. This is in preparation for further NUMA init cleanups.

- v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds. Fixed. Found by Yinghai.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Shaohui Zheng <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
2011-01-28 | x86: Unify NUMA initialization between 32 and 64bit | Tejun Heo | 1 | -0/+1
Now that everything else is unified, NUMA initialization can be unified too.

* numa_init_array() and init_cpu_to_node() are moved from numa_64 to numa.

* numa_32::initmem_init() is updated to call numa_init_array() and setup_arch() to call init_cpu_to_node() on 32bit too.

* x86_cpu_to_node_map is now initialized to NUMA_NO_NODE on 32bit too. This is safe now as numa_init_array() will initialize it early during boot.

This makes the NUMA mapping fully initialized before setup_per_cpu_areas() on 32bit too, and thus makes the first percpu chunk, which contains all the static variables and some of the dynamic area, allocated with NUMA affinity correctly considered.

Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reported-by: Eric Dumazet <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
2011-01-28 | x86: Unify cpu/apicid <-> NUMA node mapping between 32 and 64bit | Tejun Heo | 1 | -0/+6
The mapping between cpu/apicid and node is done via apicid_to_node[] on 64bit and apicid_2_node[] + apic->x86_32_numa_cpu_node() on 32bit. This difference makes it difficult to further unify 32 and 64bit NUMA handling.

This patch unifies it by replacing both apicid_to_node[] and apicid_2_node[] with the __apicid_to_node[] array, which is accessed by two accessors - set_apicid_to_node() and numa_cpu_node(). On 64bit, numa_cpu_node() always consults __apicid_to_node[] directly while 32bit goes through the apic->numa_cpu_node() method to allow apic implementations to override it.

srat_detect_node() for amd cpus contains a workaround for broken NUMA configuration which assumes a relationship between APIC ID, HT node ID and NUMA topology. Leave it to access __apicid_to_node[] directly as mapping through the CPU might result in an undesirable behavior change. The comment is reformatted and updated to note the ugliness.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: David Rientjes <[email protected]>
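The unified array and its setter amount to something like this (a sketch; the array bound and section qualifiers are illustrative):

  s16 __apicid_to_node[MAX_LOCAL_APIC] = {
          [0 ... MAX_LOCAL_APIC - 1] = NUMA_NO_NODE
  };

  static inline void set_apicid_to_node(int apicid, s16 node)
  {
          __apicid_to_node[apicid] = node;
  }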
2010-10-05 | x86-32, memblock: Make add_highpages honor early reserved ranges | Yinghai Lu | 1 | -2/+0
Originally the only early reserved range that overlapped with high pages was "KVA RAM", and we already remove that from the active ranges. However, it turns out Xen could have that kind of overlap to support memory ballooning.

So we need to make add_highpage_with_active_regions() subtract memblock reserved ranges just like low ram; this is the proper design anyway.

In this patch, refactor get_free_all_memory_range() so it can be used by add_highpage_with_active_regions(). Also, we no longer need to remove "KVA RAM" from the active ranges.

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27 | x86: Remove old bootmem code | Yinghai Lu | 1 | -3/+0
Requested by Ingo, Thomas and HPA. The old bootmem code is no longer necessary, and the transition is complete. Remove it.

Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2010-08-27 | x86, memblock: Replace e820_/_early string with memblock_ | Yinghai Lu | 1 | -12/+13
1. Include linux/memblock.h directly, so later we can reduce e820.h references.
2. This patch is done by sed scripts mainly.

-v2: use MEMBLOCK_ERROR instead of -1ULL or -1UL

Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2010-02-12 | x86: Make 32bit support NO_BOOTMEM | Yinghai Lu | 1 | -0/+3
Let's make 32bit consistent with 64bit.

-v2: Andrew pointed out for 32bit that we should use -1ULL

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
2009-10-12 | x86: Export k8 physical topology | David Rientjes | 1 | -2/+2
To eventually interleave emulated nodes over physical nodes, we need to know the physical topology of the machine without actually registering it. This does the k8 node setup in two parts: detection and registration. NUMA emulation can then use the physical topology detected to set up the address ranges of emulated nodes accordingly. If emulation isn't used, the k8 nodes are registered as normal.

Two formals are added to the x86 NUMA setup functions: `acpi' and `k8'. These represent whether ACPI or K8 NUMA has been detected; both cannot be true at the same time. This specifies to the NUMA emulation code whether an underlying physical NUMA topology exists and which interface to use.

This patch deals solely with separating the k8 setup path into Northbridge detection and registration steps and leaves the ACPI changes for a subsequent patch. The `acpi' formal is added here, however, to avoid touching all the header files again in the next patch.

This approach also ensures emulated nodes will not span physical nodes so the true memory latency is not misrepresented.

k8_get_nodes() may now be used to export the k8 physical topology of the machine for NUMA emulation.

Signed-off-by: David Rientjes <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Ankita Garg <[email protected]>
Cc: Len Brown <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2009-04-17 | x86: mm/numa_32.c calculate_numa_remap_pages should use __init | Jaswinder Singh Rajput | 1 | -1/+1
calculate_numa_remap_pages() is called only by __init initmem_init(); further, calculate_numa_remap_pages() calls __init find_e820_area() and __init reserve_early(). So calculate_numa_remap_pages() should be __init calculate_numa_remap_pages().

  WARNING: arch/x86/built-in.o(.text+0x82ea3): Section mismatch in reference from the function calculate_numa_remap_pages() to the function .init.text:find_e820_area()
  The function calculate_numa_remap_pages() references the function __init find_e820_area().
  This is often because calculate_numa_remap_pages lacks a __init annotation or the annotation of find_e820_area is wrong.

  WARNING: arch/x86/built-in.o(.text+0x82f5f): Section mismatch in reference from the function calculate_numa_remap_pages() to the function .init.text:reserve_early()
  The function calculate_numa_remap_pages() references the function __init reserve_early().
  This is often because calculate_numa_remap_pages lacks a __init annotation or the annotation of reserve_early is wrong.

[ Impact: save memory, address Section mismatch warning ]

Signed-off-by: Jaswinder Singh Rajput <[email protected]>
Cc: Sam Ravnborg <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
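The fix itself is just the annotation (a sketch; body elided):

  static __init unsigned long calculate_numa_remap_pages(void)
  {
          /* ... calls __init find_e820_area() and __init reserve_early(),
             so this function must itself live in .init.text ... */
          return 0;   /* body elided in this sketch */
  }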
2009-03-04 | x86: fix bootmem cross node for 32bit numa | Yinghai Lu | 1 | -2/+3
Impact: fix panic on system with 2g x4 sockets

Found one system with 4 sockets, each with 2g, that can not boot with numa32 because boot mem is crossing nodes. So try to have a numa version of setup_bootmem_allocator().

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Andrew Morton <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2009-03-04 | Merge branches 'x86/apic', 'x86/cpu', 'x86/fixmap', 'x86/mm', 'x86/sched', 'x86/setup-lzma', 'x86/signal' and 'x86/urgent' into x86/core | Ingo Molnar | 1 | -26/+0
2009-03-03 | x86: set_highmem_pages_init() cleanup | Pekka Enberg | 1 | -26/+0
Impact: cleanup

This patch moves set_highmem_pages_init() to arch/x86/mm/highmem_32.c. The declaration of the function is kept in asm/numa_32.h because asm/highmem.h is included only if CONFIG_HIGHMEM is enabled, so we can't put the empty static inline function there.

Signed-off-by: Pekka Enberg <[email protected]>
LKML-Reference: <1236082212.2675.24.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <[email protected]>
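The constraint described above leaves the declaration pair looking roughly like this in asm/numa_32.h (a sketch):

  #ifdef CONFIG_HIGHMEM
  extern void set_highmem_pages_init(void);   /* now defined in highmem_32.c */
  #else
  static inline void set_highmem_pages_init(void) { }
  #endif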