|
Kernel style prefers a single string over split strings when the string is
'user-visible'.
Miscellanea:
- Add a missing newline
- Realign arguments
Signed-off-by: Joe Perches <[email protected]>
Acked-by: Tejun Heo <[email protected]> [percpu]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
online_pages() simply returns an error value if
memory_notify(MEM_GOING_ONLINE, &arg) returns a value other than what is
needed to successfully online the target pages. This patch aims to print
more failure information in online_pages(), as offline_pages() already does.
This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and changes
__offline_pages() to stop printing failure information at KERN_INFO,
per David Rientjes's suggestion [1].
[1] https://lkml.org/lkml/2016/2/24/1094
Signed-off-by: Chen Yucong <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The success of CMA allocation largely depends on the success of
migration, and a key factor there is the page reference count. Until now,
page references have been manipulated by calling atomic functions directly,
so we cannot track who manipulates them and where, which makes it hard to
find the actual reason for a CMA allocation failure. CMA allocation should
be guaranteed to succeed, so finding the offending place is really important.
In this patch, call sites where the page reference is manipulated are
converted to the newly introduced wrapper functions. This is a preparation
step for adding a tracepoint to each page reference manipulation function.
With that facility, we can easily find the reason for a CMA allocation
failure. There is no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempts to access it directly (suggested by Andrew).
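As an illustration of the kind of wrapper this introduces (a minimal
sketch; the real helpers live in include/linux/page_ref.h and the field
was still called _count at this point):
static inline int page_ref_count(struct page *page)
{
        return atomic_read(&page->_count);
}
static inline void page_ref_inc(struct page *page)
{
        atomic_inc(&page->_count);
        /* a tracepoint hook for tracking reference changes can be added here later */
}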
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can reuse the nid we've determined instead of repeated pfn_to_nid()
usages. Also zone_to_nid() should be a bit cheaper in general than
pfn_to_nid().
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overall memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more difficult to evaluate.
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
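A simplified sketch of the per-node kthread described above (illustrative;
the real kcompactd() in mm/compaction.c also tracks the requested order and
classzone index, and handles freezing and hotplug):
static int kcompactd(void *p)
{
        pg_data_t *pgdat = (pg_data_t *)p;
        while (!kthread_should_stop()) {
                wait_event_freezable(pgdat->kcompactd_wait,
                                     kcompactd_work_requested(pgdat));
                /* sync compaction only - there is no allocation latency to minimize */
                kcompactd_do_work(pgdat);
        }
        return 0;
}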
[[email protected]: fix build errors with kcompactd]
[[email protected]: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Paul Gortmaker <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a report of a performance drop during hugepage allocation where
half of the CPU time is spent in pageblock_pfn_to_page() in
compaction [1].
In that workload, compaction is triggered to make hugepages, but most
pageblocks are unavailable for compaction due to their pageblock type and
skip bit, so compaction usually fails. The most costly operation in this
case is finding a valid pageblock while scanning the whole zone range. To
check whether a pageblock is valid to compact, a valid pfn within the
pageblock is required, and we can obtain it by calling
pageblock_pfn_to_page(). This function checks whether the pageblock is in
a single zone and returns a valid pfn if possible. The problem is that we
need to perform this check every time before scanning a pageblock, even if
we re-visit it, and this turns out to be very expensive in this workload.
Although we have no way to skip this pageblock check on systems where
holes can exist at arbitrary positions, we can use a cached value for zone
contiguity and just do pfn_to_page() on systems where no hole exists.
This optimization considerably speeds up the above workload.
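A sketch of the resulting fast path (close to the code this patch adds; the
slow path keeps the original per-call check):
static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                          unsigned long end_pfn, struct zone *zone)
{
        if (zone->contiguous)
                return pfn_to_page(start_pfn);  /* no holes in this zone, pfn_to_page() is enough */
        /* fall back to checking that the pageblock lies within a single zone */
        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
}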
Before vs After
Max: 1096 MB/s vs 1325 MB/s
Min: 635 MB/s vs 1015 MB/s
Avg: 899 MB/s vs 1194 MB/s
Avg is improved by roughly 30% [2].
[1]: http://www.spinics.net/lists/linux-mm/msg97378.html
[2]: https://lkml.org/lkml/2015/12/9/23
[[email protected]: don't forget to restore zone->contiguous on error path, per Vlastimil]
Signed-off-by: Joonsoo Kim <[email protected]>
Reported-by: Aaron Lu <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Tested-by: Aaron Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, all newly added memory blocks remain in the 'offline' state
unless someone onlines them; some Linux distributions carry special udev
rules like:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
to make this happen automatically. This is not a great solution for
virtual machines where memory hotplug is being used to address high
memory pressure situations, as such onlining is slow and the userspace
process doing it (udev) has a chance of being killed by the OOM killer,
since it will probably need to allocate some memory.
Introduce a default policy for newly added memory blocks via the
/sys/devices/system/memory/auto_online_blocks file, with two possible
values: "offline", which preserves the current behavior, and "online",
which causes all newly added memory blocks to go online as soon as
they're added. The default is "offline".
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Daniel Kiper <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Daniel Kiper <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: David Vrabel <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Igor Mammedov <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Set IORESOURCE_SYSTEM_RAM in struct resource.flags of "System
RAM" entries.
Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: David Vrabel <[email protected]> # xen
Cc: Andrew Banman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Gu Zheng <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luis R. Rodriguez <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: [email protected]
Cc: linux-mm <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
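For reference, the descriptor introduced for this looks roughly like the
following (a sketch of struct vmem_altmap; field semantics are simplified):
struct vmem_altmap {
        const unsigned long base_pfn;   /* first pfn of the device memory backing the map */
        const unsigned long reserve;    /* pages reserved at the start, never handed out */
        unsigned long free;             /* pages available for vmemmap_alloc_block_buf() */
        unsigned long align;
        unsigned long alloc;            /* pages already allocated from the altmap */
};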
Signed-off-by: Dan Williams <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
An out-of-memory condition is not a bug, and while we can't add new memory
in that case, crashing the system seems wrong. Propagating the return
value from register_memory_resource() requires an interface change.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Igor Mammedov <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Sheng Yong <[email protected]>
Cc: Zhu Guihua <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Vrabel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range. pfn_valid_within() always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test and leading to a kernel oops.
Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.
This also prevents a crash from offlining memory devices with missing
sections. Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.
Signed-off-by: Andrew Banman <[email protected]>
Acked-by: Alex Thorlton <[email protected]>
Cc: Russ Anderson <[email protected]>
Cc: Alex Thorlton <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime by introducing an 'enum memmap_context'
parameter to memmap_init_zone() and init_currently_empty_zone(). This
parameter is intended to be used to tell whether the call of these two
functions is being made on behalf of a hotplug event, or happening at
boot-time. However, init_currently_empty_zone() does not use this
parameter at all, so remove it.
Signed-off-by: Yaowei Bai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add add_memory_resource() to add memory using an existing "System RAM"
resource. This is useful if the memory region is being located by
finding a free resource slot with allocate_resource().
Xen guests will make use of this in their balloon driver to hotplug
arbitrary amounts of memory in response to toolstack requests.
Signed-off-by: David Vrabel <[email protected]>
Reviewed-by: Daniel Kiper <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
|
|
node_data for a node.
Commit f9126ab9241f ("memory-hotplug: fix wrong edge when hot add a new
node") added the hot-added memory range to memblock after creating the
pgdat for the new node.
But there is a problem:
add_memory()
|--> hotadd_new_pgdat()
|--> free_area_init_node()
|--> get_pfn_range_for_nid()
|--> find start_pfn and end_pfn in memblock
|--> ......
|--> memblock_add_node(start, size, nid) -------- Here, just too late.
get_pfn_range_for_nid() will find that start_pfn and end_pfn are both 0.
As a result, when adding memory, dmesg will give the following wrong
message.
Initmem setup node 5 [mem 0x0000000000000000-0xffffffffffffffff]
On node 5 totalpages: 0
Built 5 zonelists in Node order, mobility grouping on. Total pages: 32588823
Policy zone: Normal
init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
The solution is simple, just add the memory range to memblock a little
earlier, before hotadd_new_pgdat().
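A rough sketch of the reordering described (names as in the call chain
above; the real add_memory() also handles the error paths):
/* before this patch, memblock_add_node() ran after hotadd_new_pgdat() */
memblock_add_node(start, size, nid);    /* make the range visible to memblock first */
...
pgdat = hotadd_new_pgdat(nid, start);   /* get_pfn_range_for_nid() now sees the new range */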
[[email protected]: coding-style fixes]
Signed-off-by: Tang Chen <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Taku Izumi <[email protected]>
Cc: Gu Zheng <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: <[email protected]> [4.2.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents these pages from being considered for
ordinary allocations and otherwise marks them as distinct from typical
uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jerome Glisse <[email protected]>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
|
|
When we add a new node, the edge of memory may be wrong.
e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G],
1. hotremove the node3,
2. then hotadd node3 with a part of memory, mem:[26G-30G],
3. call hotadd_new_pgdat()
free_area_init_node()
get_pfn_range_for_nid()
4. it will return the wrong start_pfn and end_pfn, because we have not
updated the memblock.
This patch also fixes a BUG_ON during hot-addition, please see
http://marc.info/?l=linux-kernel&m=142961156129456&w=2
Signed-off-by: Xishi Qiu <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Taku Izumi <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Gu Zheng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 92923ca3aace ("mm: meminit: only set page reserved in the
memblock region") broke memory hotplug which expects the memmap for
newly added sections to be reserved until onlined by
online_pages_range(). This patch marks hotplugged pages as reserved
when adding new zones.
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: David Vrabel <[email protected]>
Tested-by: David Vrabel <[email protected]>
Cc: Nathan Zimmer <[email protected]>
Cc: Robin Holt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When hot add two nodes continuously, we found the vmemmap region info is
a bit messed. The last region of node 2 is printed when node 3 hot
added, like the following:
Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
On node 2 totalpages: 0
Built 2 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
[mem 0x40000000000-0x407ffffffff] page 1G
[ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
[ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
...
[ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
[ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
On node 3 totalpages: 0
Built 3 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
[mem 0x60000000000-0x607ffffffff] page 1G
[ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
[ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
[ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
[ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
...
The cause is that the last region was missed at the end of hot-adding
memory, and p_start, p_end, and node_start were not reset, so when memory
is hot-added to a new node, the code considers the blocks non-contiguous
and prints the previous one. So we print the last vmemmap region at the
end of hot-adding memory to avoid the confusion.
Signed-off-by: Zhu Guihua <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP <ffff880017b97be8>
CR2: ffffc90008963690
Reproduce method (re-add a node):
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
This seems to be a use-after-free problem, and the root cause is that
zone->wait_table was not set to *NULL* after being freed in
try_offline_node().
When we hot re-add a node, we will reuse its pgdat, and so the zone
struct too. When adding pages to the target zone, the zone will be
initialized first (including the wait_table) if it is not already
initialized. The judgment of whether a zone is initialized is based on
zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
return !!zone->wait_table;
}
so if we do not set zone->wait_table to *NULL* after freeing it, the
memory hotplug routine will skip the init of the new zone when the node
is hot re-added, and the wait_table will still point to the freed memory.
We will then access an invalid address when trying to wake up waiters
after the i/o operation on the page is done, as in the trace above.
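A minimal sketch of the fix in try_offline_node() (simplified; the
wait_table may also come from bootmem, in which case it is not vfree'd but
still needs the reset):
if (is_vmalloc_addr(zone->wait_table))
        vfree(zone->wait_table);
zone->wait_table = NULL;        /* zone_is_initialized() now returns false on re-add */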
Signed-off-by: Gu Zheng <[email protected]>
Reported-by: Taku Izumi <[email protected]>
Reviewed-by: Yasuaki Ishimatsu <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that we have easy access to hugepages' activeness, the existing
helpers to get that information can be cleaned up.
[[email protected]: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There's a deadlock when concurrently hot-adding memory through the probe
interface and switching a memory block from offline to online.
When hot-adding memory via the probe interface, add_memory() first takes
mem_hotplug_begin() and then device_lock() is later taken when registering
the newly initialized memory block. This creates a lock dependency of (1)
mem_hotplug.lock (2) dev->mutex.
When switching a memory block from offline to online, dev->mutex is first
grabbed in device_online() when the write(2) transitions an existing
memory block from offline to online, and then online_pages() will take
mem_hotplug_begin().
This creates a lock inversion between mem_hotplug.lock and dev->mutex.
Vitaly reports that this deadlock can happen when kworker handling a probe
event races with systemd-udevd switching a memory block's state.
This patch requires the state transition to take mem_hotplug_begin()
before dev->mutex. Hot-adding memory via the probe interface creates a
memory block while holding mem_hotplug_begin(), so there is no way to take
dev->mutex first in this case.
online_pages() and offline_pages() are only called when transitioning
memory block state. We now require that mem_hotplug_begin() is taken
before calling them -- this requires exporting the mem_hotplug_begin() and
mem_hotplug_done() to generic code. In all hot-add and hot-remove cases,
mem_hotplug_begin() is done prior to device_online(). This is all that is
needed to avoid the deadlock.
Signed-off-by: David Rientjes <[email protected]>
Reported-by: Vitaly Kuznetsov <[email protected]>
Tested-by: Vitaly Kuznetsov <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Zhang Zhen <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Wang Nan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the section_nr_to_pfn() macro to convert between section number and
pfn, instead of open-coding it. No semantic changes.
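For reference, the macro is roughly (see include/linux/mmzone.h):
#define section_nr_to_pfn(sec) ((sec) << PFN_SECTION_SHIFT)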
Signed-off-by: Sheng Yong <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Qiu Xishi reported the following BUG when testing node hot-add/hot-remove
under stress conditions:
BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0
The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node(), which resets the entire content of the pgdat to 0. As
the pgdat is accessed lock-free, users still referencing it, such as the
vmstat_update routine, will panic.
process A: offline node XX:
vmstat_update()
refresh_cpu_vm_stats()
for_each_populated_zone()
find online node XX
cond_resched()
offline cpu and memory, then try_offline_node()
node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
zone = next_zone(zone)
pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now
next_online_pgdat(pgdat)
next_online_node(pgdat->node_id); // NULL pointer access
So the solution here is to postpone the reset of the obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and to reset only
pgdat->nr_zones and pgdat->classzone_idx to 0 rather than doing the
memset, so pointer information in the pgdat is not broken.
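A minimal sketch of the resulting hotadd_new_pgdat() behaviour (simplified):
pgdat = NODE_DATA(nid);
if (!pgdat) {
        pgdat = arch_alloc_nodedata(nid);       /* first hot-add of this node */
} else {
        /* re-added node: reset only what must be reset, keep pointers intact */
        pgdat->nr_zones = 0;
        pgdat->classzone_idx = 0;
}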
Signed-off-by: Gu Zheng <[email protected]>
Reported-by: Xishi Qiu <[email protected]>
Suggested-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Taku Izumi <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Xie XiuQi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memory hotplug and failure mechanisms have several places where pcplists
are drained so that pages are returned to the buddy allocator and can be
e.g. prepared for offlining. This is always done in the context of a
single zone, so we can restrict the pcplists drain to that single zone,
which is now possible.
The change should make memory offlining due to hotremove or failure
faster and no longer disturb unrelated pcplists.
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.
This patch introduces a new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to also be a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to that single
zone.
All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.
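The calling convention after this patch, in short:
drain_all_pages(NULL);  /* previous behaviour: drain the pcplists of every zone */
drain_all_pages(zone);  /* drain only the pcplists belonging to 'zone' */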
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When memory is hot-added, all the memory is in the offline state. So
clear all zones' present_pages because they will be updated in
online_pages() and offline_pages(). Otherwise, /proc/zoneinfo will be
corrupted:
When the memory of node2 is offline:
# cat /proc/zoneinfo
......
Node 2, zone Movable
......
spanned 8388608
present 8388608
managed 0
When we online memory on node2:
# cat /proc/zoneinfo
......
Node 2, zone Movable
......
spanned 8388608
present 16777216
managed 8388608
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Yasuaki Ishimatsu <[email protected]>
Cc: <[email protected]> [3.16+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In free_area_init_core(), zone->managed_pages is set to an approximate
value for lowmem, and will be adjusted when the bootmem allocator frees
pages into the buddy system.
But free_area_init_core() is also called by hotadd_new_pgdat() when
hot-adding memory. As a result, zone->managed_pages of the newly added
node's pgdat is set to an approximate value at the very beginning.
Even if the memory on that node has not been onlined,
/sys/devices/system/node/nodeXXX/meminfo has a wrong value:
hot-add node2 (memory not onlined)
cat /sys/devices/system/node/node2/meminfo
Node 2 MemTotal: 33554432 kB
Node 2 MemFree: 0 kB
Node 2 MemUsed: 33554432 kB
Node 2 Active: 0 kB
This patch fixes this problem by resetting node managed pages to 0 after
hot-adding a new node.
1. Move reset_managed_pages_done from reset_node_managed_pages() to
reset_all_zones_managed_pages()
2. Make reset_node_managed_pages() non-static
3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
is initialized
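The helper made non-static in step 2 is roughly the following (a sketch;
it simply zeroes the counter for every zone of the node):
void reset_node_managed_pages(pg_data_t *pgdat)
{
        struct zone *z;
        for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
                z->managed_pages = 0;
}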
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Cc: <[email protected]> [3.16+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When hot adding the same memory after hot removal, the following
messages are shown:
WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
...
Call Trace:
dump_stack+0x46/0x58
warn_slowpath_common+0x81/0xa0
warn_slowpath_null+0x1a/0x20
free_area_init_node+0x3fe/0x426
hotadd_new_pgdat+0x90/0x110
add_memory+0xd4/0x200
acpi_memory_device_add+0x1aa/0x289
acpi_bus_attach+0xfd/0x204
acpi_bus_attach+0x178/0x204
acpi_bus_scan+0x6a/0x90
acpi_device_hotplug+0xe8/0x418
acpi_hotplug_work_fn+0x1f/0x2b
process_one_work+0x14e/0x3f0
worker_thread+0x11b/0x510
kthread+0xe1/0x100
ret_from_fork+0x7c/0xb0
The detailed explanation is as follows:
When hot removing memory, pgdat is set to 0 in try_offline_node(). But
if the pgdat was allocated by the bootmem allocator, the clearing step is
skipped.
And when hot adding the same memory, the uninitialized pgdat is reused.
But free_area_init_node() checks whether pgdat is set to zero. As a
result, free_area_init_node() hits the WARN_ON().
This patch clears a pgdat that was allocated by the bootmem allocator in
try_offline_node().
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Cc: Zhang Zhen <[email protected]>
Cc: Wang Nan <[email protected]>
Cc: Tang Chen <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently memory-hotplug has two limits:
1. If the memory block is in ZONE_NORMAL, you can change it to
ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.
2. If the memory block is in ZONE_MOVABLE, you can change it to
ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.
With this patch, we can easily know which zone a memory block can be
onlined to, and don't need to know the above two limits.
Updated the related Documentation.
[[email protected]: use conventional comment layout]
[[email protected]: fix build with CONFIG_MEMORY_HOTREMOVE=n]
[[email protected]: remove unused local zone_prev]
Signed-off-by: Zhang Zhen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Wang Nan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This series of patches fixes a problem that occurs when memory is added in
a bad manner. For example: for an x86_64 machine booted with "mem=400M"
and with 2GiB of memory installed, the following commands cause the
problem:
# echo 0x40000000 > /sys/devices/system/memory/probe
[ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
# echo 0x48000000 > /sys/devices/system/memory/probe
[ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
# echo online_movable > /sys/devices/system/memory/memory9/state
# echo 0x50000000 > /sys/devices/system/memory/probe
[ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
# echo 0x58000000 > /sys/devices/system/memory/probe
[ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
# echo online_movable > /sys/devices/system/memory/memory11/state
# echo online> /sys/devices/system/memory/memory8/state
# echo online> /sys/devices/system/memory/memory10/state
# echo offline> /sys/devices/system/memory/memory9/state
[ 30.558819] Offlined Pages 32768
# free
total used free shared buffers cached
Mem: 780588 18014398509432020 830552 0 0 51180
-/+ buffers/cache: 18014398509380840 881732
Swap: 0 0 0
This is because the above commands probe higher memory after onlining a
section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.
After the second online_movable, the problem can be observed from
zoneinfo:
# cat /proc/zoneinfo
...
Node 0, zone Movable
pages free 65491
min 250
low 312
high 375
scanned 0
spanned 18446744073709518848
present 65536
managed 65536
...
This series of patches solves the problem by checking ZONE_MOVABLE when
choosing a zone for new memory. If the new memory is inside or higher
than ZONE_MOVABLE, it is made to go there instead.
After applying this series of patches, following are free and zoneinfo
result (after offlining memory9):
bash-4.2# free
total used free shared buffers cached
Mem: 780956 80112 700844 0 0 51180
-/+ buffers/cache: 28932 752024
Swap: 0 0 0
bash-4.2# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3389
min 14
low 17
high 21
scanned 0
spanned 4095
present 3998
managed 3977
nr_free_pages 3389
...
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 73724
min 341
low 426
high 511
scanned 0
spanned 98304
present 98304
managed 92958
nr_free_pages 73724
...
start_pfn: 4096
inactive_ratio: 1
Node 0, zone Normal
pages free 32630
min 120
low 150
high 180
scanned 0
spanned 32768
present 32768
managed 32768
nr_free_pages 32630
...
start_pfn: 262144
inactive_ratio: 1
Node 0, zone Movable
pages free 65476
min 241
low 301
high 361
scanned 0
spanned 98304
present 65536
managed 65536
nr_free_pages 65476
...
start_pfn: 294912
inactive_ratio: 1
This patch (of 7):
Introduce zone_for_memory() in arch-independent code for
arch_add_memory() to use.
Many arch_add_memory() implementations simply select ZONE_HIGHMEM or
ZONE_NORMAL and add the new memory into it. However, with the existence
of ZONE_MOVABLE, the selection method should be carefully considered: if
new, higher memory is added after ZONE_MOVABLE is set up, the default
zone and ZONE_MOVABLE may overlap each other.
should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
already contains memory, compare the address of the new memory and the
movable memory. If the new memory is higher than the movable memory, it
should be added into ZONE_MOVABLE instead of the default zone.
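A sketch of the check described above (close to the code this patch adds;
zone_for_memory() then picks ZONE_MOVABLE or the arch's default zone based
on this result):
static int should_add_memory_movable(int nid, u64 start, u64 size)
{
        unsigned long start_pfn = start >> PAGE_SHIFT;
        pg_data_t *pgdat = NODE_DATA(nid);
        struct zone *movable_zone = pgdat->node_zones + ZONE_MOVABLE;
        if (zone_is_empty(movable_zone))
                return 0;
        if (movable_zone->zone_start_pfn <= start_pfn)
                return 1;       /* new memory is inside or above ZONE_MOVABLE */
        return 0;
}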
Signed-off-by: Wang Nan <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: "Mel Gorman" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In store_mem_state(), we have:
...
334 else if (!strncmp(buf, "offline", min_t(int, count, 7)))
335 online_type = -1;
...
355 case -1:
356 ret = device_offline(&mem->dev);
357 break;
...
Here, "offline" is hard coded as -1.
This patch does the following renaming:
ONLINE_KEEP -> MMOP_ONLINE_KEEP
ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE
and introduces MMOP_OFFLINE = -1 to avoid hard coding.
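The resulting set of values, roughly as added to
include/linux/memory_hotplug.h:
enum {
        MMOP_OFFLINE = -1,      /* replaces the hard-coded -1 above */
        MMOP_ONLINE_KEEP,
        MMOP_ONLINE_KERNEL,
        MMOP_ONLINE_MOVABLE,
};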
Signed-off-by: Tang Chen <[email protected]>
Cc: Hu Tao <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Gu Zheng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Toshi Kani <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.
This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.
If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.
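A sketch of the callback shape this adds (paralleling the existing
new_page_t; the migrate_pages() prototype is simplified here):
typedef void free_page_t(struct page *page, unsigned long private);
int migrate_pages(struct list_head *from, new_page_t get_new_page,
                  free_page_t put_new_page, unsigned long private,
                  enum migrate_mode mode, int reason);
Passing NULL for put_new_page keeps the old behaviour of freeing unused
destination pages back to the system.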
Signed-off-by: David Rientjes <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace ((x) >> PAGE_SHIFT) with the pfn macro.
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
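The intended usage pattern, mirroring get/put_online_cpus() (do_something()
is a placeholder for the caller's work):
get_online_mems();
/* the set of online nodes and present pages is stable in this section */
for_each_online_node(nid)
        do_something(NODE_DATA(nid));
put_online_mems();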
Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
lock_memory_hotplug
We don't need to do register_memory_resource() under
lock_memory_hotplug() since it has its own lock and doesn't make any
callbacks.
Also, register_memory_resource() returns NULL on failure, so we don't have
anything to clean up at this point.
The reason for this RFC is that I was doing some experiments with hotplugging
of memory on some of our larger systems. While it seems to work, it can
be quite slow. With some preliminary digging I found that
lock_memory_hotplug is clearly ripe for breakup.
It could be broken up per nid or something but it also covers the
online_page_callback. The online_page_callback shouldn't be very hard
to break out.
Also there is the issue of various structures (wmarks come to mind) that
are only updated under lock_memory_hotplug and would need to be dealt
with.
Cc: Tang Chen <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Reviewed-by: Yasuaki Ishimatsu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Hedi <[email protected]>
Cc: Mike Travis <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether ->index or ->mapping is required to be NULL.
This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.
This way, the messages will show why the page was bad and, if a page flag
combination was the problem, *specifically* which flags it is complaining
about.
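A sketch of the resulting interfaces (simplified; the reason string below
is just an example, not an actual message from the patch):
void dump_page(struct page *page, const char *reason);
static void bad_page(struct page *page, const char *reason, unsigned long bad_flags);
/* example call site */
dump_page(page, "example reason string");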
[[email protected]: switch to pr_alert]
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Correct ensure_zone_is_initialized() function description according to
the introduced memblock APIs for early memory allocations.
Signed-off-by: Grygorii Strashko <[email protected]>
Signed-off-by: Santosh Shilimkar <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Russell King <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tony Lindgren <[email protected]>
Cc: Yinghai Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Clean-up to remove the dependency on bootmem headers.
Signed-off-by: Grygorii Strashko <[email protected]>
Signed-off-by: Santosh Shilimkar <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Russell King <[email protected]>
Cc: Tony Lindgren <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Linux kernel cannot migrate pages used by the kernel. As a result,
hotpluggable memory used by the kernel won't be able to be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel at early boot time, and to
arrange all hotpluggable memory described in the ACPI SRAT (System
Resource Affinity Table) as ZONE_MOVABLE when initializing zones.
In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions
in the default top-down allocation function if movable_node boot option
is specified.
[[email protected]: coding-style fixes]
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Rafael J . Wysocki" <[email protected]>
Cc: Chen Tang <[email protected]>
Cc: Gong Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Liu Jiang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Taku Izumi <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Renninger <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Vasilis Liaskovitis <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Yinghai Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE, so by doing this the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that it
can later be set as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
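A sketch of the switch this implies (the memblock direction helper is
memblock_set_bottom_up(); where exactly it is called from is simplified here):
/* early boot, if "movable_node" was specified: */
memblock_set_bottom_up(true);   /* 1. allocate near the kernel image, bottom up */
/* once SRAT has been parsed: */
memblock_set_bottom_up(false);  /* 2. back to the default top-down allocation */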
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
Suggested-by: Kamezawa Hiroyuki <[email protected]>
Suggested-by: Ingo Molnar <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Thomas Renninger <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Taku Izumi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For the following functions,
- sparse_add_one_section()
- kmalloc_section_memmap()
- __kmalloc_section_memmap()
- __kfree_section_memmap()
they are always invoked to operate on a single memory section, so it is
redundant to pass them a nr_pages parameter that is always the number of
pages in one section. We can use the predefined macro PAGES_PER_SECTION
directly instead of passing the parameter.
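Roughly, the change is of this shape (signatures shown for illustration
only, not copied verbatim from the tree):

	/* before: callers always pass PAGES_PER_SECTION anyway */
	int __meminit sparse_add_one_section(struct zone *zone,
					     unsigned long start_pfn,
					     int nr_pages);

	/* after: the per-section page count is implied */
	int __meminit sparse_add_one_section(struct zone *zone,
					     unsigned long start_pfn);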
Signed-off-by: Zhang Yanfei <[email protected]>
Cc: Wen Congyang <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Yasunori Goto <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
mem_online_node() to put the CPU's node online if it is offline, and then
call build_all_zonelists() to initialize the zone list.
These steps are specific to memory hotplug and should be managed in
mm/memory_hotplug.c, and lock_memory_hotplug() should be held across all
of them.
For this reason, this patch replaces mem_online_node() with
try_online_node(), which performs all of these steps with
lock_memory_hotplug() held. try_online_node() is named after
try_offline_node() as they have a similar purpose.
There is no functional change in this patch.
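For illustration only (a sketch of the caller side, not the exact diff),
cpu_up() can then simply do:

	int cpu_up(unsigned int cpu)
	{
		int err;

		/*
		 * Let mm/memory_hotplug.c bring the CPU's node online (and
		 * build its zonelists) under lock_memory_hotplug(), instead
		 * of open-coding those steps behind an #ifdef here.
		 */
		err = try_online_node(cpu_to_node(cpu));
		if (err)
			return err;

		/* ... bring the CPU itself up ... */
		return 0;
	}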
Signed-off-by: Toshi Kani <[email protected]>
Reviewed-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))".
Signed-off-by: Xishi Qiu <[email protected]>
Acked-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
is_memblock_offlined() returning 1 means the memory block is offlined,
but is_memblock_offlined_cb() returning 1 means the memory block is not
offlined. This is confusing, so rename the function.
Signed-off-by: Xishi Qiu <[email protected]>
Acked-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.
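For reference, pgdat_end_pfn() is simply that expression wrapped in a
helper, along these lines (quoted from memory, so treat it as a sketch):

	static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
	{
		return pgdat->node_start_pfn + pgdat->node_spanned_pages;
	}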
Signed-off-by: Xishi Qiu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"All of these commits are fixes that have emerged recently and some of
them fix bugs introduced during this merge window.
Specifics:
1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events
After the recent ACPIPHP changes we've seen some interesting
breakage on a system that triggers device check notifications
during boot for non-existing devices. Although those
notifications are really spurious, we should be able to deal with
them nevertheless and that shouldn't introduce too much overhead.
Four commits to make that work properly.
2) Memory hotplug and hibernation mutual exclusion rework
This was meant to be a cleanup, but it happens to fix a classical
ABBA deadlock between system suspend/hibernation and ACPI memory
hotplug which is possible if they are started roughly at the same
time. Three commits rework memory hotplug so that it doesn't
acquire pm_mutex and make hibernation use device_hotplug_lock
which prevents it from racing with memory hotplug.
3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix
The ACPI LPSS driver crashes during boot on an Apple MacBook Air with
Haswell that has a slightly unusual BIOS configuration in which one
of the LPSS device's _CRS method doesn't return all of the
information expected by the driver. Fix from Mika Westerberg, for
stable.
4) ACPICA fix related to Store->ArgX operation
AML interpreter fix for obscure breakage that causes AML to be
executed incorrectly on some machines (observed in practice).
From Bob Moore.
5) ACPI core fix for PCI ACPI device objects lookup
There still are cases in which there is more than one ACPI device
object matching a given PCI device and we don't choose the one
that the BIOS expects us to choose, so this makes the lookup take
more criteria into account in those cases.
6) Fix to prevent cpuidle from crashing in some rare cases
If the result of cpuidle_get_driver() is NULL, which can happen on
some systems, cpuidle_driver_ref() will crash trying to use that
pointer, and Daniel Fu's fix prevents that from happening.
7) cpufreq fixes related to CPU hotplug
Stephen Boyd reported a number of concurrency problems with
cpufreq related to CPU hotplug which are addressed by a series of
fixes from Srivatsa S Bhat and Viresh Kumar.
8) cpufreq fix for time conversion in time_in_state attribute
Time conversion carried out by cpufreq when user space attempts to
read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
won't work correctly if cputime_t doesn't map directly to jiffies.
Fix from Andreas Schwab.
9) Revert of a troublesome cpufreq commit
Commit 7c30ed5 (cpufreq: make sure frequency transitions are
serialized) was intended to address some known concurrency
problems in cpufreq related to the ordering of transitions, but
unfortunately it introduced several problems of its own, so I
decided to revert it now and address the original problems later
in a more robust way.
10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.
11) cpufreq fixes related to system suspend/resume
The recent cpufreq changes that made it preserve CPU sysfs
attributes over suspend/resume cycles introduced a possible NULL
pointer dereference that caused it to crash during the second
attempt to suspend. Three commits from Srivatsa S Bhat fix that
problem and a couple of related issues.
12) cpufreq locking fix
cpufreq_policy_restore() should acquire the lock for reading, but
it acquires it for writing. Fix from Lan Tianyu"
* tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
cpufreq: Restructure if/else block to avoid unintended behavior
cpufreq: Fix crash in cpufreq-stats during suspend/resume
intel_pstate: Add Haswell CPU models
Revert "cpufreq: make sure frequency transitions are serialized"
cpufreq: Use signed type for 'ret' variable, to store negative error values
cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
cpufreq: Split __cpufreq_remove_dev() into two parts
cpufreq: Fix wrong time unit conversion
cpufreq: serialize calls to __cpufreq_governor()
cpufreq: don't allow governor limits to be changed when it is disabled
ACPI / bind: Prefer device objects with _STA to those without it
ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
...
|
|
Until now we cannot offline memory blocks which contain hugepages because
a hugepage is considered an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.
What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block would interfere with the memory offlining. For this
reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().
Other than that, this patch straightforwardly adds hugepage migration
code, that is, it adds hugepage handling to the functions which scan over
pfns and collect the pages to be migrated, and adds a hugepage allocation
path to alloc_migrate_target().
As for larger hugepages (1GB on x86_64), it's not easy to hot-remove them
because they are larger than a memory block, so for now we simply let the
operation fail.
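A minimal sketch of what the dissolve helper might look like (illustrative
only; the real implementation also has to pick the smallest hugepage order
when several hugepage sizes exist):

	/*
	 * Walk [start_pfn, end_pfn) in hugepage-sized steps and give any
	 * free hugepage found there back to the buddy allocator, so the
	 * containing memory block can subsequently be offlined.
	 */
	void dissolve_free_huge_pages(unsigned long start_pfn,
				      unsigned long end_pfn)
	{
		unsigned int order = huge_page_order(&default_hstate);
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn += 1UL << order)
			dissolve_free_huge_page(pfn_to_page(pfn));
	}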
[[email protected]: remove duplicated include]
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
lock_device_hotplug() serializes hotplug & online/offline operations. The
lock is held in common sysfs online/offline interfaces and ACPI hotplug
code paths.
And here are the code paths:
- CPU & Mem online/offline via sysfs online:
store_online()->lock_device_hotplug()
- Mem online via sysfs state:
store_mem_state()->lock_device_hotplug()
- ACPI CPU & Mem hot-add:
acpi_scan_bus_device_check()->lock_device_hotplug()
- ACPI CPU & Mem hot-delete:
acpi_scan_hot_remove()->lock_device_hotplug()
try_offline_node() off-lines a node if all memory sections and cpus are
removed on the node. It is called from acpi_processor_remove() and
acpi_memory_remove_memory()->remove_memory() paths, both of which are in
the ACPI hotplug code.
try_offline_node() calls stop_machine() to stop all cpus while checking
the status of every cpu, on the assumption that the caller is not
protected against CPU hotplug or CPU online/offline operations. However,
the caller is always serialized by lock_device_hotplug(). Besides, this
check should be properly serialized with a lock, not by stopping all cpus
at an arbitrary point with stop_machine().
This patch removes the use of stop_machine() in try_offline_node() and
adds comments to try_offline_node() and remove_memory() that
lock_device_hotplug() is required.
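With the caller already serialized by lock_device_hotplug(), the CPU check
can be a plain loop rather than a stop_machine() callback; roughly (a
sketch, not the exact diff):

	/*
	 * Return -EBUSY if any present cpu still belongs to this node.
	 * Safe to call directly because the caller holds
	 * lock_device_hotplug(), so cpus cannot come or go underneath us.
	 */
	static int check_cpu_on_node(pg_data_t *pgdat)
	{
		int cpu;

		for_each_present_cpu(cpu)
			if (cpu_to_node(cpu) == pgdat->node_id)
				return -EBUSY;

		return 0;
	}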
Signed-off-by: Toshi Kani <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Wanpeng Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|