|
Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables. Since pagetable implementations differ, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().
Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
[[email protected]: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Jianguo Wu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Search the page tables for entries that map the removed memory, and
clear those entries on the x86_64 architecture.
[[email protected]: make kernel_physical_mapping_remove() static]
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Jianguo Wu <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When memory is removed, the corresponding pagetables should also be
removed. This patch introduces some common APIs to support removing the
vmemmap pagetables and the x86_64 direct mapping pagetables.
Not all pages of the virtual mapping in the removed memory can be freed,
because a page used as PGD/PUD may cover not only the removed memory but
also other memory. So this patch uses the following way to check whether
a page can be freed or not.
1) When removing memory, the page structs of the removed memory are
filled with 0xFD.
2) When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD
can be cleared. In this case, the page used as PT/PMD can be freed.
For direct mapping pages, update direct_pages_count[level] when their
pagetables are freed. Do not free the pages again, because they were
freed when offlining.
For vmemmap pages, free the pages and their pagetables.
For larger pages, do not split them into smaller ones because there is
no way to know if the larger page has been split. As a result, there is
no way to decide when to split. We deal with the larger pages in the
following way:
1) For direct mapped pages, all the pages were freed when they were
offlined. And since memory offline is done section by section, all
the memory ranges being removed are aligned to PAGE_SIZE. So we only
need to deal with unaligned pages when freeing vmemmap pages.
2) For vmemmap pages being used to store page structs, if part of the
larger page is still in use, just fill the unused part with 0xFD. When
the whole page is filled with 0xFD, free the larger page.
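The 0xFD marker scheme described above can be sketched in user-space C. This is only an illustration under stated assumptions: the names PAGE_INUSE_MARKER, mark_unused() and page_fully_unused() are hypothetical, and a plain byte buffer stands in for a page that backs page structs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Marker byte taken from the text; the macro name is hypothetical. */
#define PAGE_INUSE_MARKER 0xFD

/* Fill the part of a backing page that belonged to removed memory. */
static void mark_unused(unsigned char *page, size_t offset, size_t len)
{
	memset(page + offset, PAGE_INUSE_MARKER, len);
}

/* A backing page may be freed only once every byte carries the marker. */
static bool page_fully_unused(const unsigned char *page, size_t size)
{
	for (size_t i = 0; i < size; i++)
		if (page[i] != PAGE_INUSE_MARKER)
			return false;
	return true;
}
```

In this sketch a page that is only half marked is still reported as in use, which mirrors the rule that a larger page is freed only after all of it has been filled.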
[[email protected]: fix typo in comment]
[[email protected]: do not calculate direct mapping pages when freeing vmemmap pagetables]
[[email protected]: do not free direct mapping pages twice]
[[email protected]: do not free page split from hugepage one by one]
[[email protected]: do not split pages when freeing pagetable pages]
[[email protected]: use pmd_page_vaddr()]
[[email protected]: fix used-uninitialised bug]
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Jianguo Wu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In __remove_section(), we take pgdat_resize_lock when calling
sparse_remove_one_section(). Taking this lock disables IRQs, but we
don't need to hold it across the whole function. If we do some work to
free pagetables in free_section_usemap(), we need to call
flush_tlb_all(), which needs IRQs enabled. Otherwise the WARN_ON_ONCE()
in smp_call_function_many() will be triggered.
If we lock the whole sparse_remove_one_section(), we end up with this call trace:
------------[ cut here ]------------
WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
Hardware name: PRIMEQUEST 1800E
......
Call Trace:
smp_call_function_many+0xbd/0x260
smp_call_function+0x3b/0x50
on_each_cpu+0x3b/0xc0
flush_tlb_all+0x1c/0x20
remove_pagetable+0x14e/0x1d0
vmemmap_free+0x18/0x20
sparse_remove_one_section+0xf7/0x100
__remove_section+0xa2/0xb0
__remove_pages+0xa0/0xd0
arch_remove_memory+0x6b/0xc0
remove_memory+0xb8/0xf0
acpi_memory_device_remove+0x53/0x96
acpi_device_remove+0x90/0xb2
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_remove+0x32/0x6d
acpi_bus_trim+0x91/0x102
acpi_bus_hot_remove_device+0x88/0x16b
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
---[ end trace 25e85300f542aa01 ]---
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To remove a memmap region of sparse-vmemmap that was allocated from
bootmem, the region needs to be registered by get_page_bootmem(). So the
patch searches the pages of the virtual mapping and registers them with
get_page_bootmem().
NOTE: register_page_bootmem_memmap() is not implemented for ia64,
ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
and revert register_page_bootmem_info_node() when the platform doesn't
support it.
It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which is automatically selected by
archs that fully support memory hotplug (currently only x86_64).
Since we have 2 config options called MEMORY_HOTPLUG and
MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
and the code in register_page_bootmem_info_node() is only used for
collecting information for hot-remove, place it under
MEMORY_HOTREMOVE.
Besides, page_isolation.c, selected by MEMORY_ISOLATION under
MEMORY_HOTPLUG, is a similar case, so move it too.
[[email protected]: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
[[email protected]: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
[[email protected]: remove the arch specific functions without any implementation]
[[email protected]: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
[[email protected]: fix defined but not used warning]
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wu Jianguo <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Lin Feng <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For removing memory, we need to remove page tables, but how to do that
depends on the architecture. So the patch introduces
arch_remove_memory() for removing page tables. For now it only calls
__remove_pages().
Note: __remove_pages() is not implemented for some architectures
(I don't know how to implement it for s390).
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start,
type} sysfs files are created. But there is no code to remove these
files. This patch implements the function to remove them.
We cannot free firmware_map_entry which is allocated by bootmem because
there is no way to do so when the system is up. But we can at least
remember the address of that memory and reuse the storage when the
memory is added next time.
This patch also introduces a new list map_entries_bootmem to link the
map entries allocated by bootmem when they are removed, and a lock to
protect it. And these entries will be reused when the memory is
hot-added again.
The idea was suggested by Andrew Morton.
NOTE: It is unsafe to return an entry pointer and release the
map_entries_lock. So we should not hold the map_entries_lock
separately in firmware_map_find_entry() and
firmware_map_remove_entry(). Hold the map_entries_lock across find
and remove /sys/firmware/memmap/X operation.
And also, users of these two functions need to be careful to
hold the lock when using these two functions.
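The reuse of bootmem-allocated entries described above can be sketched as two singly linked lists. This is a simplified user-space model, not the code in drivers/firmware/memmap.c: struct map_entry_sketch, add_entry() and remove_entry() are hypothetical names, matching by start/end is an assumption, and locking is left out because the sketch is single-threaded (in the real code the find and the remove would happen under one hold of the lock, as the NOTE says).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct map_entry_sketch {
	unsigned long long start, end;
	bool from_bootmem;	/* storage cannot be freed once the system is up */
	struct map_entry_sketch *next;
};

static struct map_entry_sketch *map_entries;		/* live entries */
static struct map_entry_sketch *map_entries_bootmem;	/* parked for reuse */

/* Unlink and return a parked entry covering the same range, if any. */
static struct map_entry_sketch *take_parked(unsigned long long start,
					    unsigned long long end)
{
	struct map_entry_sketch **pp;

	for (pp = &map_entries_bootmem; *pp; pp = &(*pp)->next) {
		struct map_entry_sketch *e = *pp;

		if (e->start == start && e->end == end) {
			*pp = e->next;
			return e;
		}
	}
	return NULL;
}

/* Add an entry, preferring parked bootmem storage over a new allocation. */
static struct map_entry_sketch *add_entry(unsigned long long start,
					  unsigned long long end)
{
	struct map_entry_sketch *e = take_parked(start, end);

	if (!e) {
		e = calloc(1, sizeof(*e));
		if (!e)
			return NULL;
	}
	e->start = start;
	e->end = end;
	e->next = map_entries;
	map_entries = e;
	return e;
}

/* Find and remove in one step; bootmem entries are parked, not freed. */
static bool remove_entry(unsigned long long start, unsigned long long end)
{
	struct map_entry_sketch **pp;

	for (pp = &map_entries; *pp; pp = &(*pp)->next) {
		struct map_entry_sketch *e = *pp;

		if (e->start != start || e->end != end)
			continue;
		*pp = e->next;
		if (e->from_bootmem) {
			e->next = map_entries_bootmem;
			map_entries_bootmem = e;
		} else {
			free(e);
		}
		return true;
	}
	return false;
}
```

Keeping find and remove inside one function is the sketch's stand-in for holding map_entries_lock across both operations.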
[[email protected]: Hold spinlock across find|remove /sys operation]
[[email protected]: fix the wrong comments of map_entries]
[[email protected]: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
[[email protected]: fix section mismatch problem]
[[email protected]: fix the doc format in drivers/firmware/memmap.c]
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Kamezawa Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Julian Calaby <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Offlining memory blocks and checking whether memory blocks are offlined
are very similar. This patch introduces a new function to remove the
redundant code.
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Kamezawa Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
removing memory
We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug
All memory blocks must be offlined before removing memory. But we don't
hold the lock for the whole operation. So we should check whether all
memory blocks are offlined before step 6. Otherwise, the kernel may
panic.
Offlining a memory block and removing a memory device can be two
different operations. Users can just offline some memory blocks without
removing the memory device. For this purpose, the kernel already takes
lock_memory_hotplug() in __offline_pages(). To reuse the code for
memory hot-remove, we repeat steps 1-3 to offline all the memory blocks,
repeatedly locking and unlocking memory hotplug, rather than holding the
memory hotplug lock for the whole operation.
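The sequence above, including the re-check before step 6, can be modelled in a few lines of user-space C. Everything here is a hypothetical stand-in: a counter replaces the memory hotplug lock, and struct mem_block_sketch, offline_all() and remove_if_offline() are illustrative names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

struct mem_block_sketch {
	bool online;
};

static int hotplug_lock_depth;	/* stand-in for the memory hotplug lock */

static void lock_hotplug(void)   { hotplug_lock_depth++; }
static void unlock_hotplug(void) { hotplug_lock_depth--; }

/* Steps 1-4: each block is offlined under its own lock/unlock pair. */
static void offline_all(struct mem_block_sketch *b, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		lock_hotplug();
		b[i].online = false;
		unlock_hotplug();
	}
}

/*
 * Steps 5-7: because the lock was dropped between blocks, a block may
 * have been onlined again, so verify the state before removing.
 * Returns 0 on success and -1 if some block is still online.
 */
static int remove_if_offline(const struct mem_block_sketch *b, size_t n)
{
	int ret = 0;

	lock_hotplug();
	for (size_t i = 0; i < n; i++) {
		if (b[i].online) {
			ret = -1;
			break;
		}
	}
	/* On success, the actual removal (step 6) would happen here. */
	unlock_hotplug();
	return ret;
}
```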
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories memory8, memory9, memory10, and memory11
under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we allocate memory to store page cgroup
when we online pages. When we online memory8, the memory that stores
its page cgroup is not provided by this memory device. But when we
online memory9, the memory that stores its page cgroup may be provided
by memory8. So we can't offline memory8 now. We should offline the
memory in the reverse order.
When the memory device is hot-removed, we automatically offline the
memory provided by this memory device. But we don't know which memory
was onlined first, so offlining memory may fail. In such a case, iterate
twice to offline the memory. 1st iteration: offline every non-primary
memory block. 2nd iteration: offline the primary (i.e. first added)
memory block.
This idea was suggested by KOSAKI Motohiro.
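A small model may make the two-pass order easier to follow. It assumes, purely for illustration, that the primary block cannot go offline while any other block is still online (standing in for "other blocks keep their page cgroup storage in it"). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct mem_block_sketch {
	bool online;
	bool primary;	/* first added block; others may depend on it */
};

/* Offlining the primary block fails while another block is online. */
static bool try_offline(struct mem_block_sketch *b, size_t n, size_t i)
{
	if (b[i].primary) {
		for (size_t j = 0; j < n; j++)
			if (j != i && b[j].online)
				return false;
	}
	b[i].online = false;
	return true;
}

/* Pass 1 handles non-primary blocks, pass 2 the primary block. */
static bool offline_two_pass(struct mem_block_sketch *b, size_t n)
{
	bool ok = true;

	for (size_t i = 0; i < n; i++)
		if (!b[i].primary && !try_offline(b, n, i))
			ok = false;
	for (size_t i = 0; i < n; i++)
		if (b[i].primary && !try_offline(b, n, i))
			ok = false;
	return ok;
}
```

Offlining in plain index order fails on the first block in this model, while the two-pass order succeeds without needing to know which block was onlined first.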
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jianguo Wu <[email protected]>
Cc: Kamezawa Hiroyuki <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Wu Jianguo <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove one redundant check of res.
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.
This could be fixed by introducing the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().
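The design choice can be illustrated with a stand-alone sketch: the mapping function rounds the length once and reports how many bytes to populate, so the caller never repeats the rounding. The 4096-byte page size, the SK_ macros and map_len_sketch() are assumptions made for this example only.

```c
#include <stdbool.h>

/* Hypothetical page size and rounding macro for the sketch. */
#define SK_PAGE_SIZE 4096UL
#define SK_PAGE_ALIGN(len) (((len) + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1))

/*
 * Returns the rounded mapping length. *populate receives the number of
 * bytes the caller should populate afterwards, or 0 for "nothing".
 */
static unsigned long map_len_sketch(unsigned long len, bool want_populate,
				    unsigned long *populate)
{
	unsigned long rounded = SK_PAGE_ALIGN(len);

	*populate = want_populate ? rounded : 0;
	return rounded;
}
```

With a boolean result the caller would have had to recompute the rounded size itself; with a size, a zero value doubles as "do not populate".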
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The mm_populate() code populates user mappings without constantly
holding the mmap_sem. This makes it susceptible to racy userspace
programs: the user mappings may change while mm_populate() is running,
and in this case mm_populate() may end up populating the new mapping
instead of the old one.
In order to reduce the possibility of userspace getting surprised by
this behavior, this change introduces the VM_POPULATE vma flag which
gets set on vmas we want mm_populate() to work on. This way
mm_populate() may still end up populating the new mapping after such a
race, but only if the new mapping is also one that the user has
requested (using MAP_POPULATE, MAP_LOCKED or mlock) to be populated.
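The effect of the flag can be sketched as a filter over the mappings that overlap the requested range. The flag value, struct vma_sketch and populate_flagged() are hypothetical and chosen only for this illustration; the function returns a byte count instead of faulting in pages.

```c
#include <stddef.h>

/* Hypothetical flag value; not the kernel's definition. */
#define SK_VM_POPULATE 0x2UL

struct vma_sketch {
	unsigned long start, end;	/* [start, end) */
	unsigned long flags;
};

/* Count the bytes in [start, end) that belong to flagged mappings. */
static unsigned long populate_flagged(const struct vma_sketch *vmas, size_t n,
				      unsigned long start, unsigned long end)
{
	unsigned long populated = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long lo = vmas[i].start > start ? vmas[i].start : start;
		unsigned long hi = vmas[i].end < end ? vmas[i].end : end;

		if (lo >= hi)
			continue;	/* no overlap with the requested range */
		if (!(vmas[i].flags & SK_VM_POPULATE))
			continue;	/* mapping was not requested to be populated */
		populated += hi - lo;
	}
	return populated;
}
```

If a racing thread replaces part of the range with an unflagged mapping, that part is skipped in this model.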
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
the vma type - we know we're working with a stack. So, we can call
directly into __mlock_vma_pages_range(), and remove the last
make_pages_present() call site.
Note that we don't use mm_populate() here, so we can't release the
mmap_sem while allocating new stack pages. This is deemed acceptable,
because the stack vmas grow by a bounded number of pages at a time, and
these are anon pages so we don't have to read from disk to populate
them.
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After the MAP_POPULATE handling has been moved to mmap_region() call
sites, the only remaining use of the flags argument is to pass the
MAP_NORESERVE flag. This can be just as easily handled by
do_mmap_pgoff(), so do that and remove the mmap_region() flags
parameter.
[[email protected]: remove double parens]
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Michel Lespinasse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.
This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was used during mlock() /
mlockall().
Signed-off-by: Michel Lespinasse <[email protected]>
Reported-by: Andy Lutomirski <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have many vma manipulation functions that are fast in the typical
case, but can optionally be instructed to populate an unbounded number
of ptes within the region they work on:
- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
effect;
- brk() when mlock(MCL_FUTURE) is in effect.
Current code handles these pte operations locally, while the
surrounding code has to hold the mmap_sem write side since it's
manipulating vmas. This means we're doing an unbounded amount of pte
population work with mmap_sem held, and this causes problems as Andy
Lutomirski reported (we've hit this at Google as well, though it's not
entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
first place).
I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate()
does need to acquire the mmap_sem read side, but critically, it doesn't
need to hold it continuously for the entire duration of the operation -
it can drop it whenever things take too long (such as when hitting disk
for a file read) and re-acquire it later on.
The following patches are included:
- Patch 1 fixes some issues I noticed while working on the existing code.
If needed, it could potentially go in before the rest of the patches.
- Patch 2 introduces the new mm_populate() function and changes
mmap_region() call sites to use it after they drop mmap_sem. This is
inspired by Andy Lutomirski's proposal and is built as an extension
of the work I had previously done for mlock() and mlockall() around
v2.6.38-rc1. I had tried doing something similar at the time but had
given up as there were so many do_mmap() call sites; the recent cleanups
by Linus and Viro are a tremendous help here.
- Patches 3-5 convert some of the less-obvious places doing unbounded
pte populates to the new mm_populate() mechanism.
- Patches 6-7 are code cleanups that are made possible by the
mm_populate() work. In particular, they remove more code than the
entire patch series added, which should be a good thing :)
- Patch 8 is optional to this entire series. It only helps to deal more
nicely with racy userspace programs that might modify their mappings
while we're trying to populate them. It adds a new VM_POPULATE flag
on the mappings we do want to populate, so that if userspace replaces
them with mappings it doesn't want populated, mm_populate() won't
populate those replacement mappings.
This patch:
Assorted small fixes. The first two are quite small:
- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
Purely cosmetic.
- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
range, make sure we own the mmap_sem write lock around the
munlock_vma_pages_range call as this manipulates the vma's vm_flags.
The last fix requires a longer explanation. remap_file_pages() can do its
work either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when they faulted in the newly mapped file pages:
- In the VM_NONLINEAR case, new file pages would be populated if
the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
new file pages wouldn't be populated even if the vma is already
marked as VM_LOCKED.
- In the linear (emulated) case, the work is passed to the mmap_region()
function which would populate the pages if the vma is marked as
VM_LOCKED, and would not otherwise - regardless of the value of the
MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
mmap_region().
The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().
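The old and desired rules can be written as small predicates to make the difference explicit. This is a truth-table sketch of the prose above, not kernel code; all function names are hypothetical.

```c
#include <stdbool.h>

/* Old VM_NONLINEAR path: only MAP_NONBLOCK mattered. */
static bool old_nonlinear_populates(bool vm_locked, bool map_nonblock)
{
	(void)vm_locked;
	return !map_nonblock;
}

/* Old linear (emulated) path: only VM_LOCKED mattered. */
static bool old_linear_populates(bool vm_locked, bool map_nonblock)
{
	(void)map_nonblock;
	return vm_locked;
}

/* Desired: populate if the vma is locked or MAP_NONBLOCK is absent. */
static bool desired_populates(bool vm_locked, bool map_nonblock)
{
	return vm_locked || !map_nonblock;
}

/* Desired: pages are also locked only when the vma is VM_LOCKED. */
static bool desired_locks(bool vm_locked)
{
	return vm_locked;
}
```

The two old paths disagree with the desired rule in different corners: the nonlinear path misses a locked vma when MAP_NONBLOCK is passed, and the linear path misses an unlocked vma when MAP_NONBLOCK is not passed.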
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Tested-by: Andy Lutomirski <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that balance_pgdat() is slightly tidied up, thanks to more capable
pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
check the status, then break the loop if pgdat is balanced, just to be
immediately called again. The second call is completely unnecessary, of
course.
The patch introduces pgdat_is_balanced boolean, which helps resolve the
above suboptimal behavior, with the added benefit of slightly better
documenting one other place in the function where we jump and skip lots
of code.
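The saving can be shown with a toy loop that counts how often the balance check runs. This is a simplified model, not balance_pgdat(): node_balanced(), balance_node() and the "10 pages per round" reclaim step are assumptions made for illustration.

```c
#include <stdbool.h>

static int balance_checks;	/* counts calls, only to make the saving visible */

static bool node_balanced(long free_pages, long target)
{
	balance_checks++;
	return free_pages >= target;
}

/* Returns the free page count once balanced, or -1 if never balanced. */
static long balance_node(long free_pages, long target, int max_rounds)
{
	bool is_balanced = false;

	for (int round = 0; round < max_rounds; round++) {
		if (node_balanced(free_pages, target)) {
			is_balanced = true;	/* remember the answer */
			break;
		}
		free_pages += 10;	/* pretend one round reclaims 10 pages */
	}

	/* Reuse the recorded result instead of calling the check again. */
	return is_balanced ? free_pages : -1;
}
```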
Signed-off-by: Zlatko Calusic <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These functions always return 0. Formalise this.
Cc: Jason Liu <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make madvise(MADV_WILLNEED) support swap file prefetch. If memory has
been swapped out, this syscall can prefetch it from swap. It has no
impact if the memory isn't swapped out.
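From user space, the call looks like the following sketch. It maps an anonymous region, touches it, and then advises the kernel that the range will be needed soon; whether any pages are actually read back from swap depends on the system state. prefetch_region() is a hypothetical helper name, and the sketch assumes a Linux system with a kernel that accepts MADV_WILLNEED on anonymous memory.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Returns the madvise() result, or -1 if the mapping could not be created. */
static int prefetch_region(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xAB, len);		/* make the pages resident at least once */

	/* Hint that the range will be accessed soon. */
	ret = madvise(buf, len, MADV_WILLNEED);

	munmap(buf, len);
	return ret;
}
```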
[[email protected]: fix CONFIG_SWAP=n build]
[[email protected]: fix BUG on madvise early failure]
Signed-off-by: Shaohua Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Targeted (hard or soft) reclaim has traditionally tried to scan one
group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
pages) is reclaimed or all priorities are exhausted. The reclaim is
then retried until the limit is met.
This approach, however, doesn't work well with deeper hierarchies where
groups higher in the hierarchy do not have any or only very few pages
(this usually happens if those groups do not have any tasks and they
have only re-parented pages after some of their children are removed).
Those groups are reclaimed with decreasing priority pointlessly as there
is nothing to reclaim from them.
The easiest fix is to break out of the memcg iteration loop in
shrink_zone only if the whole hierarchy has been visited or sufficient
pages have been reclaimed. This is also more natural because the
reclaimer expects that the hierarchy under the given root is reclaimed.
As a result we can simplify the soft limit reclaim which does its own
iteration.
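The loop condition described above can be sketched as follows. The model assumes each group simply yields all of its reclaimable pages when visited; reclaim_hierarchy() and its parameters are hypothetical names, not the shrink_zone() code.

```c
#include <stddef.h>

/*
 * Walk the groups of a hierarchy in order. Stop early only when enough
 * pages have been reclaimed; otherwise visit the whole hierarchy.
 * *visited receives the number of groups that were looked at.
 */
static unsigned long reclaim_hierarchy(const unsigned long *group_pages,
				       size_t n_groups,
				       unsigned long nr_to_reclaim,
				       size_t *visited)
{
	unsigned long nr_reclaimed = 0;
	size_t i;

	for (i = 0; i < n_groups; i++) {
		nr_reclaimed += group_pages[i];
		if (nr_reclaimed >= nr_to_reclaim) {
			i++;	/* count the group that satisfied the target */
			break;
		}
	}
	*visited = i;
	return nr_reclaimed;
}
```

Empty groups near the top of the hierarchy cost one iteration each in this model, instead of a full pass through every priority level.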
[[email protected]: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
[[email protected]: use conventional comparison order]
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Ying Han <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Ying Han <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Switch ksm to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the ksm module.
Signed-off-by: Sasha Levin <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Switch hugemem to use the new hashtable implementation. This reduces
the amount of generic unrelated code in hugemem.
This also removes the dynamic allocation of the hash table. The upside
is that we save a pointer dereference when accessing the hashtable, but
we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
doesn't support hugepages.
Signed-off-by: Sasha Levin <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN. It happened to work when initially
implemented as the starting PFN was also aligned but with caching
restarts and isolating in smaller chunks this is no longer always true.
The impact is that the migrate scanner scans outside its current
pageblock. As pfn_valid() is still checked properly it does not cause
any failure and the impact of the bug is that in some cases it will scan
more than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.
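A worked example shows why adding a whole pageblock before aligning overshoots when the starting PFN is not aligned. The SK_ALIGN macro mirrors the usual round-up-to-power-of-two idiom; the "fixed" variant, which aligns pfn + 1 instead, is an assumption about one way to compute the end of the current pageblock and is not taken from the text above.

```c
/* Round x up to the next multiple of a (a must be a power of two). */
#define SK_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Described misuse: a full pageblock is added before aligning. */
static unsigned long end_pfn_buggy(unsigned long low_pfn,
				   unsigned long pageblock_nr_pages)
{
	return SK_ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
}

/* Assumed correction: align to the end of the pageblock holding low_pfn. */
static unsigned long end_pfn_fixed(unsigned long low_pfn,
				   unsigned long pageblock_nr_pages)
{
	return SK_ALIGN(low_pfn + 1, pageblock_nr_pages);
}
```

With 512 pages per pageblock, an aligned start of 1024 gives 1536 in both variants, which is why the original code appeared to work. An unaligned start of 1030 gives 2048 in the first variant and 1536 in the second, so the first one runs past the current pageblock.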
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
"mm: vmscan: save work scanning (almost) empty LRU lists" made
SWAP_CLUSTER_MAX an unsigned long.
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
`int' is an inappropriate type for a number-of-pages counter.
While we're there, use the clamp() macro.
Acked-by: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.
These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.
Use page_add_new_anon_rmap() for these copies. This also puts them on
the proper LRU lists and marks them SwapBacked, so we can get rid of
doing this ad-hoc in the KSM copy code.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Simon Jeons <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Satoru Moriya <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The restart logic for when reclaim operates back to back with compaction
is currently applied on the lruvec level. But this does not make sense,
because the container of interest for compaction is a zone as a whole,
not the zone pages that are part of a certain memory cgroup.
Negative impact is bounded. For one, the code checks that the lruvec
has enough reclaim candidates, so it does not risk getting stuck on a
condition that cannot be fulfilled. And the unfairness of hammering on
one particular memory cgroup to make progress in a zone will be
amortized by the round robin manner in which reclaim goes through the
memory cgroups. Still, this can lead to unnecessary allocation
latencies when the code elects to restart on a hard to reclaim or small
group when there are other, more reclaimable groups in the zone.
Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.
[[email protected]: no need for min_t]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.
Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which is
not necessarily the most obvious way:
fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;
Make this easier by making the force-scan cases explicit and use the
fractionals only in case they are calculated from reclaim history.
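A minimal sketch of the idea, assuming that index 0 of fraction[] refers to the anon list and index 1 to the file list. The enum, its values and scan_target() are illustrative names chosen here, not taken from the commit text:

```c
#include <assert.h>

/* Illustrative names; assumed for this sketch only. */
enum scan_balance { SCAN_FRACT, SCAN_ANON, SCAN_FILE };
enum { LRU_ANON = 0, LRU_FILE = 1 };

/*
 * Return how many pages to scan from one list type.
 * fraction[] and denominator are only consulted for SCAN_FRACT,
 * i.e. when they were calculated from reclaim history.
 * Note: size * fraction[list] may overflow for very large inputs;
 * this sketch does not guard against that.
 */
static unsigned long scan_target(enum scan_balance balance, int list,
				 unsigned long size,
				 const unsigned long fraction[2],
				 unsigned long denominator)
{
	switch (balance) {
	case SCAN_ANON:		/* force-scan anon, skip file */
		return list == LRU_ANON ? size : 0;
	case SCAN_FILE:		/* force-scan file, skip anon */
		return list == LRU_FILE ? size : 0;
	case SCAN_FRACT:
	default:
		assert(denominator != 0);
		return size * fraction[list] / denominator;
	}
}
```

With this shape, a force-scan case reads as SCAN_ANON or SCAN_FILE at the call site instead of being encoded as a 1/0/1 tuple.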
[[email protected]: avoid using uninitialized_var()]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix comment style and elaborate on why anonymous memory is force-scanned
when file cache runs low.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A swappiness of 0 has a slightly different meaning for global reclaim
(may swap if file cache really low) and memory cgroup reclaim (never
swap, ever).
In addition, global reclaim at highest priority will scan all LRU lists
equal to their size and ignore other balancing heuristics. UNLESS
swappiness forbids swapping, then the lists are balanced based on recent
reclaim effectiveness. UNLESS file cache is running low, then anonymous
pages are force-scanned.
This (total mess of a) behaviour is implicit and not obvious from the
way the code is organized. At least make it apparent in the code flow
and document the conditions. This will make it easier to come up with sane
semantics later.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Satoru Moriya <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
number of pages is scanned from the LRU lists on each iteration, to make
progress.
Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.
Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node. The number of expected empty LRU lists is thus
memcgs * (nodes - 1) * lru types
Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc. Avoid
that.
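The two points above can be sketched as follows. The helper names are hypothetical, and the value 32 for SWAP_CLUSTER_MAX is an assumption made for the example rather than something stated in this message:

```c
#include <assert.h>

/* Assumed value, for illustration only. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Forced minimum scan target: never larger than the list itself,
 * so an empty list yields 0 and no isolation work is attempted.
 */
static unsigned long forced_min_scan(unsigned long lru_size)
{
	return lru_size < SWAP_CLUSTER_MAX ? lru_size : SWAP_CLUSTER_MAX;
}

/*
 * Expected number of empty LRU lists when each memcg's memory
 * stays on a single node: memcgs * (nodes - 1) * lru types.
 */
static unsigned long expected_empty_lists(unsigned long memcgs,
					  unsigned long nodes,
					  unsigned long lru_types)
{
	assert(nodes >= 1);
	return memcgs * (nodes - 1) * lru_types;
}
```

For example, 100 memcgs on a 4-node machine with 4 LRU types would give 1200 lists expected to be empty under that assumption.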
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit e9868505987a ("mm, vmscan: only evict file pages when we have
plenty") makes a point of not going for anonymous memory while there is
still enough inactive cache around.
The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:
200M-memcg-defconfig-j2
                               vanilla                 patched
Real time             454.06 ( +0.00%)        453.71 ( -0.08%)
User time             668.57 ( +0.00%)        668.73 ( +0.02%)
System time           128.92 ( +0.00%)        129.53 ( +0.46%)
Swap in              1246.80 ( +0.00%)        814.40 (-34.65%)
Swap out             1198.90 ( +0.00%)        827.00 (-30.99%)
Pages allocated  16431288.10 ( +0.00%)   16434035.30 ( +0.02%)
Major faults          681.50 ( +0.00%)        593.70 (-12.86%)
THP faults            237.20 ( +0.00%)        242.40 ( +2.18%)
THP collapse          241.20 ( +0.00%)        248.50 ( +3.01%)
THP splits            157.30 ( +0.00%)        161.40 ( +2.59%)
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.
[[email protected]: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Bartlomiej Zolnierkiewicz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Acked-by: Sha Zhengju <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, when a memcg OOM happens, the OOM dump messages still show
global state and provide little useful information to users. This patch
prints more pointed memcg page statistics for memcg OOM.
Based on Michal's advice, we take the hierarchy into consideration: suppose
we trigger an OOM on A's limit
        root_memcg
            |
            A (use_hierarchy=1)
           / \
          B   C
          |
          D
then the printed info will be:
Memory cgroup stats for /A:...
Memory cgroup stats for /A/B:...
Memory cgroup stats for /A/C:...
Memory cgroup stats for /A/B/D:...
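The listing above can be reproduced with a small sketch that walks the subtree below the group whose limit was hit. The struct group type and dump_subtree() are hypothetical. The level-by-level order is chosen only to match the listing and is not a claim about the order in which the kernel iterates the hierarchy:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHILDREN 4
#define MAX_NODES 16

/* Hypothetical stand-in for a memory cgroup in a hierarchy. */
struct group {
	const char *path;
	const struct group *children[MAX_CHILDREN];
	int nr_children;
};

/*
 * Write one "Memory cgroup stats for <path>:..." line per group in
 * the subtree rooted at @root, level by level, into @buf.
 * Returns the number of groups visited, or -1 if @buf is too small.
 */
static int dump_subtree(const struct group *root, char *buf, size_t len)
{
	const struct group *queue[MAX_NODES];
	int head = 0, tail = 0;
	size_t used = 0;

	assert(len > 0);
	buf[0] = '\0';
	queue[tail++] = root;

	while (head < tail) {
		const struct group *g = queue[head++];
		int n = snprintf(buf + used, len - used,
				 "Memory cgroup stats for %s:...\n", g->path);

		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;

		/* Queue the children so they print after this level. */
		for (int i = 0; i < g->nr_children; i++) {
			assert(tail < MAX_NODES);
			queue[tail++] = g->children[i];
		}
	}
	return head;
}
```

Calling dump_subtree() on A visits only A and its descendants, which mirrors the fact that root_memcg is not part of the report for an OOM on A's limit.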
Following are samples of oom output:
(1) Before change:
mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[<ffffffff8167fbfb>] dump_header+0x83/0x1ca
..... (call trace)
[<ffffffff8168a818>] page_fault+0x28/0x30
<<<<<<<<<<<<<<<<<<<<< memcg specific information
Task in /A/B/D killed as a result of limit of /A
memory: usage 101376kB, limit 101376kB, failcnt 57
memory+swap: usage 101376kB, limit 101376kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
<<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
......
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 173
......
CPU 3: hi: 186, btch: 31 usd: 130
<<<<<<<<<<<<<<<<<<<<< print global page state
active_anon:92963 inactive_anon:40777 isolated_anon:0
active_file:33027 inactive_file:51718 isolated_file:0
unevictable:0 dirty:3 writeback:0 unstable:0
free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
mapped:20278 shmem:35971 pagetables:5885 bounce:0
free_cma:0
<<<<<<<<<<<<<<<<<<<<< print per zone page state
Node 0 DMA free:15836kB ... all_unreclaimable? no
lowmem_reserve[]: 0 3175 3899 3899
Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
lowmem_reserve[]: 0 0 724 724
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
120710 total pagecache pages
0 pages in swap cache
<<<<<<<<<<<<<<<<<<<<< print global swap cache stat
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 499708kB
Total swap = 499708kB
1040368 pages RAM
58678 pages reserved
169065 pages shared
173632 pages non-shared
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2693] 0 2693 6005 1324 17 0 0 god
[ 2754] 0 2754 6003 1320 16 0 0 god
[ 2811] 0 2811 5992 1304 18 0 0 god
[ 2874] 0 2874 6005 1323 18 0 0 god
[ 2935] 0 2935 8720 7742 21 0 0 mal-30
[ 2976] 0 2976 21520 17577 42 0 0 mal-80
Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
We can see that the messages dumped by show_free_areas() are lengthy and
provide very limited information about the memcg that just hit OOM.
(2) After change
mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[<ffffffff8167fd0b>] dump_header+0x83/0x1d1
.......(call trace)
[<ffffffff8168a918>] page_fault+0x28/0x30
Task in /A/B/D killed as a result of limit of /A
<<<<<<<<<<<<<<<<<<<<< memcg specific information
memory: usage 102400kB, limit 102400kB, failcnt 140
memory+swap: usage 102400kB, limit 102400kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2260] 0 2260 6006 1325 18 0 0 god
[ 2383] 0 2383 6003 1319 17 0 0 god
[ 2503] 0 2503 6004 1321 18 0 0 god
[ 2622] 0 2622 6004 1321 16 0 0 god
[ 2695] 0 2695 8720 7741 22 0 0 mal-30
[ 2704] 0 2704 21520 17839 43 0 0 mal-80
Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
This version provides more pointed info for memcg in "Memory cgroup stats
for XXX" section.
Signed-off-by: Sha Zhengju <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the warning:
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <[email protected]>
Cc: Neil Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
"So from the depth of frozen Minnesota, here's the powerpc pull request
for 3.9. It has a few interesting highlights, in addition to the
usual bunch of bug fixes, minor updates, embedded device tree updates
and new boards:
- Hand tuned asm implementation of SHA1 (by Paulus & Michael
Ellerman)
- Support for Doorbell interrupts on Power8 (kind of fast
thread-thread IPIs) by Ian Munsie
- Long overdue cleanup of the way we handle relocation of our open
firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard
- Support for saving/restoring & context switching the PPR (Processor
Priority Register) on server processors that support it. This
allows the kernel to preserve thread priorities established by
userspace. By Haren Myneni.
- DAWR (new watchpoint facility) support on Power8 by Michael Neuling
- Ability to change the DSCR (Data Stream Control Register) which
controls cache prefetching on a running process via ptrace by
Alexey Kardashevskiy
- Support for context switching the TAR register on Power8 (new
branch target register meant to be used by some new specific
userspace perf event interrupt facility which is yet to be enabled)
by Ian Munsie.
- Improve preservation of the CFAR register (which captures the
origin of a branch) on various exception conditions by Paulus.
- Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
it belongs by Philippe De Muyter
- Support for Transactional Memory on Power8 by Michael Neuling
(based on original work by Matt Evans). For those curious about
the feature, the patch contains a pretty good description."
(See commit db8ff907027b: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
powerpc/kexec: Disable hard IRQ before kexec
powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
powerpc/85xx: bsc9131 - Correct typo in SDHC device node
powerpc/e500/qemu-e500: enable coreint
powerpc/mpic: allow coreint to be determined by MPIC version
powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
powerpc/85xx: Board support for ppa8548
powerpc/fsl: remove extraneous DIU platform functions
arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
powerpc: Documentation for transactional memory on powerpc
powerpc: Add transactional memory to pseries and ppc64 defconfigs
powerpc: Add config option for transactional memory
powerpc: Add transactional memory to POWER8 cpu features
powerpc: Add new transactional memory state to the signal context
powerpc: Hook in new transactional memory code
powerpc: Routines for FP/VSX/VMX unavailable during a transaction
powerpc: Add transactional memory unavaliable execption handler
powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
powerpc: Add FP/VSX and VMX register load functions for transactional memory
powerpc: Add helper functions for transactional memory context switching
...
|
|
Disable hard IRQs before kexec'ing a new kernel image.
Not doing so can result in corrupted data in the memory segments
reserved for the new kernel.
Signed-off-by: Phileas Fogg <[email protected]>
CC: <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin
Pull small blackfin update from Bob Liu.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin:
blackfin: time-ts: Remove duplicate assignment
blackfin: pm: fix build error
blackfin: sync data in blackfin write buffer
blackfin: use bitmap library functions
blackfin: mem_init: update dmc config register
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller.
The bulk of this is optimized page copying/clearing and cache flushing
(virtual caches are lovely) by John David Anglin.
* 'parisc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (31 commits)
arch/parisc/include/asm: use ARRAY_SIZE macro in mmzone.h
parisc: remove empty lines and unnecessary #ifdef coding in include/asm/signal.h
parisc: sendfile and sendfile64 syscall cleanups
parisc: switch to available compat_sched_rr_get_interval implementation
parisc: fix fallocate syscall
parisc: fix error return codes for rt_sigaction and rt_sigprocmask
parisc: convert msgrcv and msgsnd syscalls to use compat layer
parisc: correctly wire up mq_* functions for CONFIG_COMPAT case
parisc: fix personality on 32bit kernel
parisc: wire up process_vm_readv, process_vm_writev, kcmp and finit_module syscalls
parisc: led driver requires CONFIG_VM_EVENT_COUNTERS
parisc: remove unused compat_rt_sigframe.h header
parisc/mm/fault.c: Port OOM changes to do_page_fault
parisc: space register variables need to be in native length (unsigned long)
parisc: fix ptrace breakage
parisc: always detect multiple physical ranges
parisc: ensure that mmapped shared pages are aligned at SHMLBA addresses
parisc: disable preemption while flushing D- or I-caches through TMPALIAS region
parisc: remove IRQF_DISABLED
parisc: fixes and cleanups in page cache flushing (4/4)
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull ia64 build breakage fix from Tony Luck.
* tag 'please-pull-fix-ia64-build' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
sched: move RR_TIMESLICE from sysctl.h to rt.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading update from Peter Anvin:
"This patchset lets us update the CPU microcode very, very early in
initialization if the BIOS fails to do so (never happens, right?)
This is handy for dealing with things like the Atom erratum where we
have to run without PSE because microcode loading happens too late.
As I mentioned in the x86/mm push request it depends on that
infrastructure but it is otherwise a standalone feature."
* 'x86/microcode' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Kconfig: Make early microcode loading a configuration feature
x86/mm/init.c: Copy ucode from initrd image to kernel memory
x86/head64.c: Early update ucode in 64-bit
x86/head_32.S: Early update ucode in 32-bit
x86/microcode_intel_early.c: Early update ucode on Intel's CPU
x86/tlbflush.h: Define __native_flush_tlb_global_irq_disabled()
x86/microcode_intel_lib.c: Early update ucode on Intel's CPU
x86/microcode_core_early.c: Define interfaces for early loading ucode
x86/common.c: load ucode in 64 bit or show loading ucode info in 32 bit on AP
x86/common.c: Make have_cpuid_p() a global function
x86/microcode_intel.h: Define functions and macros for early loading ucode
x86, doc: Documentation for early microcode loading
|
|
With commit 8170e6bed465 ("x86, 64bit: Use a #PF handler to materialize
early mappings on demand") we started hitting an early bootup crash
where the Xen hypervisor would inform us that:
(XEN) d7:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffea000005b2d0:
(XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]----
.. that Xen was unable to context switch back to dom0.
Looking at the calling stack we find:
[<ffffffff8103feba>] xen_get_user_pgd+0x5a <--
[<ffffffff8103feba>] xen_get_user_pgd+0x5a
[<ffffffff81042d27>] xen_write_cr3+0x77
[<ffffffff81ad2d21>] init_mem_mapping+0x1f9
[<ffffffff81ac293f>] setup_arch+0x742
[<ffffffff81666d71>] printk+0x48
We are trying to figure out whether we need to update the user PGD as
well. Please keep in mind that under 64-bit PV guests we have a limited
number of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
and user-space. As such the Linux pvops'fied version of write_cr3
checks if it has to update the user-space cr3 as well.
That clearly is not needed during early bootup. The recent changes (see
above git commit) streamline the x86 page table allocation to be much
simpler (and also incidentally the #PF handler ends up in spirit being
similar to how the Xen toolstack sets up the initial page-tables).
The fix is to have an early-bootup version of write_cr3 that just loads
the kernel %cr3. The later version, which also handles user-page
modifications, will be used after the initial page tables have been
set up.
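The shape of the fix can be sketched in plain C as an operation pointer that starts out as the early variant and is switched later. Every name below (write_cr3_op, user_pgd_of() and so on) is hypothetical, and the variables merely stand in for hardware and hypervisor state. This is not the Xen pvops code:

```c
#include <assert.h>

/* Stand-ins for the loaded page table bases; purely illustrative. */
static unsigned long kernel_cr3;
static unsigned long user_cr3;

/* Hypothetical helper: the user page table paired with a kernel one. */
static unsigned long user_pgd_of(unsigned long cr3)
{
	return cr3 + 0x1000;
}

/* Early boot: only the kernel page table base is loaded. */
static void write_cr3_early(unsigned long cr3)
{
	kernel_cr3 = cr3;
}

/* After init: the user-space page table is switched as well. */
static void write_cr3_full(unsigned long cr3)
{
	assert(cr3 != 0);
	kernel_cr3 = cr3;
	user_cr3 = user_pgd_of(cr3);
}

/* The active implementation starts out as the early variant. */
static void (*write_cr3_op)(unsigned long) = write_cr3_early;

/* Called once the initial page tables have been set up. */
static void switch_to_full_write_cr3(void)
{
	write_cr3_op = write_cr3_full;
}
```

Until switch_to_full_write_cr3() runs, calls through write_cr3_op never touch the user-side state, which is the property the early-boot path needs.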
[ hpa: removed a redundant #ifdef and made the new function __init.
Also note that x86-32 already has such an early xen_write_cr3. ]
Tested-by: "H. Peter Anvin" <[email protected]>
Reported-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The code requires the use of the proper per-exception-vector stub
functions (set up as the early_idt_handlers[] array - note the 's') that
make sure to set up the error vector number. This is true regardless of
whether CONFIG_EARLY_PRINTK is set or not.
Why? The stack offset for the comparison of __KERNEL_CS won't be right
otherwise, nor will the new check (from commit 8170e6bed465: "x86,
64bit: Use a #PF handler to materialize early mappings on demand") for
the page fault exception vector.
Acked-by: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|