Age | Commit message (Collapse) | Author | Files | Lines |
|
Here're the last pieces of page fault accounting that were still done
outside handle_mm_fault() where we still have regs==NULL when calling
handle_mm_fault():
arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault
arch/sparc/mm/fault_32.c: force_user_fault
arch/um/kernel/trap.c: handle_page_fault
mm/gup.c: faultin_page
fixup_user_fault
mm/hmm.c: hmm_vma_fault
mm/ksm.c: break_ksm
Some of them has the issue of duplicated accounting for page fault
retries. Some of them didn't do the accounting at all.
This patch cleans all these up by letting handle_mm_fault() to do per-task
page fault accounting even if regs==NULL (though we'll still skip the perf
event accountings). With that, we can safely remove all the outliers now.
There's another functional change in that now we account the page faults
to the caller of gup, rather than the task_struct that passed into the gup
code. More information of this can be found at [1].
After this patch, below things should never be touched again outside
handle_mm_fault():
- task_struct.[maj|min]_flt
- PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]
[1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: Page fault accounting cleanups", v5.
This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):
https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
What this series did:
- Correct page fault accounting: we do accounting for a page fault
(no matter whether it's from #PF handling, or gup, or anything else)
only with the one that completed the fault. For example, page fault
retries should not be counted in page fault counters. Same to the
perf events.
- Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
event is used in an adhoc way across different archs.
Case (1): for many archs it's done at the entry of a page fault
handler, so that it will also cover e.g. errornous faults.
Case (2): for some other archs, it is only accounted when the page
fault is resolved successfully.
Case (3): there're still quite some archs that have not enabled
this perf event.
Since this series will touch merely all the archs, we unify this
perf event to always follow case (1), which is the one that makes most
sense. And since we moved the accounting into handle_mm_fault, the
other two MAJ/MIN perf events are well taken care of naturally.
- Unify definition of "major faults": the definition of "major
fault" is slightly changed when used in accounting (not
VM_FAULT_MAJOR). More information in patch 1.
- Always account the page fault onto the one that triggered the page
fault. This does not matter much for #PF handlings, but mostly for
gup. More information on this in patch 25.
Patchset layout:
Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25: Cleanup GUP task_struct pointer since it's not needed any more
This patch (of 25):
This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault(). This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events. To
do this, the pt_regs pointer is passed into handle_mm_fault().
PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.
So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.
The change is generated using coccinelle with the following rule:
// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ying Han <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The powerpc 32-bit implementation of pgtable has nice shortcuts for
accessing kernel PMD and PTE for a given virtual address. Make these
helpers available for all architectures.
[[email protected]: microblaze: fix page table traversal in setup_rt_frame()]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: s/pmd_ptr_k/pmd_off_k/ in various powerpc places]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Although there're tons of arch-specific page fault handlers, most of them
are still sharing the same initial value of the page fault flags. Say,
merely all of the page fault handlers would allow the fault to be retried,
and they also allow the fault to respond to SIGKILL.
Let's define a default value for the fault flags to replace those initial
page fault flags that were copied over. With this, it'll be far easier to
introduce new fault flag that can be used by all the architectures instead
of touching all the archs.
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Brian Geffon <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Bobby Powers <[email protected]>
Cc: Denis Plotnikov <[email protected]>
Cc: "Dr . David Alan Gilbert" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Martin Cracauer <[email protected]>
Cc: Marty McFadden <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Maya Gokhale <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The UML port uses 4 and 5 level fixups to support higher level page
table directories in the generic VM code.
Implement primitives necessary for the 4th level folding, add walks of
p4d level where appropriate and drop usage of __ARCH_USE_5LEVEL_HACK.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Anatoly Pugachev <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Peter Rosin <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rolf Eike Beer <[email protected]>
Cc: Russell King <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sam Creasey <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Convert files to use SPDX header. All files are licensed under the
GPLv2.
Signed-off-by: Alex Dewar <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.
The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.
The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:
force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.
This also makes it clear force_sig passes current into force_sig_info.
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.
This also makes it clear force_sigsegv always calls force_sig
on the current task.
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Levin, Alexander (Sasha Levin)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a bug in relay signal. It assumes that when a signal is
relayed the signal never uses a signal independent si_code, such
as SI_USER, SI_KERNEL, SI_QUEUE, ... SI_SIGIO etc. In practice
siginfo was assuming it was relaying a signal with the SIL_FAULT
layout. As that is the common cases for the signals it supported
that is a reasonable assumption.
Further user mode linux must be very careful when relaying different
kinds of signals to prevent an information leak. This means simply
increasing the kinds of signals that are handled in relay_signal
is non-trivial.
Therefore use siginfo_layout and force_sig_fault to simplify
the signal relaying in relay_signal.
By taking advantage of the fact that user mode linux only works
on x86 and x86_64 we can assume that si_trapno can be ignored,
and that si_errno is always zero.
For the signals SIGLL, SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP the only
fault handler I know of that sets si_errno is SIGTRAP TRAP_HWBKPT on a
few oddball architectures. Those architectures have been modified to
use force_sig_ptrace_errno_trap.
Similarly only a few architectures set __ARCH_SI_TRAPNO.
At the point uml supports those architectures again these additional
cases can be examined and supported if desired in relay_signal.
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Martin Pärtel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Fixes: d3c1cfcdb43e ("um: pass siginfo to guest process")
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Filling in struct siginfo before calling force_sig_info a tedious and
error prone process, where once in a great while the wrong fields
are filled out, and siginfo has been inconsistently cleared.
Simplify this process by using the helper force_sig_fault. Which
takes as a parameters all of the information it needs, ensures
all of the fiddly bits of filling in struct siginfo are done properly
and then calls force_sig_info.
In short about a 5 line reduction in code for every time force_sig_info
is called, which makes the calling function clearer.
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls clear_siginfo for clarity.
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
The function clear_siginfo is just a nice wrapper around memset so
this results in no functional change. This change makes mistakes
a little more difficult and it makes it clearer what is going on.
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
The hashing of %p was designed to restrict kernel addresses. There is
no reason to hash the userspace values seen during a segfault report,
so switch these to %px. (Some architectures already use %lx.)
Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Thomas Meyer <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
<linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
<linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We always have vma->vm_mm around.
Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have to exclude memory locations <= PAGE_SIZE from
the condition and let the kernel mode fault path catch it.
Otherwise a kernel NULL pointer exception will be reported
as a kernel user space access.
Fixes: d2313084e2c (um: Catch unprotected user memory access)
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Once x86 exports its do_signal(), the prototypes will clash.
Fix the clash and also improve the code a bit: remove the
unnecessary kern_do_signal() indirection. This allows
interrupt_end() to share the 'regs' parameter calculation.
Also remove the unused return code to match x86.
Minimally build and boot tested.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/67c57eac09a589bac3c6c5ff22f9623ec55a184a.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
- remove hppfs ("HonePot ProcFS")
- initial support for musl libc
- uaccess cleanup
- random cleanups and bug fixes all over the place
* 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (21 commits)
um: Don't pollute kernel namespace with uapi
um: Include sys/types.h for makedev(), major(), minor()
um: Do not use stdin and stdout identifiers for struct members
um: Do not use __ptr_t type for stack_t's .ss pointer
um: Fix mconsole dependency
um: Handle tracehook_report_syscall_entry() result
um: Remove copy&paste code from init.h
um: Stop abusing __KERNEL__
um: Catch unprotected user memory access
um: Fix warning in setup_signal_stack_si()
um: Rework uaccess code
um: Add uaccess.h to ldt.c
um: Add uaccess.h to syscalls_64.c
um: Add asm/elf.h to vma.c
um: Cleanup mem_32/64.c headers
um: Remove hppfs
um: Move syscall() declaration into os.h
um: kernel: ksyms: Export symbol syscall() for fixing modpost issue
um/os-Linux: Use char[] for syscall_stub declarations
um: Use char[] for linker script address declarations
...
|
|
If the kernel tries to access user memory without copy_from_user()
a trap will happen as kernel and userspace run in different processes
on the host side. Currently this special page fault cannot be resolved
and will happen over and over again. As result UML will lockup.
This patch allows the page fault code to detect that situation and
causes a panic() such that the root cause of the unprotected memory
access can be found and fixed.
Signed-off-by: Richard Weinberger <[email protected]>
|
|
A missing include file caused build fail.
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Fixes: 70ffdb9393a7 ("mm/fault, arch: Use pagefault_disable() to check for disabled pagefaults in the handler")
Signed-off-by: Ingo Molnar <[email protected]>
|
|
the handler
Introduce faulthandler_disabled() and use it to check for irq context and
disabled pagefaults (via pagefault_disable()) in the pagefault handlers.
Please note that we keep the in_atomic() checks in place - to detect
whether in irq context (in which case preemption is always properly
disabled).
In contrast, preempt_disable() should never be used to disable pagefaults.
With !CONFIG_PREEMPT_COUNT, preempt_disable() doesn't modify the preempt
counter, and therefore the result of in_atomic() differs.
We validate that condition by using might_fault() checks when calling
might_sleep().
Therefore, add a comment to faulthandler_disabled(), describing why this
is needed.
faulthandler_disabled() and pagefault_disable() are defined in
linux/uaccess.h, so let's properly add that include to all relevant files.
This patch is based on a patch from Thomas Gleixner.
Reviewed-and-tested-by: Thomas Gleixner <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Before we had SKAS0 UML had two modes of operation
TT (tracing thread) and SKAS3/4 (separated kernel address space).
TT was known to be insecure and got removed a long time ago.
SKAS3/4 required a few (3 or 4) patches on the host side which never went
mainline. The last host patch is 10 years old.
With SKAS0 mode (separated kernel address space using 0 host patches),
default since 2005, SKAS3/4 is obsolete and can be removed.
Signed-off-by: Richard Weinberger <[email protected]>
|
|
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <[email protected]>
Tested-by: Jan Engelhardt <[email protected]>
Acked-by: Heiko Carstens <[email protected]> # "s390 still compiles and boots"
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
...otherwise me lose user mode regs and the resulting
stack trace is useless.
Signed-off-by: Richard Weinberger <[email protected]>
|
|
As UML uses an alternative signal stack we cannot use
the current stack pointer for stack dumping if UML itself
dies by SIGSEGV. To bypass this issue we save regs taken
from mcontext in our segv handler into thread_struct and
use these regs to obtain the stack pointer in show_stack().
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occuring in
kernel context - which could be handled more gracefully - from
user-triggered faults.
Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: azurIt <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Kernel faults are expected to handle OOM conditions gracefully (gup,
uaccess etc.), so they should never invoke the OOM killer. Reserve this
for faults triggered in user context when it is the only option.
Most architectures already do this, fix up the remaining few.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: azurIt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML changes from Richard Weinberger:
"UML receives this time only cleanups.
The most outstanding change is the 'include "foo.h"' do 'include
<foo.h>' conversion done by Al Viro.
It touches many files, that's why the diffstat is rather big."
* 'for-linus-37rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
typo in UserModeLinux-HOWTO
hppfs: fix the return value of get_inode()
hostfs: drop vmtruncate
um: get rid of pointless include "..." where include <...> will do
um: move sysrq.h out of include/shared
um/x86: merge 32 and 64 bit variants of ptrace.h
um/x86: merge 32 and 64bit variants of checksum.h
|
|
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
.fault now can retry. The retry can break state machine of .fault. In
filemap_fault, if page is miss, ra->mmap_miss is increased. In the second
try, since the page is in page cache now, ra->mmap_miss is decreased. And
these are done in one fault, so we can't detect random mmap file access.
Add a new flag to indicate .fault is tried once. In the second try, skip
ra->mmap_miss decreasing. The filemap_fault state machine is ok with it.
I only tested x86, didn't test other archs, but looks the change for other
archs is obvious, but who knows :)
Signed-off-by: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
UML guest processes now get correct siginfo_t for SIGTRAP, SIGFPE,
SIGILL and SIGBUS. Specifically, si_addr and si_code are now correct
where previously they were si_addr = NULL and si_code = 128.
Signed-off-by: Martin Pärtel <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Commit d065bd810b6d ("mm: retry page fault when blocking on disk
transfer") and commit 37b23e0525d3 ("x86,mm: make pagefault killable")
introduced changes into the x86 pagefault handler for making the page
fault handler retryable as well as killable.
These changes reduce the mmap_sem hold time, which is crucial during OOM
killer invocation.
Port these changes to um.
Signed-off-by: Kautuk Consul <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ksyms.c is down to the stuff defined in various USER_OBJS
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Print a short info about fatal segfaults like other archs do.
Signed-off-by: Richard Weinberger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This allows the callers to now pass down the full set of FAULT_FLAG_xyz
flags to handle_mm_fault(). All callers have been (mechanically)
converted to the new calling convention, there's almost certainly room
for architectures to clean up their code and then add FAULT_FLAG_RETRY
when that support is added.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rather than have the pagefault handler kill a process directly if it gets
a VM_FAULT_OOM, have it call into the OOM killer.
With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
oom killing throttling, oom priority adjustment or selective disabling,
panic on oom, etc), it's silly to unconditionally kill the faulting
process at page fault time. Create a hook for pagefault oom path to call
into instead.
Only converted x86 and uml so far.
[[email protected]: make __out_of_memory() static]
[[email protected]: fix comment]
Signed-off-by: Nick Piggin <[email protected]>
Cc: Jeff Dike <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
UML was panicing in the case of failures of libc calls which shouldn't happen.
This is an overreaction since a failure from libc doesn't normally mean that
kernel data structures are in an unknown state. Instead, the current process
should just be killed if there is no way to recover.
The case that prompted this was a failure of PTRACE_SETREGS restoring the same
state that was read by PTRACE_GETREGS. It appears that when a process tries
to load a bogus value into a segment register, it segfaults (as expected) and
the value is actually loaded and is seen by PTRACE_GETREGS (not expected).
This case is fixed by forcing a fatal SIGSEGV on the process so that it
immediately dies. fatal_sigsegv was added for this purpose. It was declared
as noreturn, so in order to pursuade gcc that it actually does not return, I
added a call to os_dump_core (and declared it noreturn) so that I get a core
file if somehow the process survives.
All other calls in arch/um/os-Linux/skas/process.c got the same treatment,
with failures causing the process to die instead of a kernel panic, with some
exceptions.
userspace_tramp exits with status 1 if anything goes wrong there. That will
cause start_userspace to return an error. copy_context_skas0 and
map_stub_pages also now return errors instead of panicing. Callers of thes
functions were changed to check for errors and do something appropriate.
Usually that's to return an error to their callers.
check_skas3_ptrace_faultinfo just exits since that's too early to do anything
else.
save_registers, restore_registers, and init_registers now return status
instead of panicing on failure, with their callers doing something
appropriate.
There were also duplicate declarations of save_registers and restore_registers
in os.h - these are gone.
I noticed and fixed up some whitespace damage.
Signed-off-by: Jeff Dike <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Large pieces of include/asm/pgtable.h were unused cruft.
This uncovered arch/um/kernel/trap.c needing skas.h in order to get
ptrace_faultinfo.
Signed-off-by: Jeff Dike <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Tidy kern_util.h. It turns out that most of the function declarations
aren't used, so they can go away. os.h no longer includes
kern_util.h, so files which got it through os.h now need to include it
directly. A number of other files never needed it, so these includes
are deleted.
The structure which was used to pass signal handlers from the kernel
side to the userspace side is gone. Instead, the handlers are
declared here, and used directly from libc code. This allows
arch/um/os-Linux/trap.c to be deleted, with its remnants being moved
to arch/um/os-Linux/skas/trap.c.
arch/um/os-Linux/tty.c had its inclusions changed, and it needed some
style attention, so it got tidied.
Signed-off-by: Jeff Dike <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
bugs.c, for both i386 and x86_64, can undergo further cleaning -
The i386 arch_check_bugs only does one thing, so we might as
well inline the cmov checking.
The i386 includes can be trimmed down a bit.
arch_init_thread wasn't used, so it is deleted.
The panics in arch_handle_signal are turned into printks
because the process is about to get segfaulted anyway, so something is
dying no matter what happens here. Also, the return value was always
the same, so it contained no information, so it can be void instead.
The name is changed to arch_examine_signal because it doesn't handle
anything.
The caller of arch_handle_signal, relay_signal, does things in
a different order. The kernel-mode signal check is now first, which
puts everything else together, making things a bit clearer conceptually.
[[email protected]: coding-style fixes]
Signed-off-by: Jeff Dike <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().
A cgroup init has it's tsk->pid == 1.
A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.
Changelog:
2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().
2.6.21-mm2-pidns2:
- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.
[[email protected]: fix comment]
[[email protected]: Use is_global_init() in arch/m32r/mm/fault.c]
[[email protected]: kernel/pid.c: remove unused exports]
[[email protected]: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Pavel Emelianov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Herbert Poetzel <[email protected]>
Cc: Kirill Korotaev <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|