|
Stefan Agner reported a bug when using zram on 32-bit Arm machines
with RAM above the 4GB address boundary:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = a27bd01c
[00000000] *pgd=236a0003, *pmd=1ffa64003
Internal error: Oops: 207 [#1] SMP ARM
Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
Hardware name: BCM2711
PC is at zs_map_object+0x94/0x338
LR is at zram_bvec_rw.constprop.0+0x330/0xa64
pc : [<c0602b38>] lr : [<c0bda6a0>] psr: 60000013
sp : e376bbe0 ip : 00000000 fp : c1e2921c
r10: 00000002 r9 : c1dda730 r8 : 00000000
r7 : e8ff7a00 r6 : 00000000 r5 : 02f9ffa0 r4 : e3710000
r3 : 000fdffe r2 : c1e0ce80 r1 : ebf979a0 r0 : 00000000
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 30c5383d Table: 235c2a80 DAC: fffffffd
Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
Stack: (0xe376bbe0 to 0xe376c000)
As it turns out, zsmalloc (the allocator backing zram) needs to know
the maximum memory size, which is defined in MAX_PHYSMEM_BITS when
CONFIG_SPARSEMEM is set, or in MAX_POSSIBLE_PHYSMEM_BITS on the x86
architecture.
The same problem will be hit on all 32-bit architectures that have a
physical address space larger than 4GB and happen to not enable sparsemem
and include asm/sparsemem.h from asm/pgtable.h.
After the initial discussion, I suggested just always defining
MAX_POSSIBLE_PHYSMEM_BITS whenever CONFIG_PHYS_ADDR_T_64BIT is
set, or provoking a build error otherwise. This addresses all
configurations that can currently have this runtime bug, but
leaves all other configurations unchanged.
I looked up the possible number of bits in source code and
datasheets; here is what I found:
- on ARC, CONFIG_ARC_HAS_PAE40 controls whether 32 or 40 bits are used
- on ARM, CONFIG_ARM_LPAE enables 40-bit addressing; without it we never
support more than 32 bits, even though supersections in theory allow
up to 40 bits as well (see the sketch after this list).
- on MIPS, some MIPS32r1 or later chips support 36 bits, and MIPS32r5
XPA supports up to 60 bits in theory, but 40 bits are more than
anyone will ever ship
- On PowerPC, there are three different implementations of 36 bit
addressing, but 32-bit is used without CONFIG_PTE_64BIT
- On RISC-V, the normal page table format can support 34 bit
addressing. There is no highmem support on RISC-V, so anything
above 2GB is unused, but it might be useful to eventually support
CONFIG_ZRAM for high pages.
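Concretely, for the ARM entry above, the fix boils down to definitions
along these lines (a minimal sketch; per the notes, the non-LPAE limit
is 32 bits and the LPAE limit is 40 bits):
  /* arch/arm/include/asm/pgtable-2level.h: classic MMU, 32-bit PAs */
  #define MAX_POSSIBLE_PHYSMEM_BITS   32
  /* arch/arm/include/asm/pgtable-3level.h: LPAE allows 40-bit PAs */
  #define MAX_POSSIBLE_PHYSMEM_BITS   40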
Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
Fixes: 02390b87a945 ("mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS")
Acked-by: Thomas Bogendoerfer <[email protected]>
Reviewed-by: Stefan Agner <[email protected]>
Tested-by: Stefan Agner <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Link: https://lore.kernel.org/linux-mm/bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch/
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
All architectures define pte_index() as
(address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)
and all architectures define pte_offset_kernel() as an entry in the array
of PTEs indexed by the pte_index().
For most architectures, the pte_offset_kernel() implementation relies
on the availability of pmd_page_vaddr() that converts a PMD entry value to
the virtual address of the page containing PTEs array.
Let's move x86 definitions of the PTE accessors to the generic place in
<linux/pgtable.h> and then simply drop the respective definitions from the
other architectures.
The architectures that didn't provide pmd_page_vaddr() are updated to have
that defined.
The generic implementation of pte_offset_kernel() can be overridden by an
architecture and alpha makes use of this because it has special ordering
requirements for its version of pte_offset_kernel().
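In sketch form, the generic definitions that land in <linux/pgtable.h>
look like this (modulo the exact guards used upstream):
  static inline unsigned long pte_index(unsigned long address)
  {
          return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
  }
  #ifndef pte_offset_kernel
  static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
  {
          /* pmd_page_vaddr() yields the virtual address of the PTE array */
          return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
  }
  #define pte_offset_kernel pte_offset_kernel
  #endif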
[[email protected]: v2]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: update]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: update]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix x86 warning]
[[email protected]: fix powerpc build]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pmd_present() is expected to test positive after pmdp_mknotpresent(), as
the PMD entry still points to a valid huge page in memory.
pmdp_mknotpresent() merely invalidates the given PMD entry from the MMU's
perspective while still holding on to the valid huge page referred to by
pmd_page(), so pmd_present() continues to test positive. Given the
helper's name, this creates the following counter-intuitive situation:
[pmd_present(pmd_mknotpresent(pmd)) = true]
This renames pmd_mknotpresent() as pmd_mkinvalid(), reflecting the
helper's functionality more accurately, so the same outcome now reads
naturally instead. This does not create any functional change.
[pmd_present(pmd_mkinvalid(pmd)) = true]
This is not applicable to platforms that define their own
pmdp_invalidate() via __HAVE_ARCH_PMDP_INVALIDATE. The suggestion to
rename came up during an earlier discussion here:
https://patchwork.kernel.org/patch/11019637/
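For illustration, the main user of the renamed helper is the generic
pmdp_invalidate(); a sketch based on mm/pgtable-generic.c:
  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                        pmd_t *pmdp)
  {
          /* Invalidate the entry for the MMU; pmd_present() still tests
           * positive on the value written back. */
          pmd_t old = pmdp_establish(vma, address, pmdp,
                                     pmd_mkinvalid(*pmdp));
          flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
          return old;
  }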
[[email protected]: change pmd_mknotvalid() to pmd_mkinvalid() per Will]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove all traces of Stage-2 and HYP page table support.
Signed-off-by: Marc Zyngier <[email protected]>
Acked-by: Olof Johansson <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Vladimir Murzin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
|
|
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_leaf() functions/macros.
For arm pmd_large() already exists and does what we want. So simply
provide the generic pmd_leaf() name.
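A minimal sketch of the arm change (exact header placement assumed):
  /* Generic page-walk code asks "is this entry a leaf?"; on arm the
   * existing pmd_large() predicate already answers that question. */
  #define pmd_leaf(pmd)   pmd_large(pmd)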
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Price <[email protected]>
Cc: Russell King <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexandre Ghiti <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Hogan <[email protected]>
Cc: James Morse <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: "Liang, Kan" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Zong Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 59 temple place suite 330 boston ma 02111
1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 136 file(s).
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexios Zavras <[email protected]>
Reviewed-by: Allison Randal <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Currently, PTE special support is turned on in per-architecture header
files. Most of the time it is defined in arch/*/include/asm/pgtable.h,
sometimes depending on other per-architecture static definitions.
This patch introduces a new configuration variable to manage this
directly in the Kconfig files. It will later replace
__HAVE_ARCH_PTE_SPECIAL.
Here are notes for some architectures where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:
arm
__HAVE_ARCH_PTE_SPECIAL which is currently defined in
arch/arm/include/asm/pgtable-3level.h which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
- arch/powerpc/include/asm/book3s/64/pgtable.h
- arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.
sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__), which are defined through the compiler in
sparc/Makefile if !SPARC32, which I assume means SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64.
There is no functional change introduced by this patch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Laurent Dufour <[email protected]>
Suggested-by: Jerome Glisse <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Aneesh Kumar K . V" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Christophe LEROY <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ARM LPAE doesn't have hardware dirty/accessed bits.
generic_pmdp_establish() is the right implementation of pmdp_establish
for this case.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In response to compile breakage introduced by a series that added the
pud_write helper to x86, Stephen notes:
did you consider using the other paradigm:
In arch include files:
#define pud_write pud_write
static inline int pud_write(pud_t pud)
.....
Then in include/asm-generic/pgtable.h:
#ifndef pud_write
static inline int pud_write(pud_t pud)
{
....
}
#endif
If you had, then the powerpc code would have worked ... ;-) and many
of the other interfaces in include/asm-generic/pgtable.h are
protected that way ...
Given that some architectures already define pmd_write() as a macro,
it's a net reduction to drop the definition of __HAVE_ARCH_PMD_WRITE.
Link: http://lkml.kernel.org/r/151129126721.37405.13339850900081557813.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Suggested-by: Stephen Rothwell <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Oliver OHalloran <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Russell King <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently pmd_mknotpresent will use a zero entry to represent an
invalidated pmd.
Unfortunately this definition clashes with pmd_none, so a race condition
can occur if zap_pmd_range sees pmd_none whilst __split_huge_pmd_locked
is running concurrently and has just called pmdp_invalidate.
This patch fixes the race condition by modifying pmd_mknotpresent to
create non-zero faulting entries (as is done on other architectures),
removing the ambiguity with pmd_none.
[[email protected]: using L_PMD_SECT_VALID instead of PMD_TYPE_SECT]
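Per the bracketed note, the fix clears only L_PMD_SECT_VALID so the
entry stays non-zero; a sketch:
  static inline pmd_t pmd_mknotpresent(pmd_t pmd)
  {
          /* Keep the rest of the descriptor: a non-zero faulting entry
           * can no longer be confused with pmd_none(). */
          return __pmd(pmd_val(pmd) & ~L_PMD_SECT_VALID);
  }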
Fixes: 8d9625070073 ("ARM: mm: Transparent huge page support for LPAE systems.")
Cc: <[email protected]> # 3.11+
Reported-by: Kirill A. Shutemov <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Steve Capper <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
In a subsequent patch, pmd_mknotpresent will clear the valid bit of the
pmd entry, resulting in a not-present entry from the hardware's
perspective. Unfortunately, pmd_present simply checks for a non-zero pmd
value and will therefore continue to return true even after a
pmd_mknotpresent operation. Since pmd_mknotpresent is only used for
managing huge entries, this is only an issue for the 3-level case.
This patch fixes the 3-level pmd_present implementation to take into
account the valid bit. For bisectability, the change is made before the
fix to pmd_mknotpresent.
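A sketch of the revised 3-level predicate, testing the valid bit rather
than any non-zero value:
  /* arch/arm/include/asm/pgtable-3level.h (sketch) */
  #define pmd_present(pmd)        (pmd_isset((pmd), L_PMD_SECT_VALID))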
[[email protected]: comment update regarding pmd_mknotpresent patch]
Fixes: 8d9625070073 ("ARM: mm: Transparent huge page support for LPAE systems.")
Cc: <[email protected]> # 3.11+
Cc: Russell King <[email protected]>
Cc: Steve Capper <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
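The asm-generic default, roughly:
  /* include/asm-generic/pgtable.h (sketch): an overridable default */
  #ifndef has_transparent_hugepage
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  #define has_transparent_hugepage() 1
  #else
  #define has_transparent_hugepage() 0
  #endif
  #endif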
Signed-off-by: Hugh Dickins <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andres Lagar-Cavilla <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Ning Qu <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Acked-by: David S. Miller <[email protected]>
Acked-by: Vineet Gupta <[email protected]> [arch/arc]
Acked-by: Gerald Schaefer <[email protected]> [arch/s390]
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was issued, because the
syscall can be invoked on a THP page.
This patch adds pmd_mkclean for THP MADV_FREE support.
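A minimal sketch, following the existing PMD_BIT_FUNC pattern in
arch/arm/include/asm/pgtable-3level.h (the dirty-bit name is assumed
from the surrounding THP code):
  PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);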
Signed-off-by: Minchan Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Russell King <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Daniel Micay <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jason Evans <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.
pmdp_splitting_flush() is not needed either: on splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(), and pmdp_clear_flush() will do the
IPI as needed for fast_gup.
[[email protected]: fix unterminated ifdef in header file]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
|
|
With PROT_NONE, the traditional page table manipulation functions are
sufficient.
[[email protected]: fix compiler warning in pmdp_invalidate()]
[[email protected]: fix build with STRICT_MM_TYPECHECKS]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Aneesh Kumar <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We've replaced the remap_file_pages(2) implementation with emulation.
Nobody creates non-linear mappings anymore.
This patch also adjusts __SWP_TYPE_SHIFT, effectively increasing the
maximum size of a possible swap file to 128G.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add support for initial write protection of VM memslots. This patch
series assumes that huge PUDs will not be used in 2nd-stage tables,
which is always valid on ARMv7.
Acked-by: Christoffer Dall <[email protected]>
Signed-off-by: Mario Smarduch <[email protected]>
|
|
Activate the RCU fast_gup for ARM. We also need to force THP splits to
broadcast an IPI so that we block in the fast_gup page walker. As THP
splits are comparatively rare, this should not lead to a noticeable
performance degradation.
The prerequisite functions pud_write and pud_page are also added, as
sketched below.
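A sketch of the two prerequisites (on ARM LPAE a pud entry can be
recast as a pmd, since the descriptor formats match at this level;
placement in arch/arm/include/asm/pgtable-3level.h assumed):
  #define pud_page(pud)   pmd_page(__pmd(pud_val(pud)))
  #define pud_write(pud)  pmd_write(__pmd(pud_val(pud)))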
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Dann Frazier <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Russell King <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We need a mechanism to tag ptes as being special; this indicates that no
attempt should be made to access the underlying struct page * associated
with the pte. This is used by the fast_gup when operating on ptes, as it
has no means to access VMAs (which also contain this information)
locklessly.
The L_PTE_SPECIAL bit is already allocated for LPAE, this patch modifies
pte_special and pte_mkspecial to make use of it, and defines
__HAVE_ARCH_PTE_SPECIAL.
This patch also excludes special ptes from the icache/dcache sync logic.
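In sketch form (the bit position is an unused LPAE software bit;
pte_isset comes from the earlier helper patch):
  #define L_PTE_SPECIAL           (_AT(pteval_t, 1) << 56)
  #define pte_special(pte)        (pte_isset((pte), L_PTE_SPECIAL))
  static inline pte_t pte_mkspecial(pte_t pte)
  {
          pte_val(pte) |= L_PTE_SPECIAL;
          return pte;
  }
  #define __HAVE_ARCH_PTE_SPECIAL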
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Dann Frazier <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Russell King <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For LPAE, we have the following means for encoding writable or dirty
ptes:
                           L_PTE_DIRTY   L_PTE_RDONLY
!pte_dirty && !pte_write        0              1
!pte_dirty &&  pte_write        0              1
 pte_dirty && !pte_write        1              1
 pte_dirty &&  pte_write        1              0
So we can't distinguish between writeable clean ptes and read only
ptes. This can cause problems with ptes being incorrectly flagged as
read only when they are writeable but not dirty.
This patch renumbers L_PTE_RDONLY from AP[2] to a software bit #58,
and adds additional logic to set AP[2] whenever the pte is read only
or not dirty. That way we can distinguish between clean writeable ptes
and read only ptes.
HugeTLB pages will use this new logic automatically.
We need to add some logic to Transparent HugePages to ensure that they
correctly interpret the revised pgprot permissions (L_PTE_RDONLY has
moved and no longer matches PMD_SECT_AP2). In the process of revising
THP, the names of the PMD software bits have been prefixed with L_ to
make them easier to distinguish from their hardware bit counterparts.
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
Long descriptors on ARM are 64 bits, and some pte functions such as
pte_dirty return a bitwise-and of a flag with the pte value. If the
flag to be tested resides in the upper 32 bits of the pte, then we run
into the danger of the result being dropped if downcast.
For example:
gather_stats(page, md, pte_dirty(*pte), 1);
where pte_dirty(*pte) is downcast to an int.
This patch introduces a new macro, pte_isset, which performs the bitwise
AND and then, where needed, a double logical invert to ensure predictable
downcasting. The logical inverse, pte_isclear, is also introduced.
Equivalent pmd functions for Transparent HugePages have also been
added.
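The helpers, roughly as introduced (the (u32) comparison lets the
compiler skip the double invert whenever the flag fits in 32 bits):
  #define pte_isset(pte, val)     ((u32)(val) == (val) ? pte_val(pte) & (val) \
                                          : !!(pte_val(pte) & (val)))
  #define pte_isclear(pte, val)   (!(pte_val(pte) & (val)))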
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
The stage-2 memory attributes are distinct from the Hyp memory
attributes and the Stage-1 memory attributes. We were using the stage-1
memory attributes for stage-2 mappings, causing device mappings to be
mapped as normal memory. Add the S2 equivalent defines for memory
attributes and fix the comments explaining the defines while at it.
Add a prot_pte_s2 field to the mem_type struct and fill out the field
for device mappings accordingly.
Cc: <[email protected]> [3.9+]
Acked-by: Marc Zyngier <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
This patch allows the kernel page tables to be dumped via a debugfs file,
allowing kernel developers to check the layout of the kernel page tables
and verify the various permissions and type settings.
Signed-off-by: Russell King <[email protected]>
|
|
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
|
|
The memory pinning code in uaccess_with_memcpy.c does not check
for HugeTLB or THP pmds, and will enter an infinite loop should
a __copy_to_user or __clear_user occur against a huge page.
This patch adds detection code for huge pages to pin_page_for_write.
As this code can be executed in a fast path it refers to the actual
pmds rather than the vma. If a HugeTLB or THP is found (they have
the same pmd representation on ARM), the page table spinlock is
taken to prevent modification whilst the page is pinned.
On ARM, huge pages are only represented as pmds, thus no huge pud
checks are performed. (For huge puds one would lock the page table
in a similar manner as in the pmd case).
Two helper functions are introduced; pmd_thp_or_huge will check
whether or not a page is huge or transparent huge (which have the
same pmd layout on ARM), and pmd_hugewillfault will detect whether
or not a page fault will occur on write to the page.
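A sketch of the two helpers (LPAE flavour):
  /* Covers both hugetlb and transparent huge pmds, which share a layout */
  #define pmd_thp_or_huge(pmd)    (pmd_huge(pmd) || pmd_trans_huge(pmd))
  /* Will a write to this huge page fault (not young, or not writable)? */
  #define pmd_hugewillfault(pmd)  (!pmd_young(pmd) || !pmd_write(pmd))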
Running the following test (with the chunking from read_zero
removed):
$ dd if=/dev/zero of=/dev/null bs=10M count=1024
Gave: 2.3 GB/s backed by normal pages,
2.9 GB/s backed by huge pages,
5.1 GB/s backed by huge pages, with page mask=HPAGE_MASK.
After some discussion, it was decided not to adopt the HPAGE_MASK,
as this would have a significant detrimental effect on the overall
system latency due to page_table_lock being held for too long.
This could be revisited if split huge page locks are adopted.
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
Support huge pages in KVM/ARM and KVM/ARM64. The pud_huge checking on
the unmap path may feel a bit silly as the pud_huge check is always
defined to false, but the compiler should be smart about this.
Note: This deals only with VMAs marked as huge which are allocated by
users through hugetlbfs only. Transparent huge pages can only be
detected by looking at the underlying pages (or the page tables
themselves) and this patch so far simply maps these on a page-by-page
level in the Stage-2 page tables.
Cc: Catalin Marinas <[email protected]>
Cc: Russell King <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into devel-stable
Conflicts:
arch/arm/kernel/smp.c
Please pull these miscellaneous LPAE fixes I've been collecting for a while
now for 3.11. They've been tested and reviewed by quite a few people, and most
of the patches are pretty trivial. -- Will Deacon.
|
|
The patch adds support for THP (transparent huge pages) to LPAE
systems. When this feature is enabled, the kernel tries to map
anonymous pages as 2MB sections where possible.
Signed-off-by: Catalin Marinas <[email protected]>
[[email protected]: symbolic constants used, value of
PMD_SECT_SPLITTING adjusted, tlbflush.h included in pgtable.h,
added PROT_NONE support.]
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
|
|
This patch adds support for hugetlbfs based on the x86 implementation.
It allows mapping of 2MB sections (see Documentation/vm/hugetlbpage.txt
for usage). The 64K pages configuration is not supported (section size
is 512MB in this case).
Signed-off-by: Catalin Marinas <[email protected]>
[[email protected]: symbolic constants replace numbers in places.
Split up into multiple files, to simplify future non-LPAE support,
removed huge_pmd_share code, as this is very rarely executed,
Added PROT_NONE support].
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
|
|
For 3 levels of paging the PTE_EXT_NG bit will be set for user
address ptes that are written to a page table but not for ptes
created with mk_pte.
This can cause some comparison tests made by pte_same to fail
spuriously and lead to other problems.
To correct this behaviour, we mask off PTE_EXT_NG for any pte that
is present before running the comparison.
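A sketch of the corrected comparison, masking PTE_EXT_NG only out of
present entries as described:
  #define __HAVE_ARCH_PTE_SAME
  #define pte_same(pte_a, pte_b)                                          \
          ((pte_present(pte_a) ? pte_val(pte_a) & ~PTE_EXT_NG             \
                               : pte_val(pte_a))                          \
           == (pte_present(pte_b) ? pte_val(pte_b) & ~PTE_EXT_NG          \
                                  : pte_val(pte_b)))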
Signed-off-by: Steve Capper <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
|
|
For 2-level page tables, PTE_HWTABLE_PTRS describes the offset between
Linux PTEs and hardware PTEs. On LPAE, there is no distinction (since
we have 64-bit descriptors with plenty of space) so PTE_HWTABLE_PTRS
should be 0. Unfortunately, it is wrongly defined as PTRS_PER_PTE,
meaning that current pte table flushing is off by a page. Luckily,
all current LPAE implementations are SMP, so the hardware walker can
snoop L1.
This patch fixes the broken definition.
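The fixed definitions, in sketch form (Linux and hardware PTEs coincide
on LPAE, so the offset terms are zero):
  #define PTE_HWTABLE_PTRS        (0)
  #define PTE_HWTABLE_OFF         (0)
  #define PTE_HWTABLE_SIZE        (PTRS_PER_PTE * sizeof(u64))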
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
This patch applies to PAGE_MASK, PMD_MASK, and PGDIR_MASK, where forcing
unsigned long math truncates the mask at 32 bits. This clearly does bad
things on PAE systems.
This patch fixes the problem by defining these masks as signed
quantities. We then rely on sign extension to do the right thing.
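The pattern, sketched for PMD_MASK (the same change applies to
PAGE_MASK and PGDIR_MASK); the signed int constant is sign-extended
when the mask is widened to a 64-bit physical address:
  /* Before: PMD_SIZE is unsigned long, so the mask truncates at 32 bits */
  #define PMD_MASK        (~(PMD_SIZE - 1))
  /* After: signed arithmetic; sign extension preserves the high bits */
  #define PMD_MASK        (~((1 << PMD_SHIFT) - 1))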
Signed-off-by: Cyril Chemparathy <[email protected]>
Signed-off-by: Vitaly Andrianov <[email protected]>
Reviewed-by: Nicolas Pitre <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Tested-by: Santosh Shilimkar <[email protected]>
Tested-by: Subash Patel <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
Looks like our L_PTE_S2_RDWR definition is slightly wrong,
and is actually write only (see ARM ARM Table B3-9, Stage 2 control
of access permissions). Didn't make a difference for normal pages,
as we OR the flags together, but I'm still wondering how it worked
for Stage-2 mapped devices, such as the GIC.
Brown paper bag time, again.
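A sketch of the fix (HAP[2:1] lives at bits [7:6]: 01 is read-only,
10 is write-only, 11 is read/write; the old value encoded write-only):
  #define L_PTE_S2_RDONLY         (_AT(pteval_t, 1) << 6)  /* HAP[1]   */
  #define L_PTE_S2_RDWR           (_AT(pteval_t, 3) << 6)  /* HAP[2:1] */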
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
KVM uses the stage-2 page tables and the Hyp page table format,
so we define the fields and page protection flags needed by KVM.
The nomenclature is this:
- page_hyp: PL2 code/data mappings
- page_hyp_device: PL2 device mappings (vgic access)
- page_s2: Stage-2 code/data page mappings
- page_s2_device: Stage-2 device mappings (vgic access)
Reviewed-by: Will Deacon <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
PROT_NONE mappings apply the page protection attributes defined by _P000
which translate to PAGE_NONE for ARM. These attributes specify an XN,
RDONLY pte that is inaccessible to userspace. However, on kernels
configured without support for domains, such a pte *is* accessible to
the kernel and can be read via get_user, allowing tasks to read
PROT_NONE pages via syscalls such as read/write over a pipe.
This patch introduces a new software pte flag, L_PTE_NONE, that is set
to identify faulting, present entries.
Signed-off-by: Will Deacon <[email protected]>
|
|
For long-descriptor translation table formats, the ARMv7 architecture
defines the last two bits of the second- and third-level descriptors to
be:
x0b - Invalid
01b - Block (second-level), Reserved (third-level)
11b - Table (second-level), Page (third-level)
This allows us to define L_PTE_PRESENT as (3 << 0) and use this value to
create ptes directly. However, when determining whether a given pte
value is present in the low-level page table accessors, we only need to
check the least significant bit of the descriptor, allowing us to write
faulting, present entries which are required for PROT_NONE mappings.
This patch introduces L_PTE_VALID, which can be used to test whether a
pte should fault, and updates the low-level page table accessors
accordingly.
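In sketch form, for the 3-level format:
  #define L_PTE_VALID     (_AT(pteval_t, 1) << 0)  /* faults if clear */
  #define L_PTE_PRESENT   (_AT(pteval_t, 3) << 0)  /* valid and present */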
Signed-off-by: Will Deacon <[email protected]>
|
|
These have already been removed from the classic MMU in favour of
L_PTE_MT_* macros.
Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
This patch modifies the pgd/pmd/pte manipulation functions to support
the 3-level page table format. Since there is no need for an 'ext'
argument to cpu_set_pte_ext(), this patch conditionally defines a
different prototype for this function when CONFIG_ARM_LPAE.
The patch also introduces the L_PGD_SWAPPER flag to mark pgd entries
pointing to pmd tables pre-allocated in the swapper_pg_dir and avoid
trying to free them at run-time. This flag is 0 with the classic page
table format.
Signed-off-by: Catalin Marinas <[email protected]>
|
|
This patch introduces the pgtable-3level*.h files with definitions
specific to the LPAE page table format (3 levels of page tables).
Each table is 4KB and has 512 64-bit entries. An entry can point to a
40-bit physical address. The young, write and exec software bits share
the corresponding hardware bits (negated). Other software bits use spare
bits in the PTE.
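The resulting geometry, sketched (the 4-entry first level is the
standard LPAE layout, assumed here rather than stated above):
  #define PTRS_PER_PTE    512     /* 4KB table of 64-bit entries */
  #define PTRS_PER_PMD    512
  #define PTRS_PER_PGD    4       /* LPAE first level (assumption) */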
The patch also changes some variable types from unsigned long or int to
pteval_t or pgprot_t.
Signed-off-by: Catalin Marinas <[email protected]>
|