path: root/arch
Age | Commit message | Author | Files | Lines
2019-04-30 | KVM: VMX: Nop emulation of MSR_IA32_POWER_CTL | Liran Alon | 3 | -0/+9
Since commits 668fffa3f838 ("kvm: better MWAIT emulation for guests") and 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts"), KVM was modified to allow an admin to configure certain guests to execute MONITOR/MWAIT inside the guest without being intercepted by the host. This is useful in case the admin wishes to allocate a dedicated logical processor for each vCPU thread, thus making it safe for the guest to completely control the power-state of the logical processor. The ability to use this new KVM capability was introduced to QEMU by commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|off") and 2266d4431132 ("i386/cpu: make -cpu host support monitor/mwait"). However, exposing MONITOR/MWAIT to a Linux guest may cause its intel_idle kernel module to execute c1e_promotion_disable(), which will attempt to RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable" bit. This behaviour was introduced by commit 32e9518005c8 ("intel_idle: export both C1 and C1E"). Because KVM doesn't emulate this MSR, running KVM with ignore_msrs=0 will cause the above guest behaviour to raise a #GP, which will cause the guest kernel to panic. Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to avoid a #GP in the guest in this scenario. Future commits can optimise emulation further by reflecting guest MSR changes to the host MSR, providing the guest with the ability to fine-tune the dedicated logical processor power-state. Reviewed-by: Boris Ostrovsky <[email protected]> Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
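For readers unfamiliar with KVM's MSR plumbing, the shape of such a "nop" handler is roughly the sketch below; the per-vCPU field name msr_ia32_power_ctl is illustrative rather than a guaranteed match for the actual patch:

    /* In the RDMSR handler (e.g. vmx_get_msr()): return what the guest last wrote. */
    case MSR_IA32_POWER_CTL:
            msr_info->data = vmx->msr_ia32_power_ctl;
            break;

    /* In the WRMSR handler (e.g. vmx_set_msr()): cache the value, never touch the host MSR. */
    case MSR_IA32_POWER_CTL:
            vmx->msr_ia32_power_ctl = data;
            break;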
2019-04-30 | KVM: x86: Add support of clear Trace_ToPA_PMI status | Luwei Kang | 3 | -1/+12
Let guests clear the Intel PT ToPA PMI status (bit 55 of MSR_CORE_PERF_GLOBAL_OVF_CTRL). Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-04-30 | KVM: x86: Inject PMI for KVM guest | Luwei Kang | 3 | -1/+19
Inject a PMI for KVM guest when Intel PT working in Host-Guest mode and Guest ToPA entry memory buffer was completely filled. Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-04-30 | Merge tag 'kvm-s390-next-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD | Paolo Bonzini | 8 | -8/+145
KVM: s390: Features and fixes for 5.2:
- VSIE crypto fixes
- new guest features for gen15
- disable halt polling for nested virtualization with overcommit
2019-04-30 | KVM: lapic: Check for in-kernel LAPIC before dereferencing apic pointer | Sean Christopherson | 2 | -4/+2
...to avoid dereferencing a null pointer when querying the per-vCPU timer advance. Fixes: 39497d7660d98 ("KVM: lapic: Track lapic timer advance per vCPU") Reported-by: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-04-30 | x86/kvm/mmu: reset MMU context when 32-bit guest switches PAE | Vitaly Kuznetsov | 2 | -0/+2
Commit 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") introduced a regression: 32-bit PAE guests stopped working. The issue appears to be: when guest switches (enables) PAE we need to re-initialize MMU context (set context->root_level, do reset_rsvds_bits_mask(), ...) but init_kvm_tdp_mmu() doesn't do that because we threw away is_pae(vcpu) flag from mmu role. Restore it to kvm_mmu_extended_role (as we now don't need it in base role) to fix the issue. Fixes: 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'") Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-04-30 | KVM: x86: Whitelist port 0x7e for pre-incrementing %rip | Sean Christopherson | 2 | -2/+20
KVM's recent bug fix to update %rip after emulating I/O broke userspace that relied on the previous behavior of incrementing %rip prior to exiting to userspace. When running a Windows XP guest on AMD hardware, Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself. Because KVM's old behavior was to increment %rip before exiting to userspace to handle the I/O, Qemu manually adjusted %rip to account for the OUT instruction. Arguably this is a userspace bug as KVM requires userspace to re-enter the kernel to complete instruction emulation before taking any other actions. That being said, this is a bit of a grey area and breaking userspace that has worked for many years is bad. Pre-increment %rip on OUT to port 0x7e before exiting to userspace to hack around the issue. Fixes: 45def77ebf79e ("KVM: x86: update %rip after emulating IO") Reported-by: Simon Becherer <[email protected]> Reported-and-tested-by: Iakov Karpov <[email protected]> Reported-by: Gabriele Balducci <[email protected]> Reported-by: Antti Antinoja <[email protected]> Cc: [email protected] Cc: Takashi Iwai <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-04-30 | xen/arm: Use p2m entry with lock protection | Hillf Danton | 1 | -1/+2
A new local variable is introduced for accessing p2m entry with lock protection. Signed-off-by: Hillf Danton <[email protected]> Signed-off-by: Stefano Stabellini <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]>
2019-04-30 | xen/arm: Free p2m entry if fail to add it to RB tree | Hillf Danton | 1 | -0/+1
Release the newly allocated p2m entry if we detect a duplicate in the RB tree. Signed-off-by: Hillf Danton <[email protected]> Signed-off-by: Stefano Stabellini <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]>
2019-04-30 | RISC-V: Implement nosmp commandline option | Atish Patra | 1 | -1/+11
The nosmp command line option sets max_cpus to zero, so no secondary harts will boot when it is enabled. However, the present cpu mask would still contain all possible CPUs. Fix the present cpu mask for the nosmp usecase. Signed-off-by: Atish Patra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
2019-04-30 | RISC-V: Add RISC-V specific arch_match_cpu_phys_id | Atish Patra | 2 | -2/+7
OF/DT core has a hook for architecture specific logical cpuid to hartid mapping. By implementing this, we can pass the logical cpu id to cpu node parsing functions. Fix the instances where logical cpuid is expected as an argument in of_get_cpu_node. Signed-off-by: Atish Patra <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Signed-off-by: Palmer Dabbelt <[email protected]>
2019-04-30 | x86/mm/mem_encrypt: Disable all instrumentation for early SME setup | Gary Hook | 1 | -0/+12
Enablement of AMD's Secure Memory Encryption feature is determined very early after start_kernel() is entered. Part of this procedure involves scanning the command line for the parameter 'mem_encrypt'. To determine intended state, the function sme_enable() uses library functions cmdline_find_option() and strncmp(). Their use occurs early enough such that it cannot be assumed that any instrumentation subsystem is initialized. For example, making calls to a KASAN-instrumented function before KASAN is set up will result in the use of uninitialized memory and a boot failure. When AMD's SME support is enabled, conditionally disable instrumentation of these dependent functions in lib/string.c and arch/x86/lib/cmdline.c. [ bp: Get rid of intermediary nostackp var and cleanup whitespace. ] Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Reported-by: Li RongQing <[email protected]> Signed-off-by: Gary R Hook <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Coly Li <[email protected]> Cc: "[email protected]" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: "[email protected]" <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: "[email protected]" <[email protected]> Cc: "[email protected]" <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: x86-ml <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-04-30 | ARM: dts: logicpd-som-lv: Fix MMC1 card detect | Adam Ford | 1 | -1/+1
The card detect pin was incorrectly using IRQ_TYPE_LEVEL_LOW instead of GPIO_ACTIVE_LOW when reading the state of the CD pin. This was previously fixed on the Torpedo, but missed on the SOM-LV. Fixes: 5cb8b0fa55a9 ("ARM: dts: Move most of logicpd-som-lv-37xx-devkit.dts to logicpd-som-lv-baseboard.dtsi") Signed-off-by: Adam Ford <[email protected]> Signed-off-by: Tony Lindgren <[email protected]>
2019-04-30 | clocksource/arm_arch_timer: Use arch_timer_read_counter to access stable counters | Marc Zyngier | 2 | -4/+26
Instead of always going via arch_counter_get_cntvct_stable to access the counter workaround, let's have arch_timer_read_counter point to the right method. For that, we need to track whether any CPU in the system has a workaround for the counter. This is done by having an atomic variable tracking this. Acked-by: Mark Rutland <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
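As a rough sketch of that scheme (the helper name and flag below are illustrative, not the exact upstream symbols), the driver can keep a single function pointer and retarget it once any CPU registers a counter workaround:

    static atomic_t counter_workaround_in_use = ATOMIC_INIT(0);

    /* All callers read the counter through this pointer. */
    u64 (*arch_timer_read_counter)(void) = arch_counter_get_cntvct;

    static void mark_counter_workaround_in_use(void)
    {
            /* Remember that at least one CPU needs the stable accessor... */
            atomic_set(&counter_workaround_in_use, 1);
            /* ...and route every subsequent read through it. */
            arch_timer_read_counter = arch_counter_get_cntvct_stable;
    }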
2019-04-30 | clocksource/arm_arch_timer: Remove use of workaround static key | Marc Zyngier | 1 | -4/+0
The use of a static key in a hotplug path has proved to be a real nightmare, and makes it impossible to have scream-free lockdep kernel. Let's remove the static key altogether, and focus on something saner. Acked-by: Mark Rutland <[email protected]> Acked-by: Daniel Lezcano <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | clocksource/arm_arch_timer: Drop use of static key in arch_timer_reg_read_stable | Marc Zyngier | 1 | -14/+28
Let's start with the removal of the arch_timer_read_ool_enabled static key in arch_timer_reg_read_stable. It is not a fast path, and we can simplify things a bit. Acked-by: Mark Rutland <[email protected]> Acked-by: Daniel Lezcano <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | clocksource/arm_arch_timer: Directly assign set_next_event workaround | Marc Zyngier | 2 | -0/+20
When a given timer is affected by an erratum and requires an alternative implementation of set_next_event, we do a rather complicated dance to detect and call the workaround on each set_next_event call. This is clearly idiotic, as we can perfectly detect whether this CPU requires a workaround while setting up the clock event device. This only requires the CPU-specific detection to be done a bit earlier, and we can then safely override the set_next_event pointer if we have a workaround associated with that CPU. Acked-by: Mark Rutland <[email protected]> Acked-by: Daniel Lezcano <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
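The pattern amounts to overriding the clockevent callback once, while the device is being set up; a hedged sketch follows (this_cpu_erratum() is a made-up placeholder for the per-CPU erratum lookup, and the field name set_next_event_virt may differ from the real structure):

    static void maybe_override_set_next_event(struct clock_event_device *clk)
    {
            const struct arch_timer_erratum_workaround *wa = this_cpu_erratum();

            /* If this CPU needs a workaround, install it once; the fast path
             * then calls it directly with no per-call checks. */
            if (wa && wa->set_next_event_virt)
                    clk->set_next_event = wa->set_next_event_virt;
    }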
2019-04-30 | arm64: Use arch_timer_read_counter instead of arch_counter_get_cntvct | Marc Zyngier | 1 | -2/+2
Only arch_timer_read_counter will guarantee that workarounds are applied. So let's use this one instead of arch_counter_get_cntvct. Acked-by: Mark Rutland <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | ARM: vdso: Remove dependency with the arch_timer driver internals | Marc Zyngier | 2 | -2/+5
The VDSO code uses the kernel helper that was originally designed to abstract the access between 32 and 64bit systems. It worked so far because this function is declared as 'inline'. As we're about to revamp that part of the code, the VDSO would break. Let's fix it by doing what should have been done from the start, a proper system register access. Reviewed-by: Mark Rutland <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: Apply ARM64_ERRATUM_1188873 to Neoverse-N1 | Marc Zyngier | 2 | -7/+17
Neoverse-N1 is also affected by ARM64_ERRATUM_1188873, so let's add it to the list of affected CPUs. Signed-off-by: Marc Zyngier <[email protected]> [will: Update silicon-errata.txt] Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: Add part number for Neoverse N1 | Marc Zyngier | 1 | -0/+2
New CPU, new part number. You know the drill. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT | Marc Zyngier | 1 | -0/+1
Since ARM64_ERRATUM_1188873 only affects AArch32 EL0, it makes some sense that it should depend on COMPAT. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: Restrict ARM64_ERRATUM_1188873 mitigation to AArch32 | Marc Zyngier | 1 | -2/+17
We currently deal with ARM64_ERRATUM_1188873 by always trapping EL0 accesses for both instruction sets. Although nothing wrong comes out of that, people trying to squeeze the last drop of performance from buggy HW find this over the top. Oh well. Let's change the mitigation by flipping the counter enable bit on return to userspace. Non-broken HW gets an extra branch on the fast path, which is hopefully not the end of the world. The arch timer workaround is also removed. Acked-by: Daniel Lezcano <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | powerpc/powernv/ioda: Handle failures correctly in pnv_pci_ioda_iommu_bypass_supported() | Alexey Kardashevskiy | 1 | -2/+2
When the return value type was changed from int to bool, a few places were left unchanged; this fixes them. We did not hit these failures as the first one is not happening at all, and the second one is only a little more likely to happen if the user switches a 33..58bit DMA capable device between the VFIO and vendor drivers, and there are not so many of these. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2019-04-30 | Merge branch 'topic/ppc-kvm' into next | Michael Ellerman | 22 | -1277/+1227
Merge our topic branch shared with KVM. In particular this includes the rewrite of the idle code into C.
2019-04-30 | powerpc/powernv/idle: Restore AMR/UAMOR/AMOR/IAMR after idle | Michael Ellerman | 1 | -6/+46
This is an implementation of commits 53a712bae5dd ("powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle") and a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR after idle") using the new C-based idle code. Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Extract from Nick's patch] Signed-off-by: Michael Ellerman <[email protected]>
2019-04-30 | powerpc/64s: Reimplement book3s idle code in C | Nicholas Piggin | 12 | -1218/+969
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation-specific HV idle code to the powernv platform code. Book3S assembly stubs are kept in common code and used only to save the stack frame and non-volatile GPRs before executing architected idle instructions, and to restore the stack and reload GPRs before returning to C after waking from idle. The complex logic dealing with threads and subcores, locking, SPRs, HMIs, timebase resync, etc., is all done in C, which makes it more maintainable. This is not a strict translation to C code; there are some significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs, but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs or change MSR is restored, because it's now simple to do. ESL=1 sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style rather than branchy. nap_state_lost is not required because KVM always returns via the NVGPR restoring path.
- KVM secondary wakeup from the offline sequence is moved entirely into the offline wakeup, which avoids a hwsync in the normal idle wakeup path.
Performance measured with context switch ping-pong on different threads or cores is possibly improved a small amount, 1-3% depending on stop state and core vs thread test for shallow states. For deep states it's in the noise compared with other latencies. KVM improvements:
- Idle sleepers now always return to caller rather than branch out to KVM first.
- This allows optimisations like very fast return to caller when no state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the normal host idle code and into the not-performance-critical offline code.
- KVM nap code now returns from where it is called, which makes the flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> [mpe: Squash the KVM changes in] Signed-off-by: Michael Ellerman <[email protected]>
2019-04-30 | arm64: mm: Remove pte_unmap_nested() | Qian Cai | 1 | -2/+0
As of commit ece0e2b6406a ("mm: remove pte_*map_nested()"), pte_unmap_nested() is no longer used and can be removed from the arm64 code. Signed-off-by: Qian Cai <[email protected]> [will: also remove pte_offset_map_nested()] Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable | Qian Cai | 1 | -1/+2
When building with -Wunused-but-set-variable, the compiler shouts about a number of pte_unmap() users, since this expands to an empty macro on arm64:
  mm/gup.c: In function 'gup_pte_range':
  mm/gup.c:1727:16: warning: variable 'ptem' set but not used [-Wunused-but-set-variable]
  mm/gup.c: At top level:
  mm/memory.c: In function 'copy_pte_range':
  mm/memory.c:821:24: warning: variable 'orig_dst_pte' set but not used [-Wunused-but-set-variable]
  mm/memory.c:821:9: warning: variable 'orig_src_pte' set but not used [-Wunused-but-set-variable]
  mm/swap_state.c: In function 'swap_ra_info':
  mm/swap_state.c:641:15: warning: variable 'orig_pte' set but not used [-Wunused-but-set-variable]
  mm/madvise.c: In function 'madvise_free_pte_range':
  mm/madvise.c:318:9: warning: variable 'orig_pte' set but not used [-Wunused-but-set-variable]
Rewrite pte_unmap() as a static inline function, which silences the warnings. Signed-off-by: Qian Cai <[email protected]> Signed-off-by: Will Deacon <[email protected]>
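A sketch of the change: an empty static inline still constitutes a use of its argument, unlike an empty macro, so the warning goes away without changing behaviour.

    /* Before (simplified): expanding to nothing makes the pte variable look unused. */
    /* #define pte_unmap(pte)      do { } while (0) */

    /* After: the call is a genuine use of the variable, and the compiler
     * still inlines it away to nothing. */
    static inline void pte_unmap(pte_t *pte)
    {
    }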
2019-04-30 | x86/alternatives: Add comment about module removal races | Nadav Amit | 1 | -0/+5
Add a comment to clarify that users of text_poke() must ensure that no races with module removal take place. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/kprobes: Use vmalloc special flag | Rick Edgecombe | 1 | -6/+1
Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special permissioned memory in vmalloc and remove places where memory was set NX and RW before freeing which is no longer needed. Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/ftrace: Use vmalloc special flag | Rick Edgecombe | 1 | -8/+6
Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special permissioned memory in vmalloc and remove places where memory was set NX and RW before freeing which is no longer needed. Tested-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | mm/hibernation: Make hibernation handle unmapped pages | Rick Edgecombe | 1 | -4/+0
Make hibernate handle unmapped pages on the direct map when CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages to invalid configurations, so now hibernate should check if the pages have valid mappings and handle if they are unmapped when doing a hibernate save operation. Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y was configured. It does not appear to have a big hibernating performance impact. The speed of the saving operation before this change was measured as 819.02 MB/s, and after was measured at 813.32 MB/s. Before: [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s) After: [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s) Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Pavel Machek <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/mm/cpa: Add set_direct_map_*() functions | Rick Edgecombe | 4 | -3/+19
Add two new functions set_direct_map_default_noflush() and set_direct_map_invalid_noflush() for setting the direct map alias for the page to its default valid permissions and to an invalid state that cannot be cached in a TLB, respectively. These functions do not flush the TLB. Note, __kernel_map_pages() does something similar but flushes the TLB and doesn't reset the permission bits to default on all architectures. Also add an ARCH config ARCH_HAS_SET_DIRECT_MAP for specifying whether these have an actual implementation or a default empty one. Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
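The new interface boils down to two per-page calls; a sketch of the prototypes with the semantics spelled out in comments (the exact parameter lists may differ from the upstream headers):

    /* Reset the direct-map alias of @page to its default, valid permissions.
     * Does NOT flush the TLB. */
    int set_direct_map_default_noflush(struct page *page);

    /* Make the direct-map alias of @page invalid, so it cannot be cached in
     * the TLB. Does NOT flush the TLB either. */
    int set_direct_map_invalid_noflush(struct page *page);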
2019-04-30 | x86/alternatives: Remove the return value of text_poke_*() | Nadav Amit | 2 | -9/+6
The return value of text_poke_early() and text_poke_bp() is useless. Remove it. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/jump-label: Remove support for custom text poker | Nadav Amit | 1 | -16/+10
There are only two types of text poking: early and breakpoint based. The use of a function pointer to perform text poking complicates the code and is probably inefficient due to the use of indirect branches. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/modules: Avoid breaking W^X while loading modules | Nadav Amit | 2 | -8/+22
When modules and BPF filters are loaded, there is a time window in which some memory is both writable and executable. An attacker that has already found another vulnerability (e.g., a dangling pointer) might be able to exploit this behavior to overwrite kernel code. Prevent having writable executable PTEs in this stage. In addition, avoiding having W+X mappings can also slightly simplify the patching of modules code on initialization (e.g., by alternatives and static-key), as would be done in the next patch. This was actually the main motivation for this patch. To avoid having W+X mappings, set them initially as RW (NX) and after they are set as RO set them as X as well. Setting them as executable is done as a separate step to avoid one core in which the old PTE is cached (hence writable), and another which sees the updated PTE (executable), which would break the W^X protection. Suggested-by: Thomas Gleixner <[email protected]> Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
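A minimal sketch of the ordering this implies, using the kernel's set_memory_*() helpers; the surrounding module-loading flow is heavily simplified and the wrapper function is purely illustrative:

    static void *load_text(const void *code, size_t size, int npages)
    {
            void *p = module_alloc(size);       /* allocated RW + NX */

            if (!p)
                    return NULL;

            memcpy(p, code, size);              /* write while still non-executable */

            set_memory_ro((unsigned long)p, npages);  /* step 1: drop write access      */
            set_memory_x((unsigned long)p, npages);   /* step 2: only now allow execute */
            return p;
    }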
2019-04-30 | x86/kprobes: Set instruction page as executable | Nadav Amit | 1 | -4/+20
Set the page as executable after allocation. This patch is a preparatory patch for a following patch that makes module allocated pages non-executable. While at it, do some small cleanup of what appears to be unnecessary masking. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/ftrace: Set trampoline pages as executable | Nadav Amit | 1 | -0/+8
Since alloc_module() will not set the pages as executable soon, set ftrace trampoline pages as executable after they are allocated. For the time being, do not change ftrace to use the text_poke() interface. As a result, ftrace still breaks W^X. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/kgdb: Avoid redundant comparison of patched code | Nadav Amit | 1 | -13/+1
text_poke() already ensures that the written value is the correct one and fails if that is not the case. There is no need for an additional comparison. Remove it. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/alternatives: Use temporary mm for text poking | Nadav Amit | 3 | -26/+86
text_poke() can potentially compromise security as it sets temporary PTEs in the fixmap. These PTEs might be used to rewrite the kernel code from other cores accidentally or maliciously, if an attacker gains the ability to write onto kernel memory. Moreover, since remote TLBs are not flushed after the temporary PTEs are removed, the time-window in which the code is writable is not limited if the fixmap PTEs - maliciously or accidentally - are cached in the TLB. To address these potential security hazards, use a temporary mm for patching the code. Finally, text_poke() is also not conservative enough when mapping pages, as it always tries to map 2 pages, even when a single one is sufficient. So try to be more conservative, and do not map more than needed. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/alternatives: Initialize temporary mm for patching | Nadav Amit | 4 | -0/+45
To prevent improper use of the PTEs that are used for text patching, the next patches will use a temporary mm struct. Initialize it by copying the init mm. The address that will be used for patching is taken from the lower area that is usually used for the task memory. Doing so prevents the need to frequently synchronize the temporary-mm (e.g., when BPF programs are installed), since different PGDs are used for the task memory. Finally, randomize the address of the PTEs to harden against exploits that use these PTEs. Suggested-by: Andy Lutomirski <[email protected]> Tested-by: Masami Hiramatsu <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/mm: Save debug registers when loading a temporary mm | Nadav Amit | 1 | -0/+23
Prevent user watchpoints from mistakenly firing while the temporary mm is being used. As the addresses of the temporary mm might overlap those of the user-process, this is necessary to prevent wrong signals or worse things from happening. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | x86/mm: Introduce temporary mm structs | Andy Lutomirski | 1 | -0/+33
Using a dedicated page-table for temporary PTEs prevents other cores from using - even speculatively - these PTEs, thereby providing two benefits: (1) Security hardening: an attacker that gains kernel memory writing abilities cannot easily overwrite sensitive data. (2) Avoiding TLB shootdowns: the PTEs do not need to be flushed in remote page-tables. To do so a temporary mm_struct can be used. Mappings which are private for this mm can be set in the userspace part of the address-space. During the whole time in which the temporary mm is loaded, interrupts must be disabled. The first use-case for temporary mm struct, which will follow, is for poking the kernel text. [ Commit message was written by Nadav Amit ] Tested-by: Masami Hiramatsu <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
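A sketch of what such a temporary-mm switch looks like; the names and fields follow the description above and the x86 TLB state, but treat them as an approximation rather than the exact upstream API:

    typedef struct {
            struct mm_struct *prev;
    } temp_mm_state_t;

    static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
    {
            temp_mm_state_t temp_state;

            lockdep_assert_irqs_disabled();  /* interrupts must stay off throughout */
            temp_state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
            switch_mm_irqs_off(NULL, mm, current);   /* load the private page tables */
            return temp_state;
    }

    static inline void unuse_temporary_mm(temp_mm_state_t prev)
    {
            lockdep_assert_irqs_disabled();
            switch_mm_irqs_off(NULL, prev.prev, current);  /* restore the old mm */
    }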
2019-04-30 | x86/jump_label: Use text_poke_early() during early init | Nadav Amit | 1 | -1/+6
There is no apparent reason not to use text_poke_early() during early-init, since no patching of code that might be on the stack is done and only a single core is running. This is required for the next patches that would set a temporary mm for text poking, and this mm is only initialized after some static-keys are enabled/disabled. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-30 | mm/tlb: Provide default nmi_uaccess_okay() | Nadav Amit | 1 | -0/+2
x86 has an nmi_uaccess_okay(), but other architectures do not. Arch-independent code might need to know whether access to user addresses is ok in an NMI context or in other code whose execution context is unknown. Specifically, this function is needed for bpf_probe_write_user(). Add a default implementation of nmi_uaccess_okay() for architectures that do not have such a function. Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
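The fallback is tiny; a sketch of the generic default (the upstream define may live in a different header, but the idea is just this):

    /* Architectures that care (x86) provide their own nmi_uaccess_okay();
     * everyone else gets a default that always permits user access. */
    #ifndef nmi_uaccess_okay
    # define nmi_uaccess_okay() true
    #endif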
2019-04-30 | x86/alternatives: Add text_poke_kgdb() to not assert the lock when debugging | Nadav Amit | 3 | -19/+45
text_mutex is currently expected to be held before text_poke() is called, but kgdb does not take the mutex, and instead *supposedly* ensures the lock is not taken and will not be acquired by any other core while text_poke() is running. The reason for the "supposedly" comment is that it is not entirely clear that this would be the case if gdb_do_roundup is zero. Create two wrapper functions, text_poke() and text_poke_kgdb(), which do or do not run the lockdep assertion respectively. While we are at it, change the return code of text_poke() to something meaningful. One day, callers might actually respect it and the existing BUG_ON() when patching fails could be removed. For kgdb, the return value can actually be used. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Acked-by: Jiri Kosina <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
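A sketch of the two wrappers around an assumed common helper __text_poke(); only the regular entry point asserts the lock, since kgdb synchronizes the other CPUs by itself:

    void *text_poke(void *addr, const void *opcode, size_t len)
    {
            lockdep_assert_held(&text_mutex);   /* ordinary callers must hold text_mutex */
            return __text_poke(addr, opcode, len);
    }

    void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
    {
            /* kgdb parks the other CPUs, so no lockdep assertion here. */
            return __text_poke(addr, opcode, len);
    }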
2019-04-30 | arm64: compat: Reduce address limit for 64K pages | Vincenzo Frascino | 1 | -1/+1
With the introduction of the config option that allows kuser helpers to be enabled or disabled, it is now possible to reduce TASK_SIZE_32 when they are disabled and 64K pages are enabled. This extends the compliance with section 6.5.8 of the C standard (C99). Acked-by: Catalin Marinas <[email protected]> Reported-by: Catalin Marinas <[email protected]> Signed-off-by: Vincenzo Frascino <[email protected]> Signed-off-by: Will Deacon <[email protected]>
2019-04-30 | arm64: arch_timer: Ensure counter register reads occur with seqlock held | Will Deacon | 2 | -6/+42
When executing clock_gettime(), either in the vDSO or via a system call, we need to ensure that the read of the counter register occurs within the seqlock reader critical section. This ensures that updates to the clocksource parameters (e.g. the multiplier) are consistent with the counter value and therefore avoids the situation where time appears to go backwards across multiple reads. Extend the vDSO logic so that the seqlock critical section covers the read of the counter register as well as accesses to the data page. Since reads of the counter system registers are not ordered by memory barrier instructions, introduce dependency ordering from the counter read to a subsequent memory access so that the seqlock memory barriers apply to the counter access in both the vDSO and the system call paths. Cc: <[email protected]> Cc: Marc Zyngier <[email protected]> Tested-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/linux-arm-kernel/[email protected]/ Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Will Deacon <[email protected]>
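To make the requirement concrete, here is a hedged sketch of a seqlock-style reader in which the counter read sits inside the retry loop; the vdso_data fields and get_counter() are illustrative names, not the actual vDSO layout:

    static u64 read_time_ns(const struct vdso_data *vd)
    {
            u32 seq;
            u64 cycles, ns;

            do {
                    seq = READ_ONCE(vd->seq_count);
                    smp_rmb();
                    cycles = get_counter();     /* counter read inside the critical section */
                    ns = vd->base_ns +
                         ((cycles - vd->cycle_last) * vd->mult >> vd->shift);
                    smp_rmb();
            } while ((seq & 1) || seq != READ_ONCE(vd->seq_count));

            return ns;
    }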
2019-04-30 | KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device close | Paul Mackerras | 1 | -0/+15
This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device is closed, KVM will not try to enable or disable any of the escalation interrupts for the VCPUs. We don't have to worry about races between clearing the pointers and use of the pointers by the XIVE context push/pull code, because the callers hold the vcpu->mutex, which is also taken by the KVM_RUN code. Therefore the vcpu cannot be entering or exiting the guest concurrently. Signed-off-by: Paul Mackerras <[email protected]>