path: root/arch/x86/include
Age | Commit message | Author | Files | Lines
2022-04-29  KVM: x86: Clean up and document nested #PF workaround  (Sean Christopherson, 1 file, -0/+2)
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts  (Babu Moger, 1 file, -1/+1)
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID instructions. The TSC_AUX value is set using the WRMSR instruction to the TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID instructions. If the read/write of the TSC_AUX MSR is intercepted, then RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization is present. However, the RDPID instruction can't be intercepted. This means that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR read/write must not be intercepted for SEV-ES (or SEV-SNP) guests. Signed-off-by: Babu Moger <[email protected]> Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
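For reference, TSC_AUX is what RDTSCP hands back in ECX alongside the timestamp. A minimal user-space sketch (assuming Linux's convention of programming TSC_AUX with (node << 12) | cpu):

  #include <stdio.h>
  #include <stdint.h>

  /* RDTSCP returns the 64-bit TSC in EDX:EAX and the TSC_AUX MSR
   * (0xC0000103) in ECX, all in a single instruction. */
  static uint64_t rdtscp(uint32_t *aux)
  {
          uint32_t lo, hi;
          asm volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(*aux));
          return ((uint64_t)hi << 32) | lo;
  }

  int main(void)
  {
          uint32_t aux;
          uint64_t tsc = rdtscp(&aux);

          /* Low 12 bits identify the logical CPU the read ran on. */
          printf("tsc=%llu cpu=%u\n", (unsigned long long)tsc, aux & 0xfff);
          return 0;
  }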
2022-04-29  x86/cpufeatures: Add virtual TSC_AUX feature bit  (Babu Moger, 1 file, -0/+1)
The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use the TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID instructions. The TSC_AUX MSR is typically initialized to the APIC ID or another unique identifier so that software can quickly associate a returned TSC value with the logical processor. Add the feature bit and also include it in KVM's supported feature set so guests can detect it. Signed-off-by: Babu Moger <[email protected]> Acked-by: Borislav Petkov <[email protected]> Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29Revert "x86/mm: Introduce lookup_address_in_mm()"Sean Christopherson1-4/+0
Drop lookup_address_in_mm() now that KVM is providing it's own variant of lookup_address_in_pgd() that is safe for use with user addresses, e.g. guards against page tables being torn down. A variant that provides a non-init mm is inherently dangerous and flawed, as the only reason to use an mm other than init_mm is to walk a userspace mapping, and lookup_address_in_pgd() does not play nice with userspace mappings, e.g. doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE() to ensure an upper level entry isn't converted to a huge page between checking the PAGE_SIZE bit and grabbing the address of the next level down. This reverts commit 13c72c060f1ba6f4eddd7b1c4f52a8aded43d6d9. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <YmwIi3bXr/1yhYV/@google.com> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-28  x86/mm: enable ARCH_HAS_VM_GET_PAGE_PROT  (Christoph Hellwig, 2 files, -19/+0)
This defines and exports a platform-specific custom vm_get_page_prot() by selecting ARCH_HAS_VM_GET_PAGE_PROT. It also stops selecting ARCH_HAS_FILTER_PGPROT, after dropping arch_filter_pgprot() and arch_vm_get_page_prot(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Anshuman Khandual <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-04-27  amd_hsmp: Add HSMP protocol version 5 messages  (Suma Hegde, 1 file, -5/+109)
HSMP protocol version 5 is supported on AMD family 19h model 10h EPYC processors. This version brings new features such as:
 - DIMM statistics
 - Bandwidth for IO and xGMI links
 - Monitor socket and core frequency limits
 - Configure power efficiency modes, DF pstate range, etc.
Signed-off-by: Suma Hegde <[email protected]> Reviewed-by: Carlos Bilbao <[email protected]> Signed-off-by: Naveen Krishna Chatradhi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2022-04-27  x86/aperfmperf: Make parts of the frequency invariance code unconditional  (Thomas Gleixner, 2 files, -4/+2)
The frequency invariance support is currently limited to x86/64 and SMP, which is the vast majority of machines. arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF and MPERF MSRs. The CPU frequency getter functions do the same via dedicated IPIs. While it could be argued that on systems where frequency invariance support is disabled (32bit, !SMP) the per-tick read of the APERF and MPERF MSRs can be avoided, it does not make sense to keep the extra code and the resulting runtime issues of mass IPIs around. As a first step, split out the initialization code that is not specific to frequency invariance and the MSR-read portion of arch_scale_freq_tick(). The rest of the code is still conditional and guarded with a static key. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
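The arithmetic behind both the tick hook and the IPI-based getters is a plain ratio: average effective frequency over a window equals the base frequency times delta-APERF over delta-MPERF. A hedged sketch (read_msr() is a placeholder for the kernel's rdmsrl(), not actual kernel code):

  #include <stdint.h>

  #define MSR_IA32_MPERF 0xE7
  #define MSR_IA32_APERF 0xE8

  /* Hypothetical MSR read helper; the kernel uses rdmsrl(). */
  extern uint64_t read_msr(uint32_t msr);

  /* APERF counts at the actual clock, MPERF at the guaranteed base
   * clock, so the ratio of the deltas yields the average effective
   * frequency since the last sample. */
  static uint64_t effective_khz(uint64_t base_khz,
                                uint64_t *last_aperf, uint64_t *last_mperf)
  {
          uint64_t aperf = read_msr(MSR_IA32_APERF);
          uint64_t mperf = read_msr(MSR_IA32_MPERF);
          uint64_t acnt = aperf - *last_aperf;
          uint64_t mcnt = mperf - *last_mperf;

          *last_aperf = aperf;
          *last_mperf = mperf;

          if (!mcnt)      /* no reference cycles elapsed */
                  return 0;
          return base_khz * acnt / mcnt;
  }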
2022-04-27  x86/aperfmperf: Untangle Intel and AMD frequency invariance init  (Thomas Gleixner, 1 file, -9/+4)
AMD boot CPU initialization happens late via ACPI/CPPC, which prevents the Intel parts from being marked __init. Split out the common code, provide a dedicated interface for the AMD initialization, and mark the Intel-specific code and data __init. The remaining text size is almost cut in half:
  text:      2614 -> 1350
  init.text:    0 ->  786
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/aperfmperf: Separate AP/BP frequency invariance init  (Thomas Gleixner, 1 file, -7/+5)
This code is convoluted and, because it can be invoked post-init via the ACPI/CPPC code, all of the initialization functionality is built in instead of being part of init text and init data. As a first step, create separate calls for the boot and the application processors. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/split-lock: Remove unused TIF_SLD bit  (Tony Luck, 2 files, -5/+1)
Changes to the "warn" mode of split lock handling mean that TIF_SLD is never set. Remove the bit, and the functions that use it. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/pm: Fix false positive kmemleak report in msr_build_context()  (Matthieu Baerts, 2 files, -5/+9)
Since e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume") kmemleak reports this issue:

  unreferenced object 0xffff888009cedc00 (size 256):
    comm "swapper/0", pid 1, jiffies 4294693823 (age 73.764s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 48 00 00 00 00 00 00 00  ........H.......
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      msr_build_context (include/linux/slab.h:621)
      pm_check_save_msr (arch/x86/power/cpu.c:520)
      do_one_initcall (init/main.c:1298)
      kernel_init_freeable (init/main.c:1370)
      kernel_init (init/main.c:1504)
      ret_from_fork (arch/x86/entry/entry_64.S:304)

Reproducer:
 - boot the VM with a debug kernel config (see https://github.com/multipath-tcp/mptcp_net-next/issues/268)
 - wait ~1 minute
 - start a kmemleak scan

The root cause here is alignment within the packed struct saved_context (from suspend_64.h). Kmemleak only searches for pointers that are aligned (see how pointers are scanned in kmemleak.c), but pahole shows that the saved_msrs struct member and all members after it in the structure are unaligned:

  struct saved_context {
          struct pt_regs             regs;                 /*     0   168 */
          /* --- cacheline 2 boundary (128 bytes) was 40 bytes ago --- */
          u16                        ds;                   /*   168     2 */
          ...
          u64                        misc_enable;          /*   232     8 */
          bool                       misc_enable_saved;    /*   240     1 */

          /* Note below odd offset values for the remainder of this struct */

          struct saved_msrs          saved_msrs;           /*   241    16 */
          /* --- cacheline 4 boundary (256 bytes) was 1 bytes ago --- */
          long unsigned int          efer;                 /*   257     8 */
          u16                        gdt_pad;              /*   265     2 */
          struct desc_ptr            gdt_desc;             /*   267    10 */
          u16                        idt_pad;              /*   277     2 */
          struct desc_ptr            idt;                  /*   279    10 */
          u16                        ldt;                  /*   289     2 */
          u16                        tss;                  /*   291     2 */
          long unsigned int          tr;                   /*   293     8 */
          long unsigned int          safety;               /*   301     8 */
          long unsigned int          return_address;       /*   309     8 */

          /* size: 317, cachelines: 5, members: 25 */
          /* last cacheline: 61 bytes */
  } __attribute__((__packed__));

Move misc_enable_saved to the end of the struct declaration so that saved_msrs fits in before the cacheline 4 boundary. The comment above the saved_context declaration says to fix the wakeup_64.S file and __save/__restore_processor_state() if the struct is modified: it looks like all the accesses in wakeup_64.S are done through offsets which are computed at build time. Update that comment accordingly. In the end, the false positive kmemleak report is due to a limitation of kmemleak, but it is always good to avoid unaligned members for optimisation purposes. Please note that it looks like this issue is not new, e.g. https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ [ bp: Massage + cleanup commit message. ] Fixes: 7a9c2dd08ead ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume") Suggested-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected]
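The pitfall generalizes: in a packed struct, any member placed after an odd-sized one lands at an odd offset, and an aligned word-by-word scanner such as kmemleak never sees a pointer stored there. A standalone illustration (hypothetical fields, not the kernel struct):

  #include <stdio.h>
  #include <stddef.h>

  /* The bool ahead of the pointer pushes it to offset 9: invisible
   * to a scanner that only looks at 8-byte-aligned words. */
  struct bad_layout {
          unsigned long long misc;        /* offset 0 */
          _Bool saved;                    /* offset 8 */
          void *msr_buffer;               /* offset 9, unaligned */
  } __attribute__((packed));

  /* Moving the bool to the end keeps the pointer naturally aligned. */
  struct good_layout {
          unsigned long long misc;        /* offset 0 */
          void *msr_buffer;               /* offset 8, aligned */
          _Bool saved;                    /* offset 16 */
  } __attribute__((packed));

  int main(void)
  {
          printf("bad:  msr_buffer at offset %zu\n",
                 offsetof(struct bad_layout, msr_buffer));
          printf("good: msr_buffer at offset %zu\n",
                 offsetof(struct good_layout, msr_buffer));
          return 0;
  }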
2022-04-27  x86/sev: Get the AP jump table address from secrets page  (Brijesh Singh, 1 file, -0/+35)
The GHCB specification section 2.7 states that when SEV-SNP is enabled, a guest should not rely on the hypervisor to provide the address of the AP jump table. Instead, if a guest BIOS wants to provide an AP jump table, it should record the address in the SNP secrets page so the guest operating system can obtain it directly from there. Fix this on the guest kernel side by having SNP guests use the AP jump table address published in the secrets page rather than issuing a GHCB request to get it. [ mroth: - Improve error handling when ioremap()/memremap() return NULL - Don't mix function calls with declarations - Add missing __init - Tweak commit message ] Fixes: 0afb6b660a6b ("x86/sev: Use SEV-SNP AP creation to start secondary CPUs") Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-26  asm-generic: compat: Cleanup duplicate definitions  (Guo Ren, 1 file, -68/+12)
There are 7 64-bit architectures that support Linux COMPAT mode to run 32-bit applications. A lot of definitions are duplicated:
 - COMPAT_USER_HZ
 - COMPAT_RLIM_INFINITY
 - COMPAT_OFF_T_MAX
 - __compat_uid_t, __compat_gid_t
 - compat_dev_t
 - compat_ipc_pid_t
 - struct compat_flock
 - struct compat_flock64
 - struct compat_statfs
 - struct compat_ipc64_perm, compat_semid64_ds, compat_msqid64_ds, compat_shmid64_ds
Clean up the duplicate definitions and merge them into asm-generic. Signed-off-by: Guo Ren <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-26  fs: stat: compat: Add __ARCH_WANT_COMPAT_STAT  (Guo Ren, 1 file, -0/+1)
RISC-V doesn't need compat_stat, so use __ARCH_WANT_COMPAT_STAT to exclude the unnecessary SYSCALL functions. Signed-off-by: Guo Ren <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-26  compat: consolidate the compat_flock{,64} definition  (Christoph Hellwig, 1 file, -17/+3)
Provide a single common definition for the compat_flock and compat_flock64 structures using the same tricks as for the native variants. Another extra define is added for the packing required on x86. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-26  uapi: always define F_GETLK64/F_SETLK64/F_SETLKW64 in fcntl.h  (Christoph Hellwig, 1 file, -4/+0)
The F_GETLK64/F_SETLK64/F_SETLKW64 fcntl opcodes are only implemented for the 32-bit syscall APIs, but are also needed for compat handling on 64-bit kernels. Consolidate them in fcntl.h instead of defining the internal compat definitions in compat.h, which is rather error prone (e.g. parisc gets the values wrong currently). Note that before this change they were never visible to userspace due to the fact that CONFIG_64BIT is only set for kernel builds. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-25  x86/fpu/xsave: Support XSAVEC in the kernel  (Thomas Gleixner, 1 file, -1/+1)
XSAVEC is the user space counterpart of XSAVES; unlike XSAVES, it cannot save supervisor state. In virtualization scenarios the hypervisor may expose XSAVEC but not XSAVES to the guest, yet the kernel so far made no use of XSAVEC. That's unfortunate because XSAVEC uses the compacted format of saving the XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as it does not create holes for XSTATE components which are not supported or enabled by the kernel but are available in hardware. There is room for further optimizations when XSAVEC/S and XGETBV1 are supported. In order to support XSAVEC:
 - Define the XSAVEC ASM macro, as it's not yet supported by the required minimal toolchain.
 - Create a software-defined X86_FEATURE_XCOMPACTED to select the compacted XSTATE buffer format for both XSAVEC and XSAVES.
 - Make XSAVEC an option in the 'XSAVE' ASM alternatives.
Requested-by: Andrew Cooper <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
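XSAVEC enumeration lives in CPUID leaf 0xD, sub-leaf 1: EAX bit 0 is XSAVEOPT, bit 1 XSAVEC, bit 2 XGETBV1, bit 3 XSAVES. A minimal user-space detection sketch:

  #include <stdio.h>
  #include <cpuid.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          /* CPUID.(EAX=0DH, ECX=1):EAX enumerates the XSAVE extensions. */
          if (!__get_cpuid_count(0x0d, 1, &eax, &ebx, &ecx, &edx))
                  return 1;

          printf("XSAVEOPT: %s\n", (eax & (1u << 0)) ? "yes" : "no");
          printf("XSAVEC:   %s\n", (eax & (1u << 1)) ? "yes" : "no");
          printf("XGETBV1:  %s\n", (eax & (1u << 2)) ? "yes" : "no");
          printf("XSAVES:   %s\n", (eax & (1u << 3)) ? "yes" : "no");
          return 0;
  }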
2022-04-22  objtool: Make jump label hack optional  (Josh Poimboeuf, 1 file, -3/+3)
Objtool secretly does a jump label hack to overcome the limitations of the toolchain. Make the hack explicit (and optional for other arches) by turning it into a cmdline option and kernel config option. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Miroslav Benes <[email protected]> Link: https://lkml.kernel.org/r/3bdcbfdd27ecb01ddec13c04bdf756a583b13d24.1650300597.git.jpoimboe@redhat.com
2022-04-22  objtool: Add CONFIG_OBJTOOL  (Josh Poimboeuf, 1 file, -3/+3)
Now that stack validation is an optional feature of objtool, add CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with it. CONFIG_STACK_VALIDATION can now be considered to be frame-pointer specific. CONFIG_UNWINDER_ORC is already inherently valid for live patching, so no need to "validate" it. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Miroslav Benes <[email protected]> Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com
2022-04-21  KVM: SEV: add cache flush to solve SEV cache incoherency issues  (Mingwei Zhang, 2 files, -0/+2)
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines is released by the VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remains pinned until the end of the VM's lifetime, and therefore does not flush caches at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicious userspace can crash the host kernel by creating a malicious VM that continuously allocates and releases unpinned confidential memory pages while the VM is running. Add cache flush operations to the mmu_notifier operations to ensure that any physical memory leaving the guest VM gets flushed. In particular, hook the mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush caches accordingly. The hooks run after releasing the mmu lock to avoid contention with other vCPUs. Cc: [email protected] Suggested-by: Sean Christopherson <[email protected]> Reported-by: Mingwei Zhang <[email protected]> Signed-off-by: Mingwei Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
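The shape of the fix, in rough outline (a sketch only; sev_guest() and wbinvd_on_all_cpus() are existing kernel helpers, but the exact hook wiring in the patch may differ):

  /* Called from the mmu_notifier invalidate/release paths once the
   * mmu lock has been dropped. */
  static void kvm_flush_reclaimed_memory(struct kvm *kvm)
  {
          if (!sev_guest(kvm))
                  return;

          /* There is no architectural way to flush only the affected
           * C-bit cachelines here, so write back and invalidate
           * everything on every CPU. */
          wbinvd_on_all_cpus();
  }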
2022-04-21  virt: sevguest: Change driver name to reflect generic SEV support  (Tom Lendacky, 1 file, -1/+1)
During patch review, it was decided the SNP guest driver name should not be SEV-SNP specific, but should be generic for use with anything SEV. However, this feedback was missed and the driver name, and many of the driver functions and structures, are SEV-SNP name specific. Rename the driver to "sev-guest" (to match the misc device that is created) and update some of the function and structure names, too. While in the file, adjust the one pr_err() message to be a dev_err() message so that the message, if issued, uses the driver name. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/307710bb5515c9088a19fd0b930268c7300479b2.1650464054.git.thomas.lendacky@amd.com
2022-04-19  x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline  (Josh Poimboeuf, 1 file, -0/+1)
The static call trampoline is never indirect-branched to, but is referenced by the static call key. Add ANNOTATE_NOENDBR. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/1b5b54aad7d81241dabe5e0c9b40dea64b540b00.1650300597.git.jpoimboe@redhat.com
2022-04-19  x86/cpu: Load microcode during restore_processor_state()  (Borislav Petkov, 1 file, -0/+2)
When resuming from system sleep state, restore_processor_state() restores the boot CPU MSRs. These MSRs could be emulated by microcode. If microcode is not loaded yet, writing to emulated MSRs leads to an unchecked MSR access error:

  ...
  PM: Calling lapic_suspend+0x0/0x210
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0)
  at rIP: ... (native_write_msr)
  Call Trace:
  <TASK>
  ? restore_processor_state
  x86_acpi_suspend_lowlevel
  acpi_suspend_enter
  suspend_devices_and_enter
  pm_suspend.cold
  state_store
  kobj_attr_store
  sysfs_kf_write
  kernfs_fop_write_iter
  new_sync_write
  vfs_write
  ksys_write
  __x64_sys_write
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  RIP: 0033:0x7fda13c260a7

To ensure microcode-emulated MSRs are available for restoration, load the microcode on the boot CPU before restoring these MSRs. [ Pawan: write commit message and productize it. ] Fixes: e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume") Reported-by: Kyle D. Pelton <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Tested-by: Kyle D. Pelton <[email protected]> Cc: [email protected] Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841 Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com
2022-04-19  Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux  (Rafael J. Wysocki, 1 file, -0/+1)
Pull turbostat changes for 5.19 from Len Brown:

 "Chen Yu (1):
    tools/power turbostat: Support thermal throttle count print

  Dan Merillat (1):
    tools/power turbostat: fix dump for AMD cpus

  Len Brown (5):
    tools/power turbostat: tweak --show and --hide capability
    tools/power turbostat: fix ICX DRAM power numbers
    tools/power turbostat: be more useful as non-root
    tools/power turbostat: No build warnings with -Wextra
    tools/power turbostat: version 2022.04.16

  Sumeet Pawnikar (2):
    tools/power turbostat: Add Power Limit4 support
    tools/power turbostat: print power values upto three decimal

  Zephaniah E. Loss-Cutler-Hull (2):
    tools/power turbostat: Allow -e for all names.
    tools/power turbostat: Allow printing header every N iterations"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: version 2022.04.16
  tools/power turbostat: No build warnings with -Wextra
  tools/power turbostat: be more useful as non-root
  tools/power turbostat: fix ICX DRAM power numbers
  tools/power turbostat: Support thermal throttle count print
  tools/power turbostat: Allow printing header every N iterations
  tools/power turbostat: Allow -e for all names.
  tools/power turbostat: print power values upto three decimal
  tools/power turbostat: Add Power Limit4 support
  tools/power turbostat: fix dump for AMD cpus
  tools/power turbostat: tweak --show and --hide capability
2022-04-19  x86/cpu: Add new Alderlake and Raptorlake CPU model numbers  (Tony Luck, 1 file, -0/+3)
Intel is subdividing the mobile segment with additional models with the same codename. Using the Intel "N" and "P" suffixes for these will be less confusing than trying to map to some different naming convention. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-18  x86: remove cruft from <asm/dma-mapping.h>  (Christoph Hellwig, 2 files, -11/+2)
<asm/dma-mapping.h> gets pulled in by all drivers using the DMA API. Remove x86 internal variables and unnecessary includes from it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-18  swiotlb: merge swiotlb-xen initialization into swiotlb  (Christoph Hellwig, 1 file, -5/+0)
Reuse the generic swiotlb initialization for xen-swiotlb. For ARM/ARM64 this works trivially, while for x86 xen_swiotlb_fixup needs to be passed as the remap argument to swiotlb_init_remap/swiotlb_init_late. Note that the lower bound of the swiotlb size is changed to the smaller IO_TLB_MIN_SLABS based value with this patch, but that is fine as the 2MB value used in Xen before was just an optimization and is not the hard lower bound. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-18  x86: remove the IOMMU table infrastructure  (Christoph Hellwig, 6 files, -138/+8)
The IOMMU table tries to separate the different IOMMUs into different backends, but actually requires various cross calls. Rewrite the code to do the generic swiotlb/swiotlb-xen setup directly in pci-dma.c and then just call into the IOMMU drivers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-17  Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -2/+2)
Pull x86 fixes from Thomas Gleixner:

 "Two x86 fixes related to TSX:

  - Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to cover all CPUs which allow to disable it.

  - Disable TSX development mode at boot so that a microcode update which provides TSX development mode does not suddenly make the system vulnerable to TSX Asynchronous Abort"

* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsx: Disable TSX development mode at boot
  x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
2022-04-16  tools/power turbostat: Add Power Limit4 support  (Sumeet Pawnikar, 1 file, -0/+1)
Add Power Limit4 support. Signed-off-by: Sumeet Pawnikar <[email protected]> Acked-by: Zhang Rui <[email protected]> Signed-off-by: Len Brown <[email protected]>
2022-04-15  mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore  (Omar Sandoval, 1 file, -2/+0)
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge the vmap areas instead of doing it lazily. Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") moved the purging from the vunmap() caller to a worker thread. Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin (possibly forever). For example, consider the following scenario:

 1. Thread reads from /proc/vmcore. This eventually calls __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1.
 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2 pages (one page plus the guard page) to the purge list and vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the drain_vmap_work is scheduled.
 3. Thread returns from the kernel and is scheduled out.
 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It frees the 2 pages on the purge list. vmap_lazy_nr is now lazy_max_pages() + 1.
 5. This is still over the threshold, so it tries to purge areas again, but doesn't find anything.
 6. Repeat 5.

If the system is running with only one CPU (which is typical for kdump) and preemption is disabled, then this will never make forward progress: there aren't any more pages to purge, so it hangs. If there is more than one CPU or preemption is enabled, then the worker thread will spin forever in the background. (Note that if there were already pages to be purged at the time that set_iounmap_nonlazy() was called, this bug is avoided.) This can be reproduced with anything that reads from /proc/vmcore multiple times. E.g., vmcore-dmesg /proc/vmcore. It turns out that improvements to vmap() over the years have obsoleted the need for this "optimization". I benchmarked `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed entirely:

      | 5.17   | 5.18+fix | 5.18+removal
  4k  | 40.86s | 40.09s   | 26.73s
  1M  | 24.47s | 23.98s   | 21.84s

The removal was the fastest (by a wide margin with 4k reads). This patch removes set_iounmap_nonlazy(). Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com Fixes: 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") Signed-off-by: Omar Sandoval <[email protected]> Acked-by: Chris Down <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-04-14  x86/asm: Merge load_gs_index()  (Brian Gerst, 2 files, -10/+4)
Merge the 32- and 64-bit implementations of load_gs_index(). Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-14  x86/32: Remove lazy GS macros  (Brian Gerst, 2 files, -6/+1)
GS is always a user segment now. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-13  mm/usercopy: Check kmap addresses properly  (Matthew Wilcox (Oracle), 1 file, -0/+1)
If you are copying to an address in the kmap region, you may not copy across a page boundary, no matter what the size of the underlying allocation. You can't kmap() a slab page because slab pages always come from low memory. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
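Conceptually the added check looks like this (a simplified sketch of what mm/usercopy.c gains; is_kmap_addr(), offset_in_page() and usercopy_abort() are existing kernel helpers):

  /* A kmap'ed address says nothing about the size of the underlying
   * allocation, so the copy must not cross the page boundary. */
  static void check_kmap_bounds(const void *ptr, unsigned long n,
                                bool to_user)
  {
          unsigned long offset;

          if (!is_kmap_addr(ptr))
                  return;

          offset = offset_in_page(ptr);
          if (n > PAGE_SIZE - offset)
                  usercopy_abort("kmap", NULL, to_user, offset, n);
  }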
2022-04-13  x86/uaccess: Implement macros for CMPXCHG on user addresses  (Peter Zijlstra, 1 file, -0/+142)
Add support for CMPXCHG loops on userspace addresses. Provide both an "unsafe" version for tight loops that do their own uaccess begin/end, as well as a "safe" version for use cases where the CMPXCHG is not buried in a loop, e.g. KVM will resume the guest instead of looping when emulation of a guest atomic accesses fails the CMPXCHG. Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on guest PAE PTEs, which are accessed via userspace addresses. Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT; the "+m" constraint fails on some compilers that otherwise support CC_HAS_ASM_GOTO_OUTPUT. Cc: [email protected] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
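Usage then looks roughly like the following (a sketch based on the description above; check uaccess.h for the exact macro signatures):

  /* Atomically OR a bit into a user-mapped word; -EFAULT tells the
   * caller to fix up the mapping and retry. */
  static int set_bit_in_user_word(u32 __user *uptr, u32 bit)
  {
          u32 old, new;
          int ret = -EFAULT;

          if (!user_access_begin(uptr, sizeof(*uptr)))
                  return -EFAULT;

          unsafe_get_user(old, uptr, out);
          do {
                  new = old | bit;
                  /* Faults jump to 'out'; on plain contention the
                   * macro refreshes 'old' and the loop retries. */
          } while (!unsafe_try_cmpxchg_user(uptr, &old, new, out));
          ret = 0;
  out:
          user_access_end();
          return ret;
  }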
2022-04-13  KVM: x86: Use static calls to reduce kvm_pmu_ops overhead  (Like Xu, 1 file, -0/+31)
Use static calls to improve kvm_pmu_ops performance, following the same pattern and naming scheme used by kvm-x86-ops.h. Here are the worst fenced_rdtsc() cycle numbers for the kvm_pmu_ops functions that are most often called (up to 7 digits of calls) when running a single perf test case in a guest on an ICX 2.70GHz host (mitigations=on):

                   | legacy  | static call
  -----------------+---------+----------------
  .pmc_idx_to_pmc  | 1304840 | 994872  (+23%)
  .pmc_is_enabled  |  978670 | 1011750 (-3%)
  .msr_idx_to_pmc  |   47828 | 41690   (+12%)
  .is_valid_msr    |   28786 | 30108   (-4%)

Signed-off-by: Like Xu <[email protected]> [sean: Handle static call updates in pmu.c, tweak changelog] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
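The static_call() machinery being applied here patches each call site into a direct call. The general pattern (a generic example, not the actual kvm_pmu_ops wiring):

  #include <linux/static_call.h>

  static int pmu_op_default(int idx)
  {
          return idx;
  }

  /* Defines the static-call key plus a trampoline that initially
   * routes to the default implementation. */
  DEFINE_STATIC_CALL(my_pmu_op, pmu_op_default);

  /* At init, retarget all call sites to the vendor implementation;
   * afterwards each call is a direct call, no indirect branch. */
  void my_pmu_init(int (*vendor_op)(int))
  {
          static_call_update(my_pmu_op, vendor_op);
  }

  int my_pmu_do(int idx)
  {
          return static_call(my_pmu_op)(idx);
  }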
2022-04-13  KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdata  (Like Xu, 1 file, -2/+1)
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata. That'll save those precious few bytes, and more importantly make the original ops unreachable, i.e. make it harder to sneak in post-init modification bugs. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13  KVM: x86: Move kvm_ops_static_call_update() to x86.c  (Like Xu, 1 file, -14/+0)
The kvm_ops_static_call_update() is defined in kvm_host.h. That's completely unnecessary; it should have exactly one caller, kvm_arch_hardware_setup(). Move the helper to x86.c and have it do the actual memcpy() of the ops in addition to the static call updates. This will also allow for cleanly giving kvm_pmu_ops static_call treatment. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Like Xu <[email protected]> [sean: Move memcpy() into the helper and rename accordingly] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13  KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits  (Sean Christopherson, 1 file, -6/+2)
Derive the mask of RWX bits reported on EPT violations from the mask of RWX bits that are shoved into EPT entries; the layout is the same, the EPT violation bits are simply shifted by three. Use the new shift and a slight copy-paste of the mask derivation instead of completely open coding the same to convert between the EPT entry bits and the exit qualification when synthesizing a nested EPT Violation. No functional change intended. Cc: SU Hang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
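The relationship being exploited, sketched with the VMX bit layout (exit-qualification bits 3-5 report the permissions of the faulting EPT entry, i.e. its R/W/X bits 0-2 shifted left by three):

  #define VMX_EPT_READABLE_MASK   0x1ULL          /* EPTE bit 0 */
  #define VMX_EPT_WRITABLE_MASK   0x2ULL          /* EPTE bit 1 */
  #define VMX_EPT_EXECUTABLE_MASK 0x4ULL          /* EPTE bit 2 */
  #define VMX_EPT_RWX_MASK        (VMX_EPT_READABLE_MASK |  \
                                   VMX_EPT_WRITABLE_MASK |  \
                                   VMX_EPT_EXECUTABLE_MASK)

  /* EPT violation exit qualification bits 3-5 mirror the EPTE RWX bits. */
  #define EPT_VIOLATION_RWX_SHIFT 3
  #define EPT_VIOLATION_RWX_MASK  (VMX_EPT_RWX_MASK << EPT_VIOLATION_RWX_SHIFT)

  /* Synthesize the permission portion of a nested EPT violation from
   * the effective protections of the walked EPT entry. */
  static inline unsigned long ept_rwx_to_exit_qual(unsigned long epte_rwx)
  {
          return (epte_rwx & VMX_EPT_RWX_MASK) << EPT_VIOLATION_RWX_SHIFT;
  }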
2022-04-13  KVM: VMX: replace 0x180 with EPT_VIOLATION_* definition  (SU Hang, 1 file, -0/+2)
Use the self-expressing macro definitions EPT_VIOLATION_GVA_VALIDATION and EPT_VIOLATION_GVA_TRANSLATED instead of the magic number 0x180 in FNAME(walk_addr_generic)(). Signed-off-by: SU Hang <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13  kvm: x86: Adjust the location of pkru_mask of kvm_mmu to reduce memory  (Peng Hao, 1 file, -8/+9)
Move the pkru_mask field to just after direct_map to restore 8-byte alignment. This reduces the size of struct kvm_mmu by 8 bytes. Signed-off-by: Peng Hao <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
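The effect of this kind of reordering is easy to demonstrate with pahole-style offsets (an illustrative struct, not kvm_mmu itself):

  #include <stdio.h>

  /* A 4-byte member wedged between pointers forces 4 bytes of
   * padding before the next 8-byte-aligned field. */
  struct before {
          void *root;             /*  0..7            */
          unsigned int pkru;      /*  8..11, + 4 pad  */
          void *direct;           /* 16..23           */
          unsigned int flags;     /* 24..27, + 4 pad  */
  };                              /* sizeof == 32     */

  /* Grouping the two 4-byte members removes the internal padding. */
  struct after {
          void *root;             /*  0..7            */
          void *direct;           /*  8..15           */
          unsigned int pkru;      /* 16..19           */
          unsigned int flags;     /* 20..23           */
  };                              /* sizeof == 24     */

  int main(void)
  {
          printf("before=%zu after=%zu\n",
                 sizeof(struct before), sizeof(struct after));
          return 0;
  }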
2022-04-13  Merge branch 'kvm-older-features' into HEAD  (Paolo Bonzini, 3 files, -14/+32)
Merge branch for features that did not make it into 5.18:

 * New ioctls to get/set TSC frequency for a whole VM
 * Allow userspace to opt out of hypercall patching

Nested virtualization improvements for AMD:

 * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF)
 * Allow AVIC to co-exist with a nested guest running
 * Fixes for LBR virtualization when a nested guest is running, and nested LBR virtualization support
 * PAUSE filtering for nested hypervisors

Guest support:

 * Decoupling of vcpu_is_preempted from PV spinlocks

Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13  x86/apic: Clarify i82489DX bit overlap in APIC_LVT0  (Thomas Gleixner, 1 file, -6/+0)
Daniel stumbled over the bit overlap of the i82489DX external APIC and the TSC deadline timer configuration bit in modern APICs, which is neither documented in the code nor in the current SDM. Maciej provided links to the original i82489DX/486 documentation. See Link. Remove the i82489DX macro maze, use an i82489DX-specific define in the apic code and document the overlap in a comment. Reported-by: Daniel Vacek <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Maciej W. Rozycki <[email protected]> Link: https://lore.kernel.org/r/87ee22f3ci.ffs@tglx
2022-04-12  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 1 file, -5/+5)
Pull kvm fixes from Paolo Bonzini:

 "x86:
  - Miscellaneous bugfixes
  - A small cleanup for the new workqueue code
  - Documentation syntax fix

  RISC-V:
  - Remove hgatp zeroing in kvm_arch_vcpu_put()
  - Fix alignment of the guest_hang() in KVM selftest
  - Fix PTE A and D bits in KVM selftest
  - Missing #include in vcpu_fp.c

  ARM:
  - Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2
  - Fix the MMU write-lock not being taken on THP split
  - Fix mixed-width VM handling
  - Fix potential UAF when debugfs registration fails
  - Various selftest updates for all of the above"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (24 commits)
  KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU
  KVM: SVM: Do not activate AVIC for SEV-enabled guest
  Documentation: KVM: Add SPDX-License-Identifier tag
  selftests: kvm: add tsc_scaling_sync to .gitignore
  RISC-V: KVM: include missing hwcap.h into vcpu_fp
  KVM: selftests: riscv: Fix alignment of the guest_hang() function
  KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table
  RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put()
  selftests: KVM: Free the GIC FD when cleaning up in arch_timer
  selftests: KVM: Don't leak GIC FD across dirty log test iterations
  KVM: Don't create VM debugfs files outside of the VM directory
  KVM: selftests: get-reg-list: Add KVM_REG_ARM_FW_REG(3)
  KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
  KVM: arm64: selftests: Introduce vcpu_width_config
  KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs
  KVM: arm64: vgic: Remove unnecessary type castings
  KVM: arm64: Don't split hugepages outside of MMU write lock
  KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler
  KVM: arm64: Actually prevent SMC64 SYSTEM_RESET2 from AArch32
  KVM: arm64: Generally disallow SMC64 for AArch32 guests
  ...
2022-04-12  stat: fix inconsistency between struct stat and struct compat_stat  (Mikulas Patocka, 1 file, -4/+2)
struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding. This patch fixes struct compat_stat to match struct stat. [ Historical note: the old x86 'struct stat' did have that 16-bit field that the compat layer had kept around, but it was changed back in 2003 by "struct stat - support larger dev_t": https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d and back in those days, the x86_64 port was still new, and separate from the i386 code, and had already picked up the old version with a 16-bit st_dev field ] Note that we can't change compat_dev_t because it is used by compat_loop_info. Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259. Signed-off-by: Mikulas Patocka <[email protected]> Cc: Andreas Schwab <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
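The shape of the change (a sketch of the relevant fields only; see arch/x86/include/asm/compat.h for the real struct; the overall layout, and hence the compat ABI, is unchanged because the old padding is absorbed):

  /* Before: a 16-bit device number truncates NVMe's major 259. */
  struct compat_stat_old {
          u16 st_dev;
          u16 __pad1;     /* 16-bit dev plus explicit padding */
          /* ... */
  };

  /* After: 32-bit st_dev/st_rdev matching struct stat. */
  struct compat_stat_new {
          u32 st_dev;
          /* ... */
  };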
2022-04-12  x86/32: Simplify ELF_CORE_COPY_REGS  (Brian Gerst, 1 file, -13/+2)
GS is now always a user segment, so there is no difference between user and kernel registers. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-11  KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU  (Vitaly Kuznetsov, 1 file, -3/+1)
The following WARN is triggered from kvm_vm_ioctl_set_clock():

  WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
  ...
  CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G W O 5.16.0.stable #20
  Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
  RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
  ...
  Call Trace:
  <TASK>
  ? kvm_write_guest+0x114/0x120 [kvm]
  kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
  kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
  ? schedule+0x4e/0xc0
  ? __cond_resched+0x1a/0x50
  ? futex_wait+0x166/0x250
  ? __send_signal+0x1f1/0x3d0
  kvm_vm_ioctl+0x747/0xda0 [kvm]
  ...

The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty() is called without an active vCPU") but the change seems to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's no real need to actually write to guest memory to invalidate the TSC page, this can be done by the first vCPU which goes through kvm_guest_time_update(). Reported-by: Maxim Levitsky <[email protected]> Reported-by: Naresh Kamboju <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]>
2022-04-11  KVM: SVM: Do not activate AVIC for SEV-enabled guest  (Suravee Suthikulpanit, 1 file, -0/+1)
Since the current AVIC implementation cannot support encrypted memory, inhibit AVIC for SEV-enabled guests. Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-11  x86/tsx: Disable TSX development mode at boot  (Pawan Gupta, 1 file, -2/+2)
A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline, and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA). To be on the safe side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later. [*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557 [ bp: Drop unstable web link, massage heavily. ] Suggested-by: Andrew Cooper <[email protected]> Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Neelima Krishnan <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
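For context, the architectural knob used by the companion commit in this series (MSR number and bits from the SDM; a simplified sketch, not the kernel's actual tsx.c logic):

  #define MSR_IA32_TSX_CTRL       0x00000122
  #define TSX_CTRL_RTM_DISABLE    (1ULL << 0)    /* force RTM aborts  */
  #define TSX_CTRL_CPUID_CLEAR    (1ULL << 1)    /* hide RTM/HLE bits */

  /* Disable TSX on CPUs that enumerate IA32_TSX_CTRL via
   * IA32_ARCH_CAPABILITIES bit 7. */
  static void tsx_disable(void)
  {
          u64 ctrl;

          rdmsrl(MSR_IA32_TSX_CTRL, ctrl);
          ctrl |= TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;
          wrmsrl(MSR_IA32_TSX_CTRL, ctrl);
  }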
2022-04-10  Merge tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 3 files, -20/+23)
Pull x86 fixes from Borislav Petkov:

 - Fix the MSI message data struct definition

 - Use local labels in the exception table macros to avoid symbol conflicts with clang LTO builds

 - A couple of fixes to objtool checking of the relatively newly added SLS and IBT code

 - Rename a local var in the WARN* macro machinery to prevent shadowing

* tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msi: Fix msi message data shadow struct
  x86/extable: Prefer local labels in .set directives
  x86,bpf: Avoid IBT objtool warning
  objtool: Fix SLS validation for kcov tail-call replacement
  objtool: Fix IBT tail-call detection
  x86/bug: Prevent shadowing in __WARN_FLAGS
  x86/mm/tlb: Revert retpoline avoidance approach