Age | Commit message (Collapse) | Author | Files | Lines |
|
This is the same as commit 147277540bbc ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error
codes", 2016-11-23) added a new error code to aid nested page fault
handling. The commit unprotects (kvm_mmu_unprotect_page) the page when
we get a NPF due to guest page table walk where the page was marked RO.
However, if an L0->L2 shadow nested page table can also be marked read-only
when a page is read only in L1's nested page table. If such a page
is accessed by L2 while walking page tables it can cause a nested
page fault (page table walks are write accesses). However, after
kvm_mmu_unprotect_page we may get another page fault, and again in an
endless stream.
To cover this use case, we qualify the new error_code check with
vcpu->arch.mmu_direct_map so that the error_code check would run on L1
guest, and not the L2 guest. This avoids hitting the above scenario.
Fixes: 147277540bbc54119172481c8ef6d930cc9fbfc2
Cc: [email protected]
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Reported by syzkaller:
The kvm-intel.unrestricted_guest=0
WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
Call Trace:
? put_pid+0x3a/0x50
? rcu_read_lock_sched_held+0x79/0x80
? kmem_cache_free+0x2f2/0x350
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
The syszkaller folks reported a residual mmio emulation request to userspace
due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
several threads to launch the same vCPU, the thread which lauch this vCPU after
the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
trigger the warning.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/kvm.h>
#include <stdio.h>
int kvmcpu;
struct kvm_run *run;
void* thr(void* arg)
{
int res;
res = ioctl(kvmcpu, KVM_RUN, 0);
printf("ret1=%d exit_reason=%d suberror=%d\n",
res, run->exit_reason, run->internal.suberror);
return 0;
}
void test()
{
int i, kvm, kvmvm;
pthread_t th[4];
kvm = open("/dev/kvm", O_RDWR);
kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
srand(getpid());
for (i = 0; i < 4; i++) {
pthread_create(&th[i], 0, thr, 0);
usleep(rand() % 10000);
}
for (i = 0; i < 4; i++)
pthread_join(th[i], 0);
}
int main()
{
for (;;) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
test();
exit(0);
}
int status;
while (waitpid(pid, &status, __WALL) != pid) {}
}
return 0;
}
This patch fixes it by resetting the vcpu->mmio_needed once we receive
the triple fault to avoid the residue.
Reported-by: Dmitry Vyukov <[email protected]>
Tested-by: Dmitry Vyukov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Since the kernel segment registers are not prepared at the
entry of irq-entry code, if a kprobe on such code is
jump-optimized, accessing per-CPU variables may cause a
kernel panic.
However, if the kprobe is not optimized, it triggers an int3
exception and sets segment registers correctly.
With this patch we check the probe-address and if it is in the
irq-entry code, it prohibits optimizing such kprobes.
This means we can continue probing such interrupt handlers by kprobes
but it is not optimized anymore.
Reported-by: Francis Deslauriers <[email protected]>
Tested-by: Francis Deslauriers <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: David S . Miller <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/150172795654.27216.9824039077047777477.stgit@devbox
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Generate irqentry and softirqentry text sections without
any Kconfig dependencies. This will add extra sections, but
there should be no performace impact.
Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Ananth N Mavinakayanahalli <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: David S . Miller <[email protected]>
Cc: Francis Deslauriers <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/150172789110.27216.3955739126693102122.stgit@devbox
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Make this structure const as it is only used during copy operation.
Signed-off-by: Bhumika Goyal <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Subarchitecture support (mflags-y) was removed from x86 in this commit:
6bda2c8b32fe ("x86: remove subarchitecture support")
So drop the mflags-y usage from the Makefile.
Signed-off-by: Cao jin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Apparently the binutils 2.20 assembler can't handle the '&&' operator in
the UNWIND_HINT_REGS macro. Rearrange the macro to do without it.
This fixes the following error:
arch/x86/entry/entry_64.S: Assembler messages:
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
Reported-by: Andrew Morton <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 39358a033b2e ("objtool, x86: Add facility for asm code to provide unwind hints")
Link: http://lkml.kernel.org/r/e2ad97c1ae49a484644b4aaa4dd3faa4d6d969b2.1502116651.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The segment register high words on x86_32 may contain garbage.
Teach regs_get_register() to read them as u16 instead of unsigned
long.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/0b76f6dbe477b7b1a81938fddcc3c483d48f0ff2.1502314765.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Xen's raw SYSCALL entries are much less weird than native. Rather
than fudging them to look like native entries, use the Xen-provided
stack frame directly.
This lets us eliminate entry_SYSCALL_64_after_swapgs and two uses of
the SWAPGS_UNSAFE_STACK paravirt hook. The SYSENTER code would
benefit from similar treatment.
This makes one change to the native code path: the compat
instruction that clears the high 32 bits of %rax is moved slightly
later. I'd be surprised if this affects performance at all.
Tested-by: Juergen Gross <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/7c88ed36805d36841ab03ec3b48b4122c4418d71.1502164668.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This closes a hole in our SMAP implementation.
This patch comes from grsecurity. Good catch!
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/314cc9f294e8f14ed85485727556ad4f15bb1659.1502159503.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In Family 17h, the number of cores sharing a cache level is obtained
from the Cache Properties CPUID leaf (0x8000001d) by passing in the
cache level in ECX. In prior families, a cache level of 2 was used to
determine this information.
To get the right information, irrespective of Family, iterate over
the cache levels using CPUID 0x8000001d. The last level cache is the
last value to return a non-zero value in EAX.
Signed-off-by: Janakarajan Natarajan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/5ab569025b39cdfaeca55b571d78c0fc800bdb69.1497452002.git.Janakarajan.Natarajan@amd.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In Family 17h, L3 is the last level cache as opposed to L2 in previous
families. Avoid this name confusion and rename X86_FEATURE_PERFCTR_L2 to
X86_FEATURE_PERFCTR_LLC to indicate the performance counter on the last
level of cache.
Signed-off-by: Janakarajan Natarajan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/016311029fdecdc3fdc13b7ed865c6cbf48b2f15.1497452002.git.Janakarajan.Natarajan@amd.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Vince reported the following rdpmc() testcase failure:
> Failing test case:
>
> fd=perf_event_open();
> addr=mmap(fd);
> exec() // without closing or unmapping the event
> fd=perf_event_open();
> addr=mmap(fd);
> rdpmc() // GPFs due to rdpmc being disabled
The problem is of course that exec() plays tricks with what is
current->mm, only destroying the old mappings after having
installed the new mm.
Fix this confusion by passing along vma->vm_mm instead of relying on
current->mm.
Reported-by: Vince Weaver <[email protected]>
Tested-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 1e0fb9ec679c ("perf: Add pmu callbacks to track event mapping and unmapping")
Link: http://lkml.kernel.org/r/[email protected]
[ Minor cleanups. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the x86_64 eBPF JIT.
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It was reported that the sha1 AVX2 function(sha1_transform_avx2) is
reading ahead beyond its intended data, and causing a crash if the next
block is beyond page boundary:
http://marc.info/?l=linux-crypto-vger&m=149373371023377
This patch makes sure that there is no overflow for any buffer length.
It passes the tests written by Jan Stancek that revealed this problem:
https://github.com/jstancek/sha1-avx2-crash
I have re-enabled sha1-avx2 by reverting commit
b82ce24426a4071da9529d726057e4e642948667
Cc: <[email protected]>
Fixes: b82ce24426a4 ("crypto: sha1-ssse3 - Disable avx2")
Originally-by: Ilya Albrekht <[email protected]>
Tested-by: Jan Stancek <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Reported-by: Jan Stancek <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.
Signed-off-by: Longpeng(Mike) <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).
But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.
This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted. kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.
Signed-off-by: Longpeng(Mike) <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry().
Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has().
Signed-off-by: Radim Krčmář <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid,
X86_FEATURE_XYZ), which gets rid of many very similar helpers.
When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but
this information isn't in common code, so we recreate it for KVM.
Add some BUILD_BUG_ONs to make sure that it runs nicely.
Signed-off-by: Radim Krčmář <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
bit(X86_FEATURE_NRIPS) is 3 since 2ccd71f1b278 ("x86/cpufeature: Move
some of the scattered feature bits to x86_capability"), so we can
simplify the code.
Signed-off-by: Radim Krčmář <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Enable VMFUNC in the secondary execution controls. This simplifies the
changes necessary to expose it to nested hypervisors. VMFUNCs still
cause #UD when invoked.
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Let's also just use the underlying functions directly here.
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
[Rebased on top of 9f744c597460 ("KVM: nVMX: do not pin the VMCS12")]
Signed-off-by: Radim Krčmář <[email protected]>
|
|
nested_get_page() just sounds confusing. All we want is a page from G1.
This is even unrelated to nested.
Let's introduce kvm_vcpu_gpa_to_page() so we don't get too lengthy
lines.
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
[Squash pasto fix from Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Expose the "Enable INVPCID" secondary execution control to the guest
and properly reflect the exit reason.
In addition, before this patch the guest was always running with
INVPCID enabled, causing pcid.flat's "Test on INVPCID when disabled"
test to fail.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It has been experimentally confirmed that supporting these two MSRs is one
of the necessary conditions for nested Hyper-V to use the TSC page. Modern
Windows guests are noticeably slower when they fall back to reading
timestamps from the HV_X64_MSR_TIME_REF_COUNT MSR instead of using the TSC
page.
The newly supported MSRs are advertised with the AccessFrequencyRegs
partition privilege flag and CPUID.40000003H:EDX[8] "Support for
determining timer frequencies is available" (both outside of the scope of
this KVM patch).
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Ladi Prosek <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Pull KVM fixes from Radim Krčmář:
"ARM:
- Yet another race with VM destruction plugged
- A set of small vgic fixes
x86:
- Preserve pending INIT
- RCU fixes in paravirtual async pf, VM teardown, and VMXOFF
emulation
- nVMX interrupt injection and dirty tracking fixes
- initialize to make UBSAN happy"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchg
KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"
KVM: nVMX: mark vmcs12 pages dirty on L2 exit
kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
KVM: nVMX: do not pin the VMCS12
KVM: avoid using rcu_dereference_protected
KVM: X86: init irq->level in kvm_pv_kick_cpu_op
KVM: X86: Fix loss of pending INIT due to race
KVM: async_pf: make rcu irq exit if not triggered from idle task
KVM: nVMX: fixes to nested virt interrupt injection
KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12
KVM: arm/arm64: Handle hva aging while destroying the vm
KVM: arm/arm64: PMU: Fix overflow interrupt injection
KVM: arm/arm64: Fix bug in advertising KVM_CAP_MSI_DEVID capability
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"The recent irq core changes unearthed API abuse in the HPET code,
which manifested itself in a suspend/resume regression.
The fix replaces the cruft with the proper function calls and cures
the regression"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hpet: Cure interface abuse in the resume path
|
|
There are quite a number of occurrences in the kernel of the pattern
if (dst != src)
memcpy(dst, src, walk.total % AES_BLOCK_SIZE);
crypto_xor(dst, final, walk.total % AES_BLOCK_SIZE);
or
crypto_xor(keystream, src, nbytes);
memcpy(dst, keystream, nbytes);
where crypto_xor() is preceded or followed by a memcpy() invocation
that is only there because crypto_xor() uses its output parameter as
one of the inputs. To avoid having to add new instances of this pattern
in the arm64 code, which will be refactored to implement non-SIMD
fallbacks, add an alternative implementation called crypto_xor_cpy(),
taking separate input and output arguments. This removes the need for
the separate memcpy().
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
We're about to amend ACPI bus scan with DMI checks whether we're running
on a Mac to support Apple device properties in AML. The DMI checks are
performed for every single device, adding overhead for everything x86
that isn't Apple, which is the majority. Rafael and Andy therefore
request to perform the DMI match only once and cache the result.
Outside of ACPI various other Apple DMI checks exist and it seems
reasonable to use the cached value there as well. Rafael, Andy and
Darren suggest performing the DMI check in arch code and making it
available with a header in include/linux/platform_data/x86/.
To this end, add early_platform_quirks() to arch/x86/kernel/quirks.c
to perform the DMI check and invoke it from setup_arch(). Switch over
all existing Apple DMI checks, thereby fixing two deficiencies:
* They are now #defined to false on non-x86 arches and can thus be
optimized away if they're located in cross-arch code.
* Some of them only match "Apple Inc." but not "Apple Computer, Inc.",
which is used by BIOSes released between January 2006 (when the first
x86 Macs started shipping) and January 2007 (when the company name
changed upon introduction of the iPhone).
Suggested-by: Andy Shevchenko <[email protected]>
Suggested-by: Rafael J. Wysocki <[email protected]>
Suggested-by: Darren Hart <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Acked-by: Mika Westerberg <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
* pm-cpufreq-x86:
cpufreq: x86: Make scaling_cur_freq behave more as expected
* pm-cpufreq-docs:
cpufreq: docs: Add missing cpuinfo_cur_freq description
* intel_pstate:
cpufreq: intel_pstate: Drop ->get from intel_pstate structure
|
|
------------[ cut here ]------------
WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
Call Trace:
vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
? kvm_arch_vcpu_load+0x62/0x230 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
do_syscall_64+0x8f/0x750
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25
This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
means that tells the kernel to not make use of any IOAPICs that may be present
in the system.
Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
variable passed from vcpu_enter_guest() which means that the L0's userspace
requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
L0's userspace reqeusts an irq window) is true, so there is no interrupt which
L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
interrupt on exit" for the irq window requirement in this scenario.
This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
if there is no L1 requirement to inject an interrupt to L2.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
[Added code comment to make it obvious that the behavior is not correct.
We should do a userspace exit with open interrupt window instead of the
nested VM exit. This patch still improves the behavior, so it was
accepted as a (temporary) workaround.]
Signed-off-by: Radim Krčmář <[email protected]>
|
|
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.
24.11.1 Software Use of Virtual-Machine Control Structures
...
If a logical processor leaves VMX operation, any VMCSs active on
that logical processor may be corrupted (see below). To prevent
such corruption of a VMCS that may be used either after a return
to VMX operation or on another logical processor, software should
execute VMCLEAR for that VMCS before executing the VMXOFF instruction
or removing power from the processor (e.g., as part of a transition
to the S3 and S4 power states).
...
This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.
Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore. current_vmptr is enough to read and write the VMCS12.
And David Matlack noted:
This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
[Added David Matlack's note and nested_release_page_clean() fix.]
Signed-off-by: Radim Krčmář <[email protected]>
|
|
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' is random, it doesn't matter but
makes UBSAN unhappy:
UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
[<ffffffff81f030b6>] dump_stack+0x1e/0x20
[<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
[<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
[<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
[<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
[<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
[<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
[<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
[<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...
Signed-off-by: Longpeng(Mike) <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
When SMP VM start, AP may lost INIT because of receiving INIT between
kvm_vcpu_ioctl_x86_get/set_vcpu_events.
vcpu 0 vcpu 1
kvm_vcpu_ioctl_x86_get_vcpu_events
events->smi.latched_init = 0
send INIT to vcpu1
set vcpu1's pending_events
kvm_vcpu_ioctl_x86_set_vcpu_events
if (events->smi.latched_init == 0)
clear INIT in pending_events
This patch fixes it by just update SMM related flags if we are in SMM.
Thanks Peng Hao for the report and original commit message.
Reported-by: Peng Hao <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
CPUID.(EAX=0x10, ECX=res#):EBX[31:0] reports a bit mask for a resource.
Each set bit within the length of the CBM indicates the corresponding
unit of the resource allocation may be used by other entities in the
platform (e.g. an integrated graphics engine or hardware units outside
the processor core and have direct access to the resource). Each
cleared bit within the length of the CBM indicates the corresponding
allocation unit can be configured to implement a priority-based
allocation scheme without interference with other hardware agents in
the system. Bits outside the length of the CBM are reserved.
More details on the bit mask are described in x86 Software Developer's
Manual.
The bitmask is shown in "info" directory for each resource. It's
up to user to decide how to use the bitmask within a CBM in a partition
to share or isolate a resource with other executing units.
Suggested-by: Reinette Chatre <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
|
|
Set up a delayed work queue for each domain that will read all
the MBM counters of active RMIDs once per second to make sure
that they don't wrap around between reads from users.
[Tony: Added the initializations for the work structure and completed
the patch]
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-29-git-send-email-vikas.shivappa@linux.intel.com
|
|
MBM counters are monotonically increasing counts representing the total
memory bytes at a particular time. In order to calculate total_bytes for
an rdtgroup, we store the value of the counter when we create an
rdtgroup or when a new domain comes online.
When the total_bytes(all memory controller bytes) or local_bytes(local
memory controller bytes) file in "mon_data" is read it shows the
total bytes for that rdtgroup since its creation. User can snapshot this
at different time intervals to obtain bytes/second.
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-28-git-send-email-vikas.shivappa@linux.intel.com
|
|
Check CPUID bits for whether each of the MBM events is supported.
Allocate space for each RMID for each counter in each domain to save
previous MSR counter value and running total of data.
Create files in each of the monitor directories.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-27-git-send-email-vikas.shivappa@linux.intel.com
|
|
Resource groups have a per domain directory under "mon_data". Add or
remove these directories as and when domains come online and go offline.
Also update the per cpu rmids and cache upon onlining and offlining
cpus.
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-26-git-send-email-vikas.shivappa@linux.intel.com
|
|
OS associates an RMID/CLOSid to a task by writing the per CPU
IA32_PQR_ASSOC MSR when a task is scheduled in.
The sched_in code will stay as no-op unless we are running on Intel SKU
which supports either resource control or monitoring and we also enable
them by mounting the resctrl fs. The per cpu CLOSid/RMID values are
cached and the write is performed only when a task with a different
CLOSid/RMID is scheduled in.
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-25-git-send-email-vikas.shivappa@linux.intel.com
|
|
Introduce the usage of rdt_enable_key in sched_in code as a preparation
to add RDT monitoring support for sched_in.
Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/1501017287-28083-24-git-send-email-vikas.shivappa@linux.intel.com
|