Age | Commit message (Collapse) | Author | Files | Lines |
|
The nested VMX controls MSRs can be initialized by the global capability
values stored in vmcs_config.
Signed-off-by: Chenyi Qiang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
system ioctl KVM_GET_MSR. Add them to the setup of nested VMX controls MSR.
Signed-off-by: Chenyi Qiang <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The save and ctl pointers are passed uninitialized to kfree() when
svm_set_nested_state() follows the 'goto out_set_gif' path. While the
issue could've been fixed by initializing these on-stack varialbles to
NULL, it seems preferable to eliminate 'out_set_gif' label completely as
it is not actually a failure path and duplicating a single svm_set_gif()
call doesn't look too bad.
[ bp: Drop obscure Addresses-Coverity: tag. ]
Fixes: 6ccbd29ade0d ("KVM: SVM: nested: Don't allocate VMCB structures on stack")
Reported-by: Dan Carpenter <[email protected]>
Reported-by: Joerg Roedel <[email protected]>
Reported-by: Colin King <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Acked-by: Joerg Roedel <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In the architecture independent version of hyperv-tlfs.h, commit c55a844f46f958b
removed the "X64" in the symbol names so they would make sense for both x86 and
ARM64. That commit added aliases with the "X64" in the x86 version of hyperv-tlfs.h
so that existing x86 code would continue to compile.
As a cleanup, update the x86 code to use the symbols without the "X64", then remove
the aliases. There's no functional change.
Signed-off-by: Joseph Salisbury <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Pull more kvm fixes from Paolo Bonzini:
"Five small fixes.
The nested migration bug will be fixed with a better API in 5.10 or
5.11, for now this is a fix that works with existing userspace but
keeps the current ugly API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Add a dedicated INVD intercept routine
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: x86: fix MSR_IA32_TSC read for nested migration
selftests: kvm: Fix assert failure in single-step test
KVM: x86: VMX: Make smaller physical guest address space support user-configurable
|
|
The INVD instruction intercept performs emulation. Emulation can't be done
on an SEV guest because the guest memory is encrypted.
Provide a dedicated intercept routine for the INVD intercept. And since
the instruction is emulated as a NOP, just skip it instead.
Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command")
Signed-off-by: Tom Lendacky <[email protected]>
Message-Id: <a0b9a19ffa7fef86a3cc700c7ea01cb2731e04e5.1600972918.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
toggled inadvertantly skipped the MMU context reset due to the mask
of bits that triggers PDPTR loads also being used to trigger MMU context
resets.
Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
Cc: Jim Mattson <[email protected]>
Cc: Peter Shier <[email protected]>
Cc: Oliver Upton <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the msrs it wants to adjust, and these
that it doesn't should be read by the guest as if the host had read it.
However IA32_TSC is an exception. Even when not intercepted, guest still
reads the value + TSC offset.
The write however does not take any TSC offset into account.
This is documented in Intel's SDM and seems also to happen on AMD as well.
This creates a problem when userspace wants to read the IA32_TSC value and then
write it. (e.g for migration)
In this case it reads L2 value but write is interpreted as an L1 value.
To fix this make the userspace initiated reads of IA32_TSC return L1 value
as well.
Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
user-configurable
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
Since smaller physical address spaces are only supported on VMX, the
parameter is only exposed in the kvm_intel module.
For now disable support by default, and let the user decide if they want
to enable it.
Modifications to VMX page fault and EPT violation handling will depend
on whether that parameter is enabled.
Signed-off-by: Mohammed Gamal <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
|
|
encryption domains
In some hardware implementations, coherency between the encrypted and
unencrypted mappings of the same physical page in a VM is enforced. In
such a system, it is not required for software to flush the VM's page
from all CPU caches in the system prior to changing the value of the
C-bit for the page.
So check that bit before flushing the cache.
Signed-off-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Pull kvm fixes from Paolo Bonzini:
"A bit on the bigger side, mostly due to me being on vacation, then
busy, then on parental leave, but there's nothing worrisome.
ARM:
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced (dirty
logging, for example)
- Fix tracing output of 64bit values
x86:
- nSVM state restore fixes
- Async page fault fixes
- Lots of small fixes everywhere"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: emulator: more strict rsm checks.
KVM: nSVM: more strict SMM checks when returning to nested guest
SVM: nSVM: setup nested msr permission bitmap on nested state load
SVM: nSVM: correctly restore GIF on vmexit from nesting after migration
x86/kvm: don't forget to ACK async PF IRQ
x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro
KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit
KVM: SVM: avoid emulation with stale next_rip
KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN
KVM: SVM: Periodically schedule when unregistering regions on destroy
KVM: MIPS: Change the definition of kvm type
kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed
KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control
KVM: fix memory leak in kvm_io_bus_unregister_dev()
KVM: Check the allocation of pv cpu mask
KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
KVM: arm64: Update page shift if stage 2 block mapping not supported
KVM: arm64: Fix address truncation in traces
KVM: arm64: Do not try to map PUDs when they are folded into PMD
arm64/x86: KVM: Introduce steal-time cap
...
|
|
Don't ignore return values in rsm_load_state_64/32 to avoid
loading invalid state from SMM state area if it was tampered with
by the guest.
This is primarly intended to avoid letting guest set bits in EFER
(like EFER.SVME when nesting is disabled) by manipulating SMM save area.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
* check that guest is 64 bit guest, otherwise the SVM related fields
in the smm state area are not defined
* If the SMM area indicates that SMM interrupted a running guest,
check that EFER.SVME which is also saved in this area is set, otherwise
the guest might have tampered with SMM save area, and so indicate
emulation failure which should triple fault the guest.
* Check that that guest CPUID supports SVM (due to the same issue as above)
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This code was missing and was forcing the L2 run with L1's msr
permission bitmap
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Currently code in svm_set_nested_state copies the current vmcb control
area to L1 control area (hsave->control), under assumption that
it mostly reflects the defaults that kvm choose, and later qemu
overrides these defaults with L2 state using standard KVM interfaces,
like KVM_SET_REGS.
However nested GIF (which is AMD specific thing) is by default is true,
and it is copied to hsave area as such.
This alone is not a big deal since on VMexit, GIF is always set to false,
regardless of what it was on VM entry. However in nested_svm_vmexit we
were first were setting GIF to false, but then we overwrite the control
fields with value from the hsave area. (including the nested GIF field
itself if GIF virtualization is enabled).
Now on normal vm entry this is not a problem, since GIF is usually false
prior to normal vm entry, and this is the value that copied to hsave,
and then restored, but this is not always the case when the nested state
is loaded as explained above.
To fix this issue, move svm_set_gif after we restore the L1 control
state in nested_svm_vmexit, so that even with wrong GIF in the
saved L1 control area, we still clear GIF as the spec says.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
According to SDM 27.2.4, Event delivery causes an APIC-access VM exit.
Don't report internal error and freeze guest when event delivery causes
an APIC-access exit, it is handleable and the event will be re-injected
during the next vmentry.
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
svm->next_rip is reset in svm_vcpu_run() only after calling
svm_exit_handlers_fastpath(), which will cause SVM's
skip_emulated_instruction() to write a stale RIP.
We can move svm_exit_handlers_fastpath towards the end of
svm_vcpu_run(). To align VMX with SVM, keep svm_complete_interrupts()
close as well.
Suggested-by: Sean Christopherson <[email protected]>
Cc: Paul K. <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
[Also move vmcb_mark_all_clean before any possible write to the VMCB.
- Paolo]
|
|
Even without in-kernel LAPIC we should allow writing '0' to
MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In
particular, QEMU with 'kernel-irqchip=off' fails to start
a guest with
qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0
Fixes: 9d3c447c72fb2 ("KVM: X86: Fix async pf caused null-ptr-deref")
Reported-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
[Actually commit the version proposed by Sean Christopherson. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
There may be many encrypted regions that need to be unregistered when a
SEV VM is destroyed. This can lead to soft lockups. For example, on a
host running 4.15:
watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
CPU: 206 PID: 194348 Comm: t_virtual_machi
RIP: 0010:free_unref_page_list+0x105/0x170
...
Call Trace:
[<0>] release_pages+0x159/0x3d0
[<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
[<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
[<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
[<0>] kvm_arch_destroy_vm+0x47/0x200
[<0>] kvm_put_kvm+0x1a8/0x2f0
[<0>] kvm_vm_release+0x25/0x30
[<0>] do_exit+0x335/0xc10
[<0>] do_group_exit+0x3f/0xa0
[<0>] get_signal+0x1bc/0x670
[<0>] do_signal+0x31/0x130
Although the CLFLUSH is no longer issued on every encrypted region to be
unregistered, there are no other changes that can prevent soft lockups for
very large SEV VMs in the latest kernel.
Periodically schedule if necessary. This still holds kvm->lock across the
resched, but since this only happens when the VM is destroyed this is
assumed to be acceptable.
Signed-off-by: David Rientjes <[email protected]>
Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When kvm_mmu_get_page() gets a page with unsynced children, the spt
pagetable is unsynchronized with the guest pagetable. But the
guest might not issue a "flush" operation on it when the pagetable
entry is changed from zero or other cases. The hypervisor has the
responsibility to synchronize the pagetables.
KVM behaved as above for many years, But commit 8c8560b83390
("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
inadvertently included a line of code to change it without giving any
reason in the changelog. It is clear that the commit's intention was to
change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
needlessly flush other contexts; however, one of the hunks changed
a nearby KVM_REQ_MMU_SYNC instead. This patch changes it back.
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Sean Christopherson <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <[email protected]>
fixes: 8c8560b83390 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
in exit_ctls_high.
Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control")
Signed-off-by: Chenyi Qiang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3. It loads them into
vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.
At the subsequent assumed reentry into L2, the mmu will call
vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
if the L2 CRn write intercept always resumes L2.
The resume path calls vmx_check_nested_events which checks for
exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending it will
reflect the corresponding exit into L1. Live migration at this point
would also cause a missed immediate reentry into L2.
After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
values stored at last L2 exit. A repro of this bug showed L2 entering
triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
When L2 is in PAE paging mode add a call to ept_load_pdptrs before
leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
vcpu->arch.walk_mmu->pdptrs[].
Tested:
kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
Verified that test fails without the fix.
Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
failures.
Signed-off-by: Peter Shier <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for Linux 5.9, take #1
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
|
|
Header frame.h is getting more code annotations to help objtool analyze
object files.
Rename the file to objtool.h.
[ jpoimboe: add objtool.h to MAINTAINERS ]
Signed-off-by: Julien Thierry <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
|
|
Extend the vmcb_safe_area with SEV-ES fields and add a new
'struct ghcb' which will be used for guest-hypervisor communication.
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Do not allocate a vmcb_control_area and a vmcb_save_area on the stack,
as these structures will become larger with future extenstions of
SVM and thus the svm_set_nested_state() function will become a too large
stack frame.
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Pick up work happening in parallel to avoid nasty merge conflicts later.
Signed-off-by: Borislav Petkov <[email protected]>
|
|
TSX suspend load tracking instruction is supported by the Intel uarch
Sapphire Rapids. It aims to give a way to choose which memory accesses
do not need to be tracked in the TSX read set. It's availability is
indicated as CPUID.(EAX=7,ECX=0):EDX[bit 16].
Expose TSX Suspend Load Address Tracking feature in KVM CPUID, so KVM
could pass this information to guests and they can make use of this
feature accordingly.
Signed-off-by: Cathy Zhang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu()
as it also checkes if the right lock is held.
Using hlist_for_each_entry_rcu() with a condition argument will not
report the cases where a SRCU protected list is traversed using
rcu_read_lock(). Hence, use hlist_for_each_entry_srcu().
Signed-off-by: Madhuparna Bhowmik <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: <[email protected]>
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <[email protected]>
|
|
Pull kvm fixes from Paolo Bonzini:
- PAE and PKU bugfixes for x86
- selftests fix for new binutils
- MMU notifier fix for arm64
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
KVM: x86: fix access code passed to gva_to_gpa
selftests: kvm: Use a shorter encoding to clear RAX
|
|
The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.
Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.
Cc: <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: James Morse <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe
support for steal-time. However this is unnecessary, as only a KVM
fd is required, and it complicates userspace (userspace may prefer
delaying vcpu creation until after feature probing). Introduce a cap
that can be checked instead. While x86 can already probe steal-time
support with a kvm fd (KVM_GET_SUPPORTED_CPUID), we add the cap there
too for consistency.
Signed-off-by: Andrew Jones <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest")
Cc: Huaitong Han <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The PK bit of the error code is computed dynamically in permission_fault
and therefore need not be passed to gva_to_gpa: only the access bits
(fetch, user, write) need to be passed down.
Not doing so causes a splat in the pku test:
WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
paging64_gva_to_gpa+0x3f/0xb0 [kvm]
kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
ksys_ioctl+0x92/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x3e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace d17eb998aee991da ]---
Reported-by: Sean Christopherson <[email protected]>
Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Pull more KVM updates from Paolo Bonzini:
"PPC:
- Improvements and bugfixes for secure VM support, giving reduced
startup time and memory hotplug support.
- Locking fixes in nested KVM code
- Increase number of guests supported by HV KVM to 4094
- Preliminary POWER10 support
ARM:
- Split the VHE and nVHE hypervisor code bases, build the EL2 code
separately, allowing for the VHE code to now be built with
instrumentation
- Level-based TLB invalidation support
- Restructure of the vcpu register storage to accomodate the NV code
- Pointer Authentication available for guests on nVHE hosts
- Simplification of the system register table parsing
- MMU cleanups and fixes
- A number of post-32bit cleanups and other fixes
MIPS:
- compilation fixes
x86:
- bugfixes
- support for the SERIALIZE instruction"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup
x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled
MIPS: KVM: Convert a fallthrough comment to fallthrough
MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64
x86: Expose SERIALIZE for supported cpuid
KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled
KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort()
KVM: arm64: Don't skip cache maintenance for read-only memslots
KVM: arm64: Handle data and instruction external aborts the same way
KVM: arm64: Rename kvm_vcpu_dabt_isextabt()
KVM: arm: Add trace name for ARM_NISV
KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text
KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS
KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE
KVM: PPC: Book3S HV: Rework secure mem slot dropping
KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up
KVM: PPC: Book3S HV: Migrate hot plugged memory
KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs
KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs
KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
...
|
|
Pull virtio updates from Michael Tsirkin:
- IRQ bypass support for vdpa and IFC
- MLX5 vdpa driver
- Endianness fixes for virtio drivers
- Misc other fixes
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
vdpa/mlx5: fix up endian-ness for mtu
vdpa: Fix pointer math bug in vdpasim_get_config()
vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
vdpa/mlx5: fix memory allocation failure checks
vdpa/mlx5: Fix uninitialised variable in core/mr.c
vdpa_sim: init iommu lock
virtio_config: fix up warnings on parisc
vdpa/mlx5: Add VDPA driver for supported mlx5 devices
vdpa/mlx5: Add shared memory registration code
vdpa/mlx5: Add support library for mlx5 VDPA implementation
vdpa/mlx5: Add hardware descriptive header file
vdpa: Modify get_vq_state() to return error code
net/vdpa: Use struct for set/get vq state
vdpa: remove hard coded virtq num
vdpasim: support batch updating
vhost-vdpa: support IOTLB batching hints
vhost-vdpa: support get/set backend features
vhost: generialize backend features setting/getting
vhost-vdpa: refine ioctl pre-processing
vDPA: dont change vq irq after DRIVER_OK
...
|
|
Based on an analysis of the HyperV firmwares (Gen1 and Gen2) it seems
like the SCONTROL is not being set to the ENABLED state as like we have
thought.
Also from a test done by Vitaly Kuznetsov, running a nested HyperV it
was concluded that the first access to the SCONTROL MSR with a read
resulted with the value of 0x1, aka HV_SYNIC_CONTROL_ENABLE.
It's important to note that this diverges from the value states in the
HyperV TLFS of 0.
Signed-off-by: Jon Doron <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The SERIALIZE instruction is supported by Tntel processors, like
Sapphire Rapids. SERIALIZE is a faster serializing instruction which
does not modify registers, arithmetic flags or memory, will not cause VM
exit. It's availability is indicated by CPUID.(EAX=7,ECX=0):ECX[bit 14].
Expose it in KVM supported CPUID. This way, KVM could pass this
information to guests and they can make use of these features accordingly.
Signed-off-by: Cathy Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is
enabled. A recent change to reload the PDTPRs when CR0.CD or CR0.NW is
toggled botched the EFER.LME handling and sends KVM down the PDTPR path
when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit
mode.
Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths. The
64-bit path is specifically checking state when paging is toggled on,
i.e. CR0.PG transititions from 0->1. The PDPTR path now needs to run if
the new CR0 state has paging enabled, irrespective of whether paging was
already enabled. Trying to shave a few cycles to make the PDPTR path an
"else if" case is a mess.
Fixes: d42e3fae6faed ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes")
Cc: Jim Mattson <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Peter Shier <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Pull KVM updates from Paolo Bonzini:
"s390:
- implement diag318
x86:
- Report last CPU for debugging
- Emulate smaller MAXPHYADDR in the guest than in the host
- .noinstr and tracing fixes from Thomas
- nested SVM page table switching optimization and fixes
Generic:
- Unify shadow MMU cache data structures across architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
KVM: SVM: Fix sev_pin_memory() error handling
KVM: LAPIC: Set the TDCR settable bits
KVM: x86: Specify max TDP level via kvm_configure_mmu()
KVM: x86/mmu: Rename max_page_level to max_huge_page_level
KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
KVM: VMX: Make vmx_load_mmu_pgd() static
KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
KVM: VMX: Drop a duplicate declaration of construct_eptp()
KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
KVM: Using macros instead of magic values
MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
KVM: VMX: Add guest physical address check in EPT violation and misconfig
KVM: VMX: introduce vmx_need_pf_intercept
KVM: x86: update exception bitmap on CPUID changes
KVM: x86: rename update_bp_intercept to update_exception_bitmap
...
|
|
vDPA devices has dedicated backed hardware like
passthrough-ed devices. Then it is possible to setup irq
offloading to vCPU for vDPA devices. Thus this patch tries to
manipulated assigned device counters by
kvm_arch_start/end_assignment() in irqbypass manager, so that
assigned devices could be detected in update_pi_irte()
We will increase/decrease the assigned device counter in kvm/x86.
Both vDPA and VFIO would go through this code path.
Only X86 uses these counters and kvm_arch_start/end_assignment(),
so this code path only affect x86 for now.
Signed-off-by: Zhu Lingshan <[email protected]>
Suggested-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fsgsbase from Thomas Gleixner:
"Support for FSGSBASE. Almost 5 years after the first RFC to support
it, this has been brought into a shape which is maintainable and
actually works.
This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out
there ship rogue kernel modules enabling FSGSBASE behind the kernels
back which opens an instantanious unpriviledged root hole.
The FSGSBASE instructions provide a considerable speedup of the
context switch path and enable user space to write GSBASE without
kernel interaction. This enablement requires careful handling of the
exception entries which go through the paranoid entry path as they
can no longer rely on the assumption that user GSBASE is positive (as
enforced via prctl() on non FSGSBASE enabled systemn).
All other entries (syscalls, interrupts and exceptions) can still just
utilize SWAPGS unconditionally when the entry comes from user space.
Converting these entries to use FSGSBASE has no benefit as SWAPGS is
only marginally slower than WRGSBASE and locating and retrieving the
kernel GSBASE value is not a free operation either. The real benefit
of RD/WRGSBASE is the avoidance of the MSR reads and writes.
The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver"
* tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/fsgsbase: Fix Xen PV support
x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase
selftests/x86/fsgsbase: Add a missing memory constraint
selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test
selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE
selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE
selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
x86/entry/64: Introduce the FIND_PERCPU_BASE macro
x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
x86/process/64: Use FSGSBASE instructions on thread copy and ptrace
x86/process/64: Use FSBSBASE in switch_to() if available
x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 conversion to generic entry code from Thomas Gleixner:
"The conversion of X86 syscall, interrupt and exception entry/exit
handling to the generic code.
Pretty much a straight-forward 1:1 conversion plus the consolidation
of the KVM handling of pending work before entering guest mode"
* tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()
x86/kvm: Use generic xfer to guest work function
x86/entry: Cleanup idtentry_enter/exit
x86/entry: Use generic interrupt entry/exit code
x86/entry: Cleanup idtentry_entry/exit_user
x86/entry: Use generic syscall exit functionality
x86/entry: Use generic syscall entry function
x86/ptrace: Provide pt_regs helper for entry/exit
x86/entry: Move user return notifier out of loop
x86/entry: Consolidate 32/64 bit syscall entry
x86/entry: Consolidate check_user_regs()
x86: Correct noinstr qualifiers
x86/idtentry: Remove stale comment
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull uninitialized_var() macro removal from Kees Cook:
"This is long overdue, and has hidden too many bugs over the years. The
series has several "by hand" fixes, and then a trivial treewide
replacement.
- Clean up non-trivial uses of uninitialized_var()
- Update documentation and checkpatch for uninitialized_var() removal
- Treewide removal of uninitialized_var()"
* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
compiler: Remove uninitialized_var() macro
treewide: Remove uninitialized_var() usage
checkpatch: Remove awareness of uninitialized_var() macro
mm/debug_vm_pgtable: Remove uninitialized_var() usage
f2fs: Eliminate usage of uninitialized_var() macro
media: sur40: Remove uninitialized_var() usage
KVM: PPC: Book3S PR: Remove uninitialized_var() usage
clk: spear: Remove uninitialized_var() usage
clk: st: Remove uninitialized_var() usage
spi: davinci: Remove uninitialized_var() usage
ide: Remove uninitialized_var() usage
rtlwifi: rtl8192cu: Remove uninitialized_var() usage
b43: Remove uninitialized_var() usage
drbd: Remove uninitialized_var() usage
x86/mm/numa: Remove uninitialized_var() usage
docs: deprecated.rst: Add uninitialized_var()
|
|
The sev_pin_memory() function was modified to return error pointers
instead of NULL but there are two problems. The first problem is that
if "npages" is zero then it still returns NULL. Secondly, several of
the callers were not updated to check for error pointers instead of
NULL.
Either one of these issues will lead to an Oops.
Fixes: a8d908b5873c ("KVM: x86: report sev_pin_memory errors with PTR_ERR")
Signed-off-by: Dan Carpenter <[email protected]>
Message-Id: <20200714142351.GA315374@mwanda>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It is a little different between Intel and AMD, Intel's bit 2
is 0 and AMD is reserved. On bare-metal, Intel will refuse to set
APIC_TDCR once bits except 0, 1, 3 are setting, however, AMD will
accept bits 0, 1, 3 and ignore other bits setting as patch does.
Before the patch, we can get back anything what we set to the
APIC_TDCR, this patch improves it.
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
'Commit 8566ac8b8e7c ("KVM: SVM: Implement pause loop exit logic in SVM")'
drops disable pause loop exit/pause filtering capability completely, I
guess it is a merge fault by Radim since disable vmexits capabilities and
pause loop exit for SVM patchsets are merged at the same time. This patch
reintroduces the disable pause loop exit/pause filtering capability support.
Reported-by: Haiwei Li <[email protected]>
Tested-by: Haiwei Li <[email protected]>
Fixes: 8566ac8b ("KVM: SVM: Implement pause loop exit logic in SVM")
Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|